AI Builder Pricing

AI Builder Pricing — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Cumulus (software)

    Cumulus (software)

    Cumulus is digital asset management software designed for client/server systems and developed by Canto Software. The product makes use of metadata for indexing, organizing, and searching. == History == Cumulus was first released as a Macintosh application in 1992, and was named by Apple Computer as the "Most Innovative Product of 1992". Cumulus introduced search capabilities beyond those available in the Macintosh at the time, particularly relating to thumbnails. Cumulus 1.0 was a single-user product with no network capabilities. Among the main features of Cumulus 1.0, the search function automatically generated previews and contained support for the included AppleTalk – Peer-to-Peer – network. Cumulus 2.5 was available in five different languages and received the 1993 MacUser magazine Eddy award for "Best Publishing & Graphics Utility". In 1995, Canto introduced the scanner software "Cirrus" to focus on the development of Cumulus. Cumulus 3, released in 1996, introduced a server version for the first time and made it possible to distribute files over the Internet via the "Web Publisher". Since Apple offered Cumulus 3 with its "Workgroup Server" as a bundle, Cumulus became one of the leading digital asset management systems. Cumulus 4 was the first version that was network-ready, and was available for Macintosh, Windows and UNIX operating systems, allowing for cross-platform file sharing. It was released in 1998; support for Solaris was later discontinued. Cumulus 5 modified the software core to use an open architecture providing an API to external systems and databases. The open architecture of Cumulus 5 also enabled a more functional bridge between Cumulus and the Internet. Cumulus 6 introduced the Embedded Java Plugin (EJP), which allowed system integrators to build custom Java plug-ins in order to extend the functionality of the Cumulus client. 
Cumulus 6.5 marked the end of the Cumulus Single User Edition product, which was licensed to MediaDex for further development and distribution. Cumulus 7 was introduced in the summer of 2006. Cumulus 8 was released in June 2009, with new indexing capabilities taking advantage of multicore/multiprocessor systems, and the ability to manage a wider variety of file formats. Cumulus 8.5 was released in May 2011. Support was added for multilingual metadata, sometimes referred to as "World Metadata." Cumulus Sites was updated to support metadata editing and file uploads. Cumulus 8.6 was released in July 2012, and contains an updated user interface for the administration of Cumulus Sites and additional features for web-based administration of Cumulus. Other additions include features for collaboration links, multi-language support and automated version control. Cumulus 9 was released in September 2013 and introduced a new Web Client User Interface and the Cumulus Video Cloud. The Cumulus Web Client UI was redesigned to provide users with a modern, easy-to-use interface to support and guide the user while addressing modern business needs. The Cumulus Video Cloud extends the Cumulus video handling capabilities to add conversion and global streaming. Cumulus 9 also saw the addition of upload collection links, which allow external collaborators to drag and drop files directly into Cumulus without needing a Cumulus account. Cumulus 9.1 was released in May 2014 and introduced the Adobe Drive Adapter for Cumulus, which allows users to browse and search digital assets in Cumulus directly from Adobe work environments such as Photoshop, InDesign, Illustrator, Premiere and other Adobe applications. Cumulus 10 (Cumulus X) was released in July 2015 and introduced two mobile-friendly products: the Cumulus app and Portals. The Cumulus app on iOS was designed to allow users to collaborate either on an iPhone or iPad. 
Portals is the read-only version of the Cumulus Web Client where users can work with assets that admins allow. Cumulus 10.1 was introduced in January 2016 and included the InDesign Client integration where users can work with Adobe InDesign while accessing their assets from Cumulus. Cumulus 10.2 was introduced in September 2016 and brought the Media Delivery Cloud using Amazon Web Services (AWS). It allows users to manage their media rendition in a single source and distribute media files globally across different channels and devices. Cumulus 10.2.3 was released in February 2017 and came with a "crop and customize photos" feature for Portals and the Web Client. == Product overview == The cataloging of the file via upload into the archive is where Cumulus transfers maximum information about the file from the metadata. For image or photo files, this is typically Exif and IPTC data. The metadata is mainly used to search the archive. The use of embargo data supports license management for copyrighted material. The managed files can be cataloged and their usage can be set. The indexing is based on a predefined taxonomy, which is governed by the internal rules of the organization or by industry standards. You can specify whether files can only be used for specific purposes or only by certain groups of people. The production management system includes version management for files. Via the publication function, the files can be distributed directly via links or e-mails. It's also possible to access from the outside via the Cumulus Portals web interface, which allows a read access to released content from the catalog. There are different variants, starting with the "Workgroup archive server" up to the "Enterprise Business Server" for large companies. Both server and client are extensible through a Java-based plug-in architecture. Since version 7.0, there is a web application based on Ajax with a separate user interface. 
For access to the Cumulus catalog on mobile, there has been an application for Apple devices based on iOS since 2010. == Miscellaneous == In 2015, Cumulus developer Canto established the first Canto digital asset management (DAM) event. The event is held annually in Berlin. The Henry Stewart team has been hosting DAM conferences since 2006.

    Read more →
  • Mark Keane (cognitive scientist)

    Mark Keane (cognitive scientist)

    Mark Thomas Gerard Keane (Irish: Marcus Ó Cathain, born 3 July 1961, Dublin, Ireland) is a cognitive scientist and author of several books on human cognition and artificial intelligence, including Cognitive Psychology: A Student's Handbook (8 editions, with Michael Eysenck), Advances in the Psychology of Thinking (1992, with Ken Gilhooly), Novice Programming Environments (1992/2018, with Marc Eisenstadt and Tim Rajan), Advances in Case-Based Reasoning (1995, with J-P Haton and Michel Manago), and Case-Based Reasoning: Research & Development (2022, with N Wiratunga). == Education == Keane received a B.A. in Psychology from University College Dublin in 1982. He then received a Ph.D. from Trinity College Dublin in 1987. He then moved to postdoctoral positions at Queen Mary University of London and the Open University. == Academic career == He was a Lecturer in Psychology at Cardiff University. He became a lecturer in Computer Science at Trinity College Dublin in 1990, and became a fellow in 1994. Keane moved to become Chair of Computer Science at University College Dublin in 1998. In 2006, he was seconded to Science Foundation Ireland as Director of ICT, overseeing a $700m research investment. He advised the Irish Government on its 3.7B euro Strategy for Science, Technology & Innovation (SSTI). From 2006 to 2007, he was Director General of Science Foundation Ireland before returning to University College Dublin, where he was appointed VP of Innovation & Partnerships (2007-2009). Keane's research has been split between cognitive science and computer science. His cognitive science research has been in analogy, metaphor, conceptual combination and similarity. His computer science research has been in natural language processing, machine learning, case-based reasoning, text analytics and explainable artificial intelligence. He has been a PI in the Science Foundation Ireland funded Insight Centre for Data Analytics, working on digital journalism and digital humanities. 
More recently, he was deputy director of the VistaMilk SFI Research Centre that is exploring precision agriculture in the dairy sector.

    Read more →
  • Best AI Code-review Tools in 2026

    Best AI Code-review Tools in 2026

    Looking for the best AI code-review tool? An AI code-review tool is software that uses machine learning to help you get more done — it can save you hours every week by automating repetitive work. Most options offer a generous free tier, with paid plans unlocking higher limits, faster processing, and team features. Whether you are a beginner or a pro, the right AI code-review tool slots into your workflow and pays for itself fast. This guide breaks down the top picks, their pros and cons, and who each one is best for.

    Read more →
  • Powerset construction

    Powerset construction

    In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) that recognizes the same formal language. It is important in theory because it establishes that NFAs, despite their additional flexibility, are unable to recognize any language that cannot be recognized by some DFA. It is also important in practice for converting easier-to-construct NFAs into more efficiently executable DFAs. However, if the NFA has n states, the resulting DFA may have up to 2^n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs. The construction, sometimes called the Rabin–Scott powerset construction (or subset construction) to distinguish it from similar constructions for other types of automata, was first published by Michael O. Rabin and Dana Scott in 1959. == Intuition == To simulate the operation of a DFA on a given input string, one needs to keep track of a single state at any time: the state that the automaton will reach after seeing a prefix of the input. In contrast, to simulate an NFA, one needs to keep track of a set of states: all of the states that the automaton could reach after seeing the same prefix of the input, according to the nondeterministic choices made by the automaton. If, after a certain prefix of the input, a set S of states can be reached, then after the next input symbol x the set of reachable states is a deterministic function of S and x. Therefore, the sets of reachable NFA states play the same role in the NFA simulation as single DFA states play in the DFA simulation, and in fact the sets of NFA states appearing in this simulation may be re-interpreted as being states of a DFA. == Construction == The powerset construction applies most directly to an NFA that does not allow state transformations without consuming input symbols (aka: "ε-moves"). 
Such an automaton may be defined as a 5-tuple (Q, Σ, T, q0, F), in which Q is the set of states, Σ is the set of input symbols, T is the transition function (mapping a state and an input symbol to a set of states), q0 is the initial state, and F is the set of accepting states. The corresponding DFA has states corresponding to subsets of Q. The initial state of the DFA is {q0}, the (one-element) set of initial states. The transition function of the DFA maps a state S (representing a subset of Q) and an input symbol x to the set T(S,x) = ∪{T(q,x) | q ∈ S}, the set of all states that can be reached by an x-transition from a state in S. A state S of the DFA is an accepting state if and only if at least one member of S is an accepting state of the NFA. In the simplest version of the powerset construction, the set of all states of the DFA is the powerset of Q, the set of all possible subsets of Q. However, many states of the resulting DFA may be useless as they may be unreachable from the initial state. An alternative version of the construction creates only the states that are actually reachable. === NFA with ε-moves === For an NFA with ε-moves (also called an ε-NFA), the construction must be modified to deal with these by computing the ε-closure of states: the set of all states reachable from some given state using only ε-moves. Van Noord recognizes three possible ways of incorporating this closure computation in the powerset construction: Compute the ε-closure of the entire automaton as a preprocessing step, producing an equivalent NFA without ε-moves, then apply the regular powerset construction. This version, also discussed by Hopcroft and Ullman, is straightforward to implement, but impractical for automata with large numbers of ε-moves, as commonly arise in natural language processing application. 
During the powerset computation, compute the ε-closure $\{q' \mid q \to_{\varepsilon}^{*} q'\}$ of each state q that is considered by the algorithm (and cache the result). During the powerset computation, compute the ε-closure $\{q' \mid \exists q \in Q', q \to_{\varepsilon}^{*} q'\}$ of each subset of states Q' that is considered by the algorithm, and add its elements to Q'. === Multiple initial states === If NFAs are defined to allow for multiple initial states, the initial state of the corresponding DFA is the set of all initial states of the NFA, or (if the NFA also has ε-moves) the set of all states reachable from initial states by ε-moves. == Example == The NFA below has four states; state 1 is initial, and states 3 and 4 are accepting. Its alphabet consists of the two symbols 0 and 1, and it has ε-moves. The initial state of the DFA constructed from this NFA is the set of all NFA states that are reachable from state 1 by ε-moves; that is, it is the set {1,2,3}. A transition from {1,2,3} by input symbol 0 must follow either the arrow from state 1 to state 2, or the arrow from state 3 to state 4. Additionally, neither state 2 nor state 4 has outgoing ε-moves. Therefore, T({1,2,3},0) = {2,4}, and by the same reasoning the full DFA constructed from the NFA is as shown below. As can be seen in this example, there are five states reachable from the start state of the DFA; the remaining 11 sets in the powerset of the set of NFA states are not reachable. == Complexity == Because the DFA states consist of sets of NFA states, an n-state NFA may be converted to a DFA with at most 2^n states. For every n, there exist n-state NFAs such that every subset of states is reachable from the initial subset, so that the converted DFA has exactly 2^n states, giving Θ(2^n) worst-case time complexity. 
A simple example requiring nearly this many states is the language of strings over the alphabet {0,1} in which there are at least n characters, the nth from last of which is 1. It can be represented by an (n + 1)-state NFA, but it requires 2^n DFA states, one for each n-character suffix of the input; cf. picture for n=4. == Applications == Brzozowski's algorithm for DFA minimization uses the powerset construction twice. It converts the input DFA into an NFA for the reverse language, by reversing all its arrows and exchanging the roles of initial and accepting states, converts the NFA back into a DFA using the powerset construction, and then repeats the process. Its worst-case complexity is exponential, unlike some other known DFA minimization algorithms, but in many examples it performs more quickly than its worst-case complexity would suggest. Safra's construction, which converts a non-deterministic Büchi automaton with n states into a deterministic Muller automaton or into a deterministic Rabin automaton with 2^O(n log n) states, uses the powerset construction as part of its machinery.
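
The reachable-states version of the construction, including the ε-closure step, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (epsilon_closure, subset_construction, nth_from_last_nfa, accepts) and the dictionary encoding of the transition relation are hypothetical choices made for this example. The helper nth_from_last_nfa builds the (n + 1)-state NFA for the "nth symbol from the end is 1" language described under Complexity, so the exponential state count can be checked directly.

```python
from collections import deque


def epsilon_closure(states, eps):
    """Return every state reachable from `states` using only ε-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)


def subset_construction(alphabet, delta, eps, start, accepting):
    """Build only the DFA states reachable from the initial subset.

    delta maps (state, symbol) -> set of NFA states (assumed encoding),
    eps maps state -> set of NFA states reachable by one ε-move.
    Returns (dfa_states, dfa_delta, dfa_initial, dfa_accepting).
    """
    initial = epsilon_closure({start}, eps)
    dfa_delta = {}
    seen = {initial}
    queue = deque([initial])
    while queue:
        subset = queue.popleft()
        for x in alphabet:
            # Union of T(q, x) over all q in the subset, then ε-closure.
            moved = set()
            for q in subset:
                moved |= delta.get((q, x), set())
            target = epsilon_closure(moved, eps)
            dfa_delta[(subset, x)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A subset accepts if it contains at least one accepting NFA state.
    dfa_accepting = {s for s in seen if s & set(accepting)}
    return seen, dfa_delta, initial, dfa_accepting


def nth_from_last_nfa(n):
    """(n + 1)-state NFA over {0,1}: the nth symbol from the end is 1."""
    delta = {(0, "0"): {0}, (0, "1"): {0, 1}}
    for i in range(1, n):
        delta[(i, "0")] = {i + 1}
        delta[(i, "1")] = {i + 1}
    return delta, {}, 0, {n}


def accepts(dfa, word):
    """Run the constructed DFA on `word`."""
    _, dfa_delta, current, dfa_accepting = dfa
    for symbol in word:
        current = dfa_delta[(current, symbol)]
    return current in dfa_accepting


if __name__ == "__main__":
    for n in (3, 4):
        dfa = subset_construction("01", *nth_from_last_nfa(n))
        print(n, len(dfa[0]))
```

For n = 3 the sketch produces 8 DFA states and for n = 4 it produces 16: every reachable subset contains state 0 together with an arbitrary subset of the remaining n states. Using frozenset for the DFA states keeps them hashable, so they can serve directly as dictionary keys.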

    Read more →
  • Automatic acquisition of sense-tagged corpora

    Automatic acquisition of sense-tagged corpora

    The knowledge acquisition bottleneck is perhaps the major impediment to solving the word-sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words for testing purposes, as it is done in the Senseval exercises. == Existing methods == Therefore, one of the most promising trends in WSD research is using the largest corpus ever accessible, the World Wide Web, to acquire lexical information automatically. WSD has been traditionally understood as an intermediate language engineering technology which could improve applications such as information retrieval (IR). In this case, however, the reverse is also true: Web search engines implement simple and robust IR techniques that can be successfully used when mining the Web for information to be employed in WSD. The most direct way of using the Web (and other corpora) to enhance WSD performance is the automatic acquisition of sense-tagged corpora, the fundamental resource to feed supervised WSD algorithms. Although this is far from being commonplace in the WSD literature, a number of different and effective strategies to achieve this goal have already been proposed. Some of these strategies are: acquisition by direct Web searching (searches for monosemous synonyms, hypernyms, hyponyms, parsed gloss' words, etc.), Yarowsky algorithm (bootstrapping), acquisition via Web directories, and acquisition via cross-language meaning evidences. == Summary == === Optimistic results === The automatic extraction of examples to train supervised learning algorithms reviewed has been, by far, the best explored approach to mine the web for word-sense disambiguation. 
Some results are certainly encouraging: In some experiments, the quality of the Web data for WSD equals that of human-tagged examples. This is the case of the monosemous relatives plus bootstrapping with Semcor seeds technique and the examples taken from the ODP Web directories. In the first case, however, Semcor-size example seeds are necessary (and only available for English), and it has only been tested with a very limited set of nouns; in the second case, the coverage is quite limited, and it is not yet clear whether it can be grown without compromising the quality of the examples retrieved. It has been shown that a mainstream supervised learning technique trained exclusively with web data can obtain better results than all unsupervised WSD systems which participated at Senseval-2. Web examples made a significant contribution to the best Senseval-2 English all-words system. === Difficulties === There are, however, several open research issues related to the use of Web examples in WSD: High precision in the retrieved examples (i.e., correct sense assignments for the examples) does not necessarily lead to good supervised WSD results (i.e., the examples are possibly not useful for training). The most complete evaluation of Web examples for supervised WSD indicates that learning with Web data improves over unsupervised techniques, but the results are nevertheless far from those obtained with hand-tagged data, and do not even beat the most-frequent-sense baseline. Results are not always reproducible; the same or similar techniques may lead to different results in different experiments. Compare, for instance, Mihalcea (2002) with Agirre and Martínez (2004), or Agirre and Martínez (2000) with Mihalcea and Moldovan (1999). 
Results with Web data seem to be very sensitive to small differences in the learning algorithm, to when the corpus was extracted (search engines change continuously), and on small heuristic issues (e.g., differences in filters to discard part of the retrieved examples). Results are strongly dependent on bias (i.e., on the relative frequencies of examples per word sense). It is unclear whether this is simply a problem of Web data, or an intrinsic problem of supervised learning techniques, or just a problem of how WSD systems are evaluated (indeed, testing with rather small Senseval data may overemphasize sense distributions compared to sense distributions obtained from the full Web as corpus). In any case, Web data has an intrinsic bias, because queries to search engines directly constrain the context of the examples retrieved. There are approaches that alleviate this problem, such as using several different seeds/queries per sense or assigning senses to Web directories and then scanning directories for examples; but this problem is nevertheless far from being solved. Once a Web corpus of examples is built, it is not entirely clear whether its distribution is safe from a legal perspective. === Future === Besides automatic acquisition of examples from the Web, there are some other WSD experiments that have profited from the Web: The Web as a social network has been successfully used for cooperative annotation of a corpus (OMWE, Open Mind Word Expert project), which has already been used in three Senseval-3 tasks (English, Romanian and Multilingual). The Web has been used to enrich WordNet senses with domain information: topic signatures and Web directories, which have in turn been successfully used for WSD. Also, some research benefited from the semantic information that the Wikipedia maintains on its disambiguation pages. It is clear, however, that most research opportunities remain largely unexplored. 
For instance, little is known about how to use lexical information extracted from the Web in knowledge-based WSD systems; and it is also hard to find systems that use Web-mined parallel corpora for WSD, even though there are already efficient algorithms that use parallel corpora in WSD.

    Read more →
  • Ranking SVM

    Ranking SVM

    In machine learning, a ranking SVM is a variant of the support vector machine algorithm, which is used to solve certain ranking problems (via learning to rank). The ranking SVM algorithm was published by Thorsten Joachims in 2002. The original purpose of the algorithm was to improve the performance of an internet search engine. However, it was found that ranking SVM also can be used to solve other problems such as Rank SIFT. == Description == The ranking SVM algorithm is a learning retrieval function that employs pairwise ranking methods to adaptively sort results based on how 'relevant' they are for a specific query. The ranking SVM function uses a mapping function to describe the match between a search query and the features of each of the possible results. This mapping function projects each data pair (such as a search query and clicked web-page, for example) onto a feature space. These features are combined with the corresponding click-through data (which can act as a proxy for how relevant a page is for a specific query) and can then be used as the training data for the ranking SVM algorithm. Generally, ranking SVM includes three steps in the training period: (1) It maps the similarities between queries and the clicked pages onto a certain feature space. (2) It calculates the distances between any two of the vectors obtained in step 1. (3) It forms an optimization problem which is similar to a standard SVM classification and solves this problem with the regular SVM solver. == Background == === Ranking method === Suppose $\mathbb{C}$ is a data set containing $N$ elements $c_i$. $r$ is a ranking method applied to $\mathbb{C}$. Then the $r$ in $\mathbb{C}$ can be represented as an $N \times N$ binary matrix. If the rank of $c_i$ is higher than the rank of $c_j$, i.e. $r\ c_i < r\ c_j$ …

    Read more →

  • Wolfgang Ketter

    Wolfgang Ketter

    Wolfgang Ketter (born Traben-Trarbach, Germany, 1972) is Chaired Professor of Information Systems for a Sustainable Society at the University of Cologne and a prominent scientist in the application of artificial intelligence, machine learning and intelligent agents in the design of smart markets, including demand response mechanisms and in particular automated auctions. He is a co-founder of the open energy system platform Power TAC, an automated retail electricity trading platform that simulates the performance of retail markets in an increasingly prosumer- and renewable-energy-influenced electricity landscape. == Career == === Advisory roles === Ketter is an advisor on the energy transition to the German government, in particular to the energy-intensive German state of North Rhine-Westphalia. He is also a fellow of the World Economic Forum and member of the WEF Global Council on Future Mobility and the Global New Mobility Coalition, contributing on the use of AI and machine learning to address issues arising from growth in electrification of energy such as the use of batteries as virtual power plants, the management of electric vehicle charging to prevent grid congestion, or the potential for peer-to-peer electricity trading. Ketter has also been an advisor for over a decade to the Port of Rotterdam on the design of energy cooperatives and energy trading platforms, as well as to one of the largest auction companies in the world, Royal FloraHolland, where his initial research led to a redesign of auction mechanisms and decision support systems. The cumulative research project team received the Association for Information Systems Impact Award in 2020. === Research === Ketter’s research is multidisciplinary, addressing the overlap of AI and ML in the economics of retail energy and mobility markets. 
The industry and policy applications of his research interconnect in large-scale projects such as the EU Smart city development project Ruggedised, for which the Erasmus University-based team's publication on the optimization of the City of Rotterdam's electric transit bus network was recognized with the Institute for Operations Research and the Management Sciences Daniel H. Wagner runner-up award. His research focuses on the use of competitive benchmarking and intelligent agents in virtual world simulations of retail energy markets as part of a smart grid. A small-scale version of the Power TAC project led to a publication on demand side management, 'A simulation of household behavior under variable prices' that has several hundred citations in publications representing a variety of scientific disciplines. Two of his publications in the Management Information Systems Quarterly journal and one in Energy Economics form the foundation for the current Power TAC platform. In 2016 and 2019 he was Chair of the Workshop on Information Technologies and Systems. Ketter is Coordinator of the Key Research Initiative Sustainable Smart Energy & Mobility at the University of Cologne, where he is a chaired Professor of Information Systems for a Sustainable Society. At the Rotterdam School of Management, Erasmus University, he is Professor of Next Generation Information Systems as well as Director of the Erasmus Centre for Future Energy Business and Academic Director of Smart Cities and Smart Energy at the Erasmus Centre of Data Analytics. He has been a visiting professor at the Haas School of Business and Berkeley Institute of Data Science, University of California at Berkeley in 2016 to 2017.

    Read more →
  • Static program analysis

    Static program analysis

    In computer science, static program analysis (also known as static analysis or static simulation) is the analysis of computer programs performed without executing them, in contrast with dynamic program analysis, which is performed on programs during their execution in the integrated environment. The term is usually applied to analysis performed by an automated tool, with human analysis typically being called "program understanding", program comprehension, or code review. In the last of these, software inspection and software walkthroughs are also used. In most cases the analysis is performed on some version of a program's source code, and, in other cases, on some form of its object code. Two leading approaches to resource certification have been Static Analysis (SA) and Implicit Computational Complexity (ICC). SA is algorithmic in nature: it focuses on a broad programming language of choice, and seeks to determine by syntactic means whether given programs in that language are feasible. In contrast, ICC attempts to create from the outset specialized programming languages or methods that delineate a complexity class. Thus, SA's focus is on compile time, making no demand on the programmer, whereas ICC is a language-design discipline. The discipline of static analysis should not be confused with linting, which is the process of checking for coding style mistakes. == Rationale == The sophistication of the analysis performed by tools varies from those that only consider the behaviour of individual statements and declarations, to those that include the complete source code of a program in their analysis. The uses of the information obtained from the analysis vary from highlighting possible coding errors (e.g., the lint tool) to formal methods that mathematically prove properties about a given program (e.g., its behaviour matches that of its specification). Software metrics and reverse engineering can be described as forms of static analysis. 
Deriving software metrics and static analysis are increasingly deployed together, especially in creation of embedded systems, by defining so-called software quality objectives. A growing commercial use of static analysis is in the verification of properties of software used in safety-critical computer systems and locating potentially vulnerable code. For example, the following industries have identified the use of static code analysis as a means of improving the quality of increasingly sophisticated and complex software: Medical software: The US Food and Drug Administration (FDA) has identified the use of static analysis for medical devices. Nuclear software: In the UK the Office for Nuclear Regulation (ONR) recommends the use of static analysis on reactor protection systems. Aviation software (in combination with dynamic analysis). Automotive & Machines (functional safety features form an integral part of each automotive product development phase, ISO 26262, section 8). A study in 2012 by VDC Research reported that 28.7% of the embedded software engineers surveyed use static analysis tools and 39.7% expect to use them within 2 years. A study from 2010 found that 60% of the interviewed developers in European research projects made at least use of their basic IDE built-in static analyzers. However, only about 10% employed an additional other (and perhaps more advanced) analysis tool. In the application security industry the name static application security testing (SAST) is also used. SAST is an important part of Security Development Lifecycles (SDLs) such as the SDL defined by Microsoft and a common practice in software companies. == Tool types == The OMG (Object Management Group) published a study regarding the types of software analysis required for software quality measurement and assessment. This document on "How to Deliver Resilient, Secure, Efficient, and Easily Changed IT Systems in Line with CISQ Recommendations" describes three levels of software analysis. 
Unit Level Analysis that takes place within a specific program or subroutine, without connecting to the context of that program. Technology Level Analysis that takes into account interactions between unit programs to get a more holistic and semantic view of the overall program in order to find issues and avoid obvious false positives. System Level Analysis that takes into account the interactions between unit programs, but without being limited to one specific technology or programming language. A further level of software analysis can be defined. Mission/Business Level Analysis that takes into account the business/mission layer terms, rules and processes that are implemented within the software system for its operation as part of enterprise or program/mission layer activities. These elements are implemented without being limited to one specific technology or programming language and in many cases are distributed across multiple languages, but are statically extracted and analyzed for system understanding for mission assurance. == Formal methods == Formal methods is the term applied to the analysis of software (and computer hardware) whose results are obtained purely through the use of rigorous mathematical methods. The mathematical techniques used include denotational semantics, axiomatic semantics, operational semantics, and abstract interpretation. By a straightforward reduction to the halting problem, it is possible to prove that (for any Turing complete language), finding all possible run-time errors in an arbitrary program (or more generally any kind of violation of a specification on the final result of a program) is undecidable: there is no mechanical method that can always answer truthfully whether an arbitrary program may or may not exhibit runtime errors. This result dates from the works of Church, Gödel and Turing in the 1930s (see: Halting problem and Rice's theorem). 
As with many undecidable questions, one can still attempt to give useful approximate solutions. Some of the implementation techniques of formal static analysis include: Abstract interpretation, to model the effect that every statement has on the state of an abstract machine (i.e., it 'executes' the software based on the mathematical properties of each statement and declaration). This abstract machine over-approximates the behaviours of the system: the abstract system is thus made simpler to analyze, at the expense of incompleteness (not every property true of the original system is true of the abstract system). If properly done, though, abstract interpretation is sound (every property true of the abstract system can be mapped to a true property of the original system). Data-flow analysis, a lattice-based technique for gathering information about the possible set of values; Hoare logic, a formal system with a set of logical rules for reasoning rigorously about the correctness of computer programs. There is tool support for some programming languages (e.g., the SPARK programming language (a subset of Ada) and the Java Modeling Language—JML—using ESC/Java and ESC/Java2, Frama-C WP (weakest precondition) plugin for the C language extended with ACSL (ANSI/ISO C Specification Language) ). Model checking, considers systems that have finite state or may be reduced to finite state by abstraction; Symbolic execution, as used to derive mathematical expressions representing the value of mutated variables at particular points in the code. Nullable reference analysis == Data-driven static analysis == Data-driven static analysis leverages extensive codebases to infer coding rules and improve the accuracy of the analysis. For instance, one can use all Java open-source packages available on GitHub to learn good analysis strategies. The rule inference can use machine learning techniques. It is also possible to learn from a large amount of past fixes and warnings. 
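The abstract interpretation technique described above can be sketched with an interval domain, where each variable is tracked as a range rather than a concrete value. This is a minimal illustration only: the `Interval` class and the `may_divide_by_zero` helper are hypothetical names chosen for this example and are not taken from any particular analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Abstract value: every concrete value is assumed to lie in [lo, hi]."""
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        # The extreme products of the bounds enclose every concrete product.
        products = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval(min(products), max(products))

    def join(self, other: "Interval") -> "Interval":
        """Least upper bound, used to merge the states of two branches."""
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def contains(self, value: float) -> bool:
        return self.lo <= value <= self.hi


def may_divide_by_zero(divisor: Interval) -> bool:
    """Reports True whenever zero is a possible value of the divisor."""
    return divisor.contains(0)


# Abstractly "execute":  x in [1, 5];  y = x - 3;  z = x * x + 1
x = Interval(1, 5)
y = x - Interval(3, 3)
z = x * x + Interval(1, 1)
```

Here `may_divide_by_zero(y)` raises a warning while `may_divide_by_zero(z)` does not. The over-approximation is visible in `x - x`, which evaluates to the interval [-4, 4] even though the concrete result is always 0: the abstract system is simpler to analyze but loses precision.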
== Remediation == Static analyzers produce warnings. For certain types of warnings, it is possible to design and implement automated remediation techniques. For example, Logozzo and Ball have proposed automated remediations for C# cccheck.

  • Project Bergamot

    Project Bergamot is a joint project between several European universities and Mozilla for the development of machine translation software based on artificial neural networks, which is intended for local execution on end-user devices. The software library that was created and the associated language models were made available to the general public as Free Software. Execution requires a x86 CPU with SSE4.1 instruction set extensions. In 2022, Devin Coldewey of TechCrunch judged the translation quality to be "more than adequate", but considered Firefox Translations to be not yet fully mature. == Usage == Mozilla used the Bergamot Translator to expand its web browser Firefox with a feature for translating web pages, which was previously considered an important gap in Firefox' feature set. It is often compared to the much older corresponding feature in Google Chrome, which utilizes a cloud-based background service. In contrast, Firefox Translations does not require any data to leave the user's computer, resulting in advantages in terms of data protection, availability and possibly response times. There is just the installation of a new language model that needs to take place the first time a new language is encountered. Greater independence from large technology companies and their interests is also mentioned as an important advantage. Mozilla thus strengthened its position as an alternative software vendor with a particular focus on data protection and security. Mozilla followed up with the similar feature of speech recognition for spoken user input, based on whisperfile. On the other hand, slow translation times have been observed, especially on older devices. Also, Firefox Translations initially supported far fewer language pairs than other major translation services and is only gradually adding new models. On that matter, the training pipeline is also made available to interested parties to enable the creation of missing language models. 
TranslateLocally is a Firefox-independent translation software based on the Bergamot Translator. It is also available as an (Electron-based) standalone application or as an extension for Chromium-based web browsers. == History == Mozilla had already tried to get a (cloud-based) web content translation feature into Firefox a few years before Project Bergamot, but had failed because of the financial challenge. Microsoft had already delivered offline capabilities for its translation software in 2018. Google soon followed suit, Apple two years later. The software is based on the free translation framework Marian, which the University of Edinburgh had previously developed in cooperation with Microsoft, and is itself based on the Nematus toolkit that was presented in 2017. Under the leadership of the University of Edinburgh, a development consortium was formed with the Mozilla Corporation and the additional European universities of Prague, Sheffield and Tartu. In 2018, it was able to get 3 million euros of funding from the EU's Horizon 2020 programme. Firefox Translations was initially provided as an add-on. A first functional demonstration prototype was presented in October 2019. Beta version 117 had the feature integrated directly into the browser, the official release was in version 118 from September 2023. Both the add-on module and as part of Firefox, the code and the models are subject to the version 2 of the Mozilla Public License. Since 2022, the EU-funded HPLT project creates new language models. It involves additional partners, including the universities of Helsinki, Turku, Oslo and other partners from Spain, Norway and the Czech Republic.

  • How to Choose an AI Writing Assistant

    Comparing AI writing assistants? An AI writing assistant is software that uses machine learning to help you draft and revise text, lowering the barrier to producing professional output. Privacy matters too: check whether your data trains the model and whether a no-log or enterprise tier is available. Whether you are a beginner or a pro, the right AI writing assistant should slot into your existing workflow. We tested the leading options and ranked them by quality, value, and ease of use.

  • Sophia Ananiadou

    Sophia Ananiadou is a Greek-British computer scientist and computational linguist. She led the development of and directs the National Centre for Text Mining (NaCTeM) in the United Kingdom. She is also Professor in Computer Science in the Department of Computer Science at the University of Manchester. Her research focusses on biomedical text mining and natural language processing and has fed into the development of numerous applications that, for example, facilitate the discovery of new knowledge, enable exploration of historical archives, allow semantic search of biomedical literature, reduce human effort in screening search hits for production of systematic reviews, enable enrichment of metabolic pathway models with evidence from the literature, allow discovery of risk in the construction industry from health and safety incident reports and enable interoperability of components in text mining workflows. == Education == Ananiadou was educated at the Lycée français St Joseph in Athens, Greece (1969–1975). She received a Bachelor of Arts (Ptychion) from the University of Athens (1979), a Master of Advanced Studies (DEA) in Linguistics from Paris VII, Jussieu, France (1980), a DEA in Literature from Paris IV, Sorbonne, France (1984) and a PhD in Computational linguistics from the University of Manchester Institute of Science and Technology (UMIST), in 1988. == Career and research == Ananiadou was a research assistant at Dalle Molle Institute for Semantic and Cognitive Studies (ISSCO, 1983–1984), a research assistant (1985–1988) then research associate (1988–1993) in the department of language engineering at UMIST, senior lecturer at Manchester Metropolitan University (1993–1999), senior lecturer then reader in the School of Computing Science and Engineering, University of Salford (2000–2005), then reader in the School of Computer Science, University of Manchester (2005–2009). 
Since 2009, she has served as professor in computer science in the Department of Computer Science at the University of Manchester. In July 2025, she became deputy director of the Christabel Pankhurst Institute for health technology research and innovation, University of Manchester. From 2018–2026, she served as the deputy director of the Institute for Data Science and Artificial Intelligence, University of Manchester. She is a senior lead researcher of the ARCHIMEDES research unit of the Athena Research Centre, Greece. ARCHIMEDES is a research and innovation hub fostering international collaboration and knowledge exchange on Artificial Intelligence and Data Science. On February 7, 2025, she was appointed a member of the Artificial Intelligence Sectoral Scientific Council of the Greek Ministry of Development (announcement of appointment in Greek). She is also a Visiting Distinguished Research Fellow in the Knowledge and Information Research Team at the Artificial Intelligence Research Center (AIRC), Japan, which is a research unit of the Japanese National Institute of Advanced Industrial Science and Technology (AIST). In addition, she was appointed to the honorary position of Adjunct Professor of Wuhan University, People's Republic of China, for the period October 2025 to October 2028, collaborating with the School of Artificial Intelligence. Ananiadou has published since 1986, has an h-index of 81 and a Research.com United Kingdom ranking in Computer Science of 104. She is also ranked number 1 internationally in text mining by ScholarGPS. In addition, she is included in the Stanford/Elsevier Top 2% Scientist Rankings for 2025. Ananiadou received a Diplôme de traducteur (Diploma of Translator) from the Institut français d'Athènes, Greece (1979) and a Certificate in Counselling from the University of Salford, UK (2004). 
=== Awards and honours === In 2019, in recognition of her contributions in Artificial Intelligence and text mining for Biomedicine, Ananiadou received an honorary doctorate from the University of the Aegean, on the 20th anniversary of its Department of Mediterranean Studies, Rhodes. Ananiadou received the Unstructured Information Management Architecture (UIMA) innovation award from IBM three years running (2006, 2007 & 2008). She was awarded the Daiwa Adrian Prize in 2004 and also received a Japan Trust award from the Ministry of Education, Japan in 1997. Ananiadou was a Turing Fellow of the Alan Turing Institute in London from 2018 to 2023. Since 2021, she is a member and, since 2024, a Fellow, of the ELLIS Society, the professional society of the cross-national European Laboratory for Learning and Intelligent Systems. Ananiadou served as vice president (VP) of the European Association for Terminology from 1997 to 1999. At the 28th International Conference on Computational Linguistics (COLING 2020), she received, with M. Li and H. Takamura, an Outstanding Paper designation for the paper "A Neural Model for Aggregating Coreference Annotation in Crowdsourcing".

  • Retrieval-augmented generation

    Retrieval-augmented generation (RAG) is a technique that enables large language models (LLMs) to retrieve and incorporate new information from external data sources. With RAG, LLMs first refer to a specified set of documents, then respond to user queries. These documents supplement information from the LLM's pre-existing training data. This allows LLMs to use domain-specific and/or updated information that is not available in the training data. For example, this enables LLM-based chatbots to access internal company data or generate responses based on authoritative sources. RAG improves LLMs by incorporating information retrieval before generating responses. Unlike LLMs that rely on static training data, RAG pulls relevant text from databases, uploaded documents, or web sources. According to Ars Technica, "RAG is a way of improving LLM performance, in essence by blending the LLM process with a web search or other document look-up process to help LLMs stick to the facts." This method helps reduce AI hallucinations, which have caused chatbots to describe policies that don't exist, or recommend nonexistent legal cases to lawyers that are looking for citations to support their arguments. RAG also reduces the need to retrain LLMs with new data, saving on computational and financial costs. Beyond efficiency gains, RAG also allows LLMs to include sources in their responses, so users can verify the cited sources. This provides greater transparency, as users can cross-check retrieved content to ensure accuracy and relevance. The term retrieval-augmented generation (RAG) was introduced in a 2020 paper that described combining a parametric language model with a non-parametric external memory accessed through retrieval at inference time. == RAG and LLM limitations == LLMs can provide incorrect information. 
For example, when Google first demonstrated its LLM tool "Google Bard" (later re-branded to Gemini), the LLM provided incorrect information about the James Webb Space Telescope. This error contributed to a $100 billion decline in Google's stock value. RAG is used to prevent these errors, but it does not solve all the problems. For example, LLMs can generate misinformation even when pulling from factually correct sources if they misinterpret the context. MIT Technology Review gives the example of an AI-generated response stating, "The United States has had one Muslim president, Barack Hussein Obama." The model retrieved this from an academic book rhetorically titled Barack Hussein Obama: America's First Muslim President? The LLM did not "know" or "understand" the context of the title, generating a false statement. LLMs with RAG are programmed to prioritize new information. This technique has been called "prompt stuffing." Without prompt stuffing, the LLM's input is generated by a user; with prompt stuffing, additional relevant context is added to this input to guide the model's response. This approach provides the LLM with key information early in the prompt, encouraging it to prioritize the supplied data over pre-existing training knowledge. == Process == Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating an information-retrieval mechanism that allows models to access and utilize additional data beyond their original training set. Ars Technica notes that "when new information becomes available, rather than having to retrain the model, all that's needed is to augment the model's external knowledge base with the updated information" ("augmentation"). IBM states that "in the generative phase, the LLM draws from the augmented prompt and its internal representation of its training data to synthesize" an answer. 
=== RAG key stages === Typically, the data to be referenced is converted into LLM embeddings, numerical representations in the form of a large vector space. RAG can be used on unstructured (usually text), semi-structured, or structured data (for example knowledge graphs). These embeddings are then stored in a vector database to allow for document retrieval. Given a user query, a document retriever is first called to select the most relevant documents that will be used to augment the query. This comparison can be done using a variety of methods, which depend in part on the type of indexing used. The model feeds this relevant retrieved information into the LLM via prompt engineering of the user's original query. Newer implementations (as of 2023) can also incorporate specific augmentation modules with abilities such as expanding queries into multiple domains and using memory and self-improvement to learn from previous retrievals. Finally, the LLM can generate output based on both the query and the retrieved documents. Some models incorporate extra steps to improve output, such as the re-ranking of retrieved information, context selection, and fine-tuning. == Applications == Retrieval-augmented generation is used in applications where generated responses need to be grounded in external or frequently updated information. Commonly cited use cases include search engines, question-answering systems, customer support chatbots, enterprise knowledge assistants, content generation, recommendation systems, retail and e-commerce, and industrial or manufacturing workflows. In healthcare, RAG has been studied as a way to ground large language model outputs in external medical knowledge sources, although reviews have noted continuing challenges around evaluation, ethics, and clinical reliability. == Improvements == Improvements to the basic process above can be applied at different stages in the RAG flow. 
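The key stages described above (embed, retrieve, augment the prompt) can be sketched in a few lines. This is a toy illustration under loud assumptions: `embed` uses bag-of-words counts in place of a learned embedding model, an in-memory list stands in for a vector database, and the function names and prompt wording are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would use a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Augment the user's query with the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


docs = [
    "refunds are issued within 14 days of purchase",
    "our office is closed on public holidays",
    "shipping takes three to five business days",
]
query = "how many days until refunds are issued"
prompt = build_prompt(query, retrieve(query, docs, k=2))
```

The resulting `prompt` string would then be passed to the LLM for the generation stage, which is left out here because it depends on the model API in use.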
=== Encoder === These methods focus on the encoding of text as either dense or sparse vectors. Sparse vectors, which encode the identity of a word, are typically dictionary-length and contain mostly zeros. Dense vectors, which encode meaning, are more compact and contain fewer zeros. Various enhancements can improve the way similarities are calculated in the vector stores (databases). Performance improves by optimizing how vector similarities are calculated. Dot products enhance similarity scoring, while approximate nearest neighbor (ANN) searches improve retrieval efficiency over K-nearest neighbors (KNN) searches. Accuracy may be improved with Late Interactions, which allow the system to compare words more precisely after retrieval. This helps refine document ranking and improve search relevance. Hybrid vector approaches may be used to combine dense vector representations with sparse one-hot vectors, taking advantage of the computational efficiency of sparse dot products over dense vector operations. Other retrieval techniques focus on improving accuracy by refining how documents are selected. Some retrieval methods combine sparse representations, such as SPLADE, with query expansion strategies to improve search accuracy and recall. === Retriever-centric methods === These methods aim to enhance the quality of document retrieval in vector databases: Pre-training the retriever using the Inverse Cloze Task (ICT), a technique that helps the model learn retrieval patterns by predicting masked text within documents. Supervised retriever optimization aligns retrieval probabilities with the generator model's likelihood distribution. This involves retrieving the top-k vectors for a given prompt, scoring the generated response's perplexity, and minimizing KL divergence between the retriever's selections and the model's likelihoods to refine retrieval. 
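The supervised retriever optimization described above can be illustrated numerically. The sketch below is not a training loop: it only computes the KL divergence between a retriever's distribution over the top-k documents and a distribution derived from the generator's likelihoods. The scores are made-up numbers, and the direction of the divergence (retriever relative to language model) is an assumption for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same documents."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical values for k = 3 retrieved documents.
retriever_scores = [2.0, 1.0, 0.5]       # similarity of each document to the prompt
lm_log_likelihoods = [-1.2, -0.4, -3.0]  # generator's log-likelihood of the answer given each document

retriever_dist = softmax(retriever_scores)
lm_dist = softmax(lm_log_likelihoods)

# The quantity a training procedure would try to minimize.
loss = kl_divergence(retriever_dist, lm_dist)
```

In this example the retriever prefers the first document while the generator benefits most from the second, so the loss is positive; updating the retriever to reduce it would shift retrieval probability toward the documents the generator finds useful.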
    Reranking techniques can refine retriever performance by prioritizing the most relevant retrieved documents during training. === Language model === By redesigning the language model with the retriever in mind, a network 25 times smaller can achieve perplexity comparable to that of its much larger counterparts. Because it is trained from scratch, this method (Retro) incurs the high cost of training runs that the original RAG scheme avoided. The hypothesis is that, by being given domain knowledge during training, Retro needs to focus less on the domain and can devote its smaller weight resources only to language semantics. It has been reported that Retro is not reproducible, so modifications were made to make it so. The more reproducible version is called Retro++ and includes in-context RAG. === Chunking === Chunking involves various strategies for breaking up the data into vectors so the retriever can find details in it. Three types of chunking strategies are: Fixed length with overlap. This is fast and easy. Overlapping consecutive chunks helps to maintain semantic context across chunks. Syntax-based chunking, which breaks the document up into sentences. Libraries such as spaCy or NLTK can help with this. File format-based chunking. Certain file types have natural chunks built in, and it is best to respect them. For example, code files are best chunked and vectorized as whole functions or classes. HTML files should leave

    or base64 encoded elements
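The first chunking strategy listed above, fixed length with overlap, can be sketched as follows. The function name `chunk_fixed` and the default sizes are assumptions for illustration, and the sketch counts characters; a production pipeline would more likely count tokens.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` characters, each sharing
    `overlap` characters with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


example = chunk_fixed("abcdefghij", size=4, overlap=2)
```

With these arguments, `example` is `["abcd", "cdef", "efgh", "ghij"]`: each chunk repeats the last two characters of its predecessor, which is how overlap preserves context across chunk boundaries.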

  • Ross Quinlan

    John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical C4.5 and ID3 algorithms. He also contributed to early ILP literature with First Order Inductive Learner (FOIL). He is currently running the company RuleQuest Research which he founded in 1997. == Education == He received his BSc degree in Physics and Computing from the University of Sydney in 1965 and his computer science doctorate at the University of Washington in 1968. He has held positions at the University of New South Wales, University of Sydney, University of Technology Sydney, and RAND Corporation. == Artificial intelligence == Quinlan is a specialist in artificial intelligence, particularly in the aspect involving machine learning and its application to data mining. He is a Founding Fellow of the Association for the Advancement of Artificial Intelligence. === ID3 === Ross Quinlan invented the Iterative Dichotomiser 3 (ID3) algorithm which is used to generate decision trees. ID3 follows the principle of Occam's razor in attempting to create the smallest decision tree possible. === C4.5 === He then expanded upon the principles used in ID3 to create C4.5. C4.5 improved: discrete and continuous attributes, missing attribute values, attributes with differing costs, pruning trees (replacing irrelevant branches with leaf nodes). === C5.0 === C5.0, which Quinlan is commercially selling (single-threaded version is distributed under the terms of the GNU General Public License), is an improvement on C4.5. The advantages are speed (several orders of magnitude faster), memory efficiency, smaller decision trees, boosting (more accuracy), ability to weight different attributes, and winnowing (reducing noise). == Selected works == === Books === 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers. ISBN 1-55860-238-0. === Articles === Quinlan, J. R. 
    (1982). Semi-autonomous acquisition of pattern-based knowledge. In J. E. Hayes, D. Michie, and Y.-H. Pao (Eds.), Machine Intelligence 10. Ellis Horwood, Chichester. Quinlan, J. R. (1985). Decision trees and multi-valued attributes. In J. E. Hayes & D. Michie (Eds.), Machine Intelligence 11. Oxford University Press. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81-106. Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5:239-266. 2008 (with Qiang Yang, Philip S. Yu, Zhou Zhihua, David Hand et al.). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1-37.
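ID3, described above, grows a decision tree by repeatedly choosing the attribute that best separates the classes, commonly measured by information gain. The sketch below shows only that attribute-selection step on a made-up four-row dataset; the function names and the data are hypothetical, and it is not Quinlan's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """The attribute a greedy tree builder would split on first."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Made-up toy data: "outlook" fully determines "play", "windy" is irrelevant.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
]
choice = best_attribute(rows, ["outlook", "windy"], "play")
```

On this data the gain is 1 bit for `outlook` and 0 bits for `windy`, so `choice` is `"outlook"`. Preferring high-gain splits tends to produce small trees, in line with the Occam's razor principle mentioned above.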

  • Multiple sequence alignment

    Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations (single amino acid or nucleotide changes), insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides. Multiple sequence alignments require more sophisticated methodologies than pairwise alignments, as they are more computationally complex. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive. However, heuristic methods generally cannot guarantee high-quality solutions and have been shown to fail to yield near-optimal solutions on benchmark test cases. 
    == Problem statement == Given $m$ sequences $S_i$, $i = 1, \cdots, m$, similar to the form below:

    $$S := \begin{cases} S_1 = (S_{11}, S_{12}, \ldots, S_{1n_1}) \\ S_2 = (S_{21}, S_{22}, \ldots, S_{2n_2}) \\ \qquad\vdots \\ S_m = (S_{m1}, S_{m2}, \ldots, S_{mn_m}) \end{cases}$$

    A multiple sequence alignment is taken of this set of sequences $S$ by inserting any number of gaps needed into each of the $S_i$ sequences of $S$ until the modified sequences, $S'_i$, all conform to length $L \geq \max\{n_i \mid i = 1, \ldots, m\}$ and no column of the modified sequences consists only of gaps. The mathematical form of an MSA of the above sequence set is shown below:

    $$S' := \begin{cases} S'_1 = (S'_{11}, S'_{12}, \ldots, S'_{1L}) \\ S'_2 = (S'_{21}, S'_{22}, \ldots, S'_{2L}) \\ \qquad\vdots \\ S'_m = (S'_{m1}, S'_{m2}, \ldots, S'_{mL}) \end{cases}$$

    To return from each particular sequence $S'_i$ to $S_i$, remove all gaps. == Graphing approach == A general approach when calculating multiple sequence alignments is to use graphs to identify all of the different alignments. When finding alignments via graph, a complete alignment is created in a weighted graph that contains a set of vertices and a set of edges. Each of the graph edges has a weight based on a certain heuristic that helps to score each alignment or subset of the original graph.
=== Tracing alignments === When determining the best suited alignments for each MSA, a trace is usually generated. A trace is a set of realized, or corresponding and aligned, vertices that has a specific weight based on the edges that are selected between corresponding vertices. When choosing traces for a set of sequences it is necessary to choose a trace with a maximum weight to get the best alignment of the sequences. == Alignment methods == There are various alignment methods used within multiple sequence to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences. === Dynamic programming === A direct method for producing an MSA uses the dynamic programming technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a gap penalty and a substitution matrix assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment. For n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment. The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length. 
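The dynamic programming method described above is easiest to see in the pairwise case, where the table is two-dimensional. The sketch below computes the optimal global alignment score for two nucleotide sequences using a linear gap penalty and a simple match/mismatch scheme; the function name and the scoring values are assumptions for illustration. Aligning n sequences the naive way would need an n-dimensional version of the same table.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of two sequences (pairwise DP)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap  # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap  # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    return score[-1][-1]


s = global_alignment_score("ACGT", "AGT")
```

With the default scores, `s` is 1: three matches (A, G, T) and one gap opposite C. The table has (len(a) + 1) × (len(b) + 1) cells, which is the two-sequence instance of the exponential growth in n noted above.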
    Expressed with the big O notation commonly used to measure computational complexity, a naïve MSA takes $O(\mathrm{Length}^{\mathrm{Nseqs}})$ time to produce. Finding the global optimum for n sequences this way has been shown to be an NP-complete problem. In 1989, building on the Carrillo-Lipman algorithm, Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space. In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The method optimizes the sum of the scores of all pairs of characters at each position in the alignment (the so-called sum-of-pairs score) and has been implemented in the MSA software program for constructing multiple sequence alignments. In 2019, Hosseininasab and van Hoeve showed that by using decision diagrams, MSA may be modeled in polynomial space complexity. === Progressive alignment construction === The most widely used approach to multiple sequence alignments uses a heuristic search known as the progressive technique (also known as the hierarchical or tree method), developed by Da-Fei Feng and Doolittle in 1987. Progressive alignment builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a phylogenetic tree, called a guide tree, and a second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial guide tree is determined by an efficient clustering method such as neighbor-joining or unweighted pair group method with arithmetic mean (UPGMA), and may use distances based on the number of identical two-letter sub-sequences (as in FASTA rather than a dynamic programming alignment).
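The guide-tree stage described above needs a cheap similarity measure between sequences. One simple way to count identical two-letter sub-sequences, and to pick the most similar pair with which a progressive method would start, is sketched below. The function names are hypothetical, and this is a simplified stand-in rather than the measure used by FASTA or any specific aligner.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq: str, k: int = 2) -> Counter:
    """Count every length-k sub-sequence (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: str, b: str, k: int = 2) -> int:
    """Number of k-mers the two sequences have in common (with multiplicity)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def most_similar_pair(seqs: list[str], k: int = 2) -> tuple[int, int]:
    """Indices of the pair sharing the most k-mers: the first pair to align."""
    return max(combinations(range(len(seqs)), 2),
               key=lambda pair: shared_kmers(seqs[pair[0]], seqs[pair[1]], k))


seqs = ["GATTACA", "GATTACC", "TTTTTTT"]
first_pair = most_similar_pair(seqs)
```

Here the first two sequences share five 2-mers (GA, AT, TT, TA, AC) while each shares only one with the third, so `first_pair` is `(0, 1)`. A clustering method such as UPGMA would repeat this kind of comparison on a full pairwise distance matrix to build the guide tree.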
Progressive alignments are not guaranteed to be globally optimal. The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. Performance is also particularly poor when all of the sequences in the set are rather distantly related. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. This corrects for non-random selection of the sequences given to the alignment program. Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences. A popular progressive alignment method has been the Clustal family. ClustalW is used extensively for phylogenetic tree construction, in spite of the author's explicit warnings that unedited alignments should not be used in such studies, and as input for protein structure prediction by homology modeling. The European Bioinformatics Institute (EMBL-EBI) announced that ClustalW2 would expire in August 2015. It recommends Clustal Omega, which uses seeded guide trees and HMM profile-profile techniques for protein alignments. An alternative tool for progressive DNA alignments is multiple alignment using fast Fourier transform (MAFFT). Another common progressive alignment method, T-Coffee, is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that align each sequence of the pair to a third sequence. It uses the output from Clustal as well as another local alignment program, LALIGN, which finds multiple regions of local alignment between two sequences. 
The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate w
