AI App On My Phone

AI App On My Phone — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Lazy learning

    Lazy learning

    (Not to be confused with the lazy learning regime, see Neural tangent kernel). In machine learning, lazy learning is a learning method in which generalization of the training data is, in theory, delayed until a query is made to the system, as opposed to eager learning, where the system tries to generalize the training data before receiving queries. The primary motivation for employing lazy learning, as in the K-nearest neighbors algorithm, used by online recommendation systems ("people who viewed/purchased/listened to this movie/item/tune also ...") is that the data set is continuously updated with new entries (e.g., new items for sale at Amazon, new movies to view at Netflix, new clips at YouTube, new music at Spotify or Pandora). Because of the continuous update, the "training data" would be rendered obsolete in a relatively short time especially in areas like books and movies, where new best-sellers or hit movies/music are published/released continuously. Therefore, one cannot really talk of a "training phase". Lazy classifiers are most useful for large, continuously changing datasets with few attributes that are commonly queried. Specifically, even if a large set of attributes exist - for example, books have a year of publication, author/s, publisher, title, edition, ISBN, selling price, etc. - recommendation queries rely on far fewer attributes - e.g., purchase or viewing co-occurrence data, and user ratings of items purchased/viewed. == Advantages == The main advantage gained in employing a lazy learning method is that the target function will be approximated locally, such as in the k-nearest neighbor algorithm. Because the target function is approximated locally for each query to the system, lazy learning systems can simultaneously solve multiple problems and deal successfully with changes in the problem domain. At the same time they can reuse a lot of theoretical and applied results from linear regression modelling (notably PRESS statistic) and control. It is said that the advantage of this system is achieved if the predictions using a single training set are only developed for few objects. This can be demonstrated in the case of the k-NN technique, which is instance-based and function is only estimated locally. == Disadvantages == Theoretical disadvantages with lazy learning include: The large space requirement to store the entire training dataset. In practice, this is not an issue because of advances in hardware and the relatively small number of attributes (e.g., as co-occurrence frequency) that need to be stored. Particularly noisy training data increases the case base unnecessarily, because no abstraction is made during the training phase. In practice, as stated earlier, lazy learning is applied to situations where any learning performed in advance soon becomes obsolete because of changes in the data. Also, for the problems for which lazy learning is optimal, "noisy" data does not really occur - the purchaser of a book has either bought another book or hasn't. Lazy learning methods are usually slower to evaluate. In practice, for very large databases with high concurrency loads, the queries are not postponed until actual query time, but recomputed in advance on a periodic basis - e.g., nightly, in anticipation of future queries, and the answers stored. This way, the next time new queries are asked about existing entries in the database, the answers are merely looked up rapidly instead of having to be computed on the fly, which would almost certainly bring a high-concurrency multi-user system to its knees. Larger training data also entail increased cost. Particularly, there is the fixed amount of computational cost, where a processor can only process a limited amount of training data points. There are standard techniques to improve re-computation efficiency so that a particular answer is not recomputed unless the data that impact this answer has changed (e.g., new items, new purchases, new views). In other words, the stored answers are updated incrementally. This approach, used by large e-commerce or media sites, has long been used in the Entrez portal of the National Center for Biotechnology Information (NCBI) to precompute similarities between the different items in its large datasets: biological sequences, 3-D protein structures, published-article abstracts, etc. Because "find similar" queries are asked so frequently, the NCBI uses highly parallel hardware to perform nightly recomputation. The recomputation is performed only for new entries in the datasets against each other and against existing entries: the similarity between two existing entries need not be recomputed. == Examples of Lazy Learning Methods == K-nearest neighbors, which is a special case of instance-based learning. Local regression. Lazy naive Bayes rules, which are extensively used in commercial spam detection software. Here, the spammers keep getting smarter and revising their spamming strategies, and therefore the learning rules must also be continually updated.

    Read more →
  • The Best Free AI Writing Assistant for Beginners

    The Best Free AI Writing Assistant for Beginners

    Shopping for the best AI writing assistant? An AI writing assistant is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI writing assistant slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →
  • Universal Networking Language

    Universal Networking Language

    Universal Networking Language (UNL) is a declarative formal language specifically designed to represent semantic data extracted from natural language texts. It can be used as a pivot language in interlingual machine translation systems or as a knowledge representation language in information retrieval applications. == Structure == In UNL, the information conveyed by the natural language is represented sentence by sentence as a hypergraph composed of a set of directed binary labeled links between nodes or hypernodes. As an example, the English sentence "The sky was blue?!" can be represented in UNL as follows: In the example above, sky(icl>natural world) and blue(icl>color), which represent individual concepts, are UW's attributes of an object directed to linking the semantic relation between the two UWs; "@def", "@interrogative", "@past", "@exclamation" and "@entry" are attributes modifying UWs. UWs are expressed in natural language to be humanly readable. They consist of a "headword" (the UW root) and a "constraint list" (the UW suffix between parentheses), where the constraints are used to disambiguate the general concept conveyed by the headword. The set of UWs is organized in the UNL Ontology. Relations are intended to represent semantic links between words in every existing language. They can be ontological (such as "icl" and "iof"), logical (such as "and" and "or"), or thematic (such as "agt" = agent, "ins" = instrument, "tim" = time, "plc" = place, etc.). There are currently 46 relations in the UNL Specs that jointly define the UNL syntax. Within the UNL program, the process of representing natural language sentences in UNL graphs is called UNLization, and the process of generating natural language sentences out of UNL graphs is called NLization. UNLization is intended to be carried out semi-automatically (i.e., by humans with computer aids), and NLization is intended to be carried out automatically. == History == The UNL program started in 1996 as an initiative of the Institute of Advanced Studies (IAS) of the United Nations University (UNU) in Tokyo, Japan. In January 2001, the United Nations University set up an autonomous and non-profit organization, the UNDL Foundation, to be responsible for the development and management of the UNL program. It inherited from the UNU/IAS the mandate of implementing the UNL program. The overall architecture of the UNL System has been developed with a set of basic software and tools. It was recognized by the Patent Cooperation Treaty (PCT) for the "industrial applicability" of the UNL, which was obtained in May 2002 through the World Intellectual Property Organization (WIPO); the UNL acquired the US patents 6,704,700 and 7,107,206.

    Read more →
  • Is an AI Writing Assistant Worth It in 2026?

    Is an AI Writing Assistant Worth It in 2026?

    In search of the best AI writing assistant? An AI writing assistant is software that uses machine learning to help you get more done — it turns a rough idea into a polished result in seconds. When choosing one, weigh output quality, pricing, export formats, and how well it fits the tools you already use. Whether you are a beginner or a pro, the right AI writing assistant slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

    Read more →
  • List of online database creator apps

    List of online database creator apps

    This list of online database creator apps lists notable web apps where end users with minimal database administration expertise can create online databases to share with team members. Users need not have the coding skills to manage the solution stack themselves, because the web app already provides this predefined functionality. Such online database creator apps serve the gap between IT professionals (who can manage such a stack themselves) and people who would not create databases at all anyway. In other words, they provide a low-code way of doing database administration. As the concept of low-code development in general continues to evolve, some of the brands that began as online database creator apps are evolving into low-code development platforms for both the databases and the custom apps that use them. Airtable Bubble Caspio Coda.io Microsoft Access web apps plus SharePoint Oracle Application Express aka APEX Quickbase WaveMaker Rapid ZohoCreator

    Read more →
  • Cortana (virtual assistant)

    Cortana (virtual assistant)

    Cortana is a discontinued virtual assistant developed by Microsoft that used the Bing search engine to perform tasks such as setting reminders and answering questions for users. Cortana was available in English, Portuguese, French, German, Italian, Spanish, Chinese, and Japanese language editions, depending on the software platform and region in which it was used. In 2019, Microsoft began reducing the prevalence of Cortana and converting it from an assistant into different software integrations. It was split from the Windows 10 search bar in April 2019. In January 2020, the Cortana mobile app was removed from certain markets, and on March 31, 2021, the Cortana mobile app was shut down globally. On June 2, 2023, Microsoft announced that support for the Cortana standalone app on Microsoft Windows would end in late 2023 and would be replaced by Microsoft Copilot, an AI chatbot. Support for Cortana in the Microsoft Outlook and Microsoft 365 mobile apps was discontinued in fall of 2023. == History == === Beginnings (2009–2014) === The development of Cortana started in 2009 in the Microsoft Speech products team with general manager Zig Serafin and Chief Scientist Larry Heck. Heck and Serafin established the vision, mission, and long-range plan for Microsoft's digital personal assistant and they built a team with the expertise to create the initial prototypes for Cortana. Some of the key researchers in these early efforts included Microsoft Research researchers Dilek Hakkani-Tür, Gokhan Tur, Andreas Stolcke, and Malcolm Slaney, research software developer Madhu Chinthakunta, and user experience designer Lisa Stifelman. To develop the Cortana digital assistant, the team interviewed human personal assistants. The interviews inspired a number of unique features in Cortana, including the assistant's "notebook" feature. Originally, Cortana was meant to be only a codename, but a petition on Windows Phone's UserVoice site proved to be popular and made the codename official. Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows. It was named after Cortana, a synthetic intelligence character in Microsoft's Halo video game franchise originating in Bungie folklore, with Jen Taylor, the character's voice actress, returning to voice the personal assistant's US-specific version. === Expansion (2015–2018) === In January 2015, Microsoft announced the availability of Cortana for Windows 10 desktops and mobile devices as part of merging Windows Phone into the operating system at large. On May 26, 2015, Microsoft announced that Cortana would also be available on other mobile platforms. An Android release was set for July 2015, but the Android APK file containing Cortana was leaked ahead of its release. It was officially released, along with an iOS version, in December 2015. During E3 2015, Microsoft announced that Cortana would come to the Xbox One as part of a universally designed Windows 10 update for the console. Microsoft integrated Cortana into numerous products such as Microsoft Edge. Microsoft's Cortana assistant was deeply integrated into the browser. Cortana was able to find opening hours when on restaurant sites, show retail coupons for websites, or show weather information in the address bar. At the Worldwide Partners Conference 2015 Microsoft demonstrated Cortana integration with products such as GigJam. Conversely, Microsoft announced in late April 2016 that it would block anything other than Bing and Edge from being used to complete Cortana searches, again raising questions of anti-competitive practices by the company. Microsoft's "Windows in the car" concept included Cortana. The concept makes it possible for drivers to make restaurant reservations and see places before they go there. At Microsoft Build 2016, Microsoft announced plans to integrate Cortana into Skype (Microsoft's video-conferencing and instant messaging service) as a bot to allow users to order food, book trips, transcribe video messages and make calendar appointments through Cortana in addition to other bots. As of 2016, Cortana was able to underline certain words and phrases in Skype conversations that relate to contacts and corporations. A writer from Engadget has criticised the Cortana integration in Skype for responding only to very specific keywords, feeling as if she was "chatting with a search engine" due to the impersonal way the bots replied to certain words such as "Hello" causing the Bing Music bot to bring up Adele's song of that name. Microsoft also announced at Microsoft Build 2016 that Cortana would be able to cloud-synchronise notifications between Windows 10 Mobile's and Windows 10's Action Center, as well as notifications from Android devices. In December 2016, Microsoft announced the preview of Calendar.help, a service that enabled people to delegate the scheduling of meetings to Cortana. Users interact with Cortana by including her in email conversations. Cortana would then check people's availability in Outlook Calendar or Google Calendar, and work with others Cc'd on the email to schedule the meeting. The service relied on automation and human-based computation. In May 2017, Microsoft announced INVOKE, a voice-activated speaker featuring Cortana, in collaboration with Harman Kardon. The premium speaker has a cylindrical design and offers 360-degree sound, the ability to make and receive calls with Skype, and all of the other features currently available with Cortana. In 2017, Microsoft partnered with Amazon to integrate Echo and Cortana with each other, allowing users of each smart assistant to summon the other via a command. This feature preview was released in August 2018. Windows 10 users were able to just say "Hey Cortana, open Alexa" and Echo users were able to say "Alexa, open Cortana" to summon the other assistant. === Decreasing focus and discontinuation (2019–2024) === In January 2019, Microsoft CEO Satya Nadella stated that he no longer saw Cortana as a direct competitor against Alexa and Siri. Shortly thereafter, Microsoft began reducing the prevalence of Cortana and converting it from an assistant into different software integrations. It was split from the Windows 10 search bar in April 2019. In January 2020, the Cortana mobile app was removed from certain markets, and then, on July 24, 2020, Cortana was removed from the Xbox dashboard as part of a redesign. On January 31, 2021, Microsoft removed the Cortana mobile application in many markets, including the UK, Australia, Germany, Mexico, China, Spain, Canada, and India. On March 31, 2021, Microsoft shut down the Cortana apps globally for iOS and Android and removed the apps entirely from their corresponding app stores. To access previously recorded content, users had to use Cortana on Windows 10 or other specialized Microsoft applications. Microsoft also reduced emphasis on Cortana in Windows with the 2021 release of Windows 11. Cortana was not used during the device setup process or pinned to the taskbar by default. On June 2, 2023, Microsoft announced the Cortana standalone app on Windows 10 and Windows 11 which would shut down later in the year. In its support article, Microsoft listed several alternatives, most of which have since been rebranded as Microsoft Copilot. They also added that the change would not impact Cortana in Office 365 and Teams environments. On August 11, 2023, Microsoft updated the Cortana standalone app in Windows, informing that it was deprecated and can no longer be used. Microsoft's support article announcing the deprecation of Cortana was updated to reflect this change. Along with the deprecation of the standalone app, it was announced that Cortana support in Teams mobile, Microsoft Teams displays, and Teams rooms would end in late 2023. The support article states that Cortana in the “Play my emails” feature of the Microsoft Outlook mobile app would continue to be available. Later in June 2024, the support article was updated, stating that Cortana in the voice search and the "Play my emails" feature is now removed from the Microsoft Outlook mobile app, officially marking the discontinuation of Cortana across all Microsoft products. On May 22, 2024, Microsoft announced the Windows 11 24H2 update, which removed Cortana, Tips, and WordPad from systems. == Functionality == Cortana was able to set reminders, recognize natural voice without the requirement for keyboard input, and answer questions using information from the Bing search engine. Searches using Windows 10 are made only with the Microsoft Bing search engine, and all links will open with Microsoft Edge, except when a screen reader such as Narrator was being used, where the links will open in Internet Explorer. Windows Phone 8.1's universal Bing SmartSearch features were incorporated into Cortana, which replaced the

    Read more →
  • Deborah Raji

    Deborah Raji

    Inioluwa Deborah Raji (born 1995/1996) is a Nigerian-Canadian computer scientist and socio-tech leader who works on algorithmic bias, AI accountability, and algorithmic auditing. A current Mozilla fellow, she has been recognized by MIT Technology Review and Forbes as one of the world's top young innovators. Raji started her work with racial bias in technology during her internship with Clarifai when she recognized that people of color were more often tagged for NSFW compared to white people. Raji has previously worked with Joy Buolamwini, Timnit Gebru, and the Algorithmic Justice League on researching gender and racial bias in facial recognition technology. Her work on racial bias in facial recognition has forced companies to ultimately change their practices. She has also worked with Google’s Ethical AI team and been a research fellow at the Partnership on AI and AI Now Institute at New York University working on how to operationalize ethical considerations in machine learning engineering practice. She was working on a computer vision model that would help clients flag inappropriate images as NSFW. == Early life and education == Raji was born in Port Harcourt, Nigeria, and moved to Mississauga, Ontario, Canada, when she was four years old. Eventually her family moved to Ottawa. She attended Colonel By Secondary School and completed the International Baccalaureate programme. She studied Engineering Science at the University of Toronto, graduating in 2019. In 2015, she founded Project Include, a nonprofit providing increased student access to engineering education, mentorship, and resources in low income and immigrant communities in the Greater Toronto Area. She started a Doctor of Philosophy - PhD, in Computer Science from the University of California, Berkeley in Aug 2021. == Career and research == Raji worked with Joy Buolamwini at the MIT Media Lab and Algorithmic Justice League, where she audited commercial facial recognition technologies from Microsoft, Amazon, IBM, Face++, and Kairos. They found that these technologies were significantly less accurate for darker-skinned women than for white men. With support from other top AI researchers and increased public pressure and campaigning, their work led IBM and Amazon to agree to support facial recognition regulation and later halt the sale of their product to police for at least a year. Raji also interned at machine learning startup Clarifai, where she worked on a computer vision model for flagging images. She participated in a research mentorship program at Google and worked with their Ethical AI team on creating model cards, a documentation framework for more transparent machine learning model reporting. She also co-led the development of internal auditing practices at Google. Her contributions at Google were separately presented and published at the AAAI conference and ACM Conference on Fairness, Accountability, and Transparency. In 2019, Raji was a summer research fellow at The Partnership on AI working on setting industry machine learning transparency standards and benchmarking norms. Raji was a Tech Fellow at the AI Now Institute worked on algorithmic and AI auditing. Currently, she is a fellow at the Mozilla Foundation researching algorithmic auditing and evaluation. Raji's work on bias in facial recognition systems has been highlighted in the 2020 documentary Coded Bias directed by Shalini Kantayya. She also took part in the 2026 documentary The AI Doc: Or How I Became an Apocaloptimist directed by Daniel Roher. == Awards == 2019 Venture Beat AI Innovations Award in category AI for Good (received with Joy Buolamwini and Timnit Gebru) 2020 MIT Technology Review 35 Under 35 Innovator Award 2020 EFF Pioneer Award (received with Buolamwini and Gebru) 2021 Forbes 30 Under 30 Award in Enterprise Technology 2021 100 Brilliant Women in AI Ethics Hall of Fame Honoree 2023 Time magazine 100 Most Influential People in AI

    Read more →
  • Geoffrey J. Gordon

    Geoffrey J. Gordon

    Geoffrey J. Gordon is a professor at the Machine Learning Department at Carnegie Mellon University in Pittsburgh and director of research at the Microsoft Montréal lab. He is known for his research in statistical relational learning (a subdiscipline of artificial intelligence and machine learning) and on anytime dynamic variants of the A search algorithm. His research interests include multi-agent planning, reinforcement learning, decision-theoretic planning, statistical models of difficult data (e.g. maps, video, text), computational learning theory, and game theory. Gordon received a B.A. in computer science from Cornell University in 1991, and a PhD at Carnegie Mellon in 1999.

    Read more →
  • Referring expression generation

    Referring expression generation

    Referring expression generation (REG) is the subtask of natural language generation (NLG) that received most scholarly attention. While NLG is concerned with the conversion of non-linguistic information into natural language, REG focuses only on the creation of referring expressions (noun phrases) that identify specific entities called targets. This task can be split into two sections. The content selection part determines which set of properties distinguish the intended target and the linguistic realization part defines how these properties are translated into natural language. A variety of algorithms have been developed in the NLG community to generate different types of referring expressions. == Types of referring expressions == A referring expression (RE), in linguistics, is any noun phrase, or surrogate for a noun phrase, whose function in discourse is to identify some individual object (thing, being, event...) The technical terminology for identify differs a great deal from one school of linguistics to another. The most widespread term is probably refer, and a thing identified is a referent, as for example in the work of John Lyons. In linguistics, the study of reference relations belongs to pragmatics, the study of language use, though it is also a matter of great interest to philosophers, especially those wishing to understand the nature of knowledge, perception and cognition more generally. Various devices can be used for reference: determiners, pronouns, proper names... Reference relations can be of different kinds; referents can be in a "real" or imaginary world, in discourse itself, and they may be singular, plural, or collective. === Pronouns === The simplest type of referring expressions are pronoun such as he and it. The linguistics and natural language processing communities have developed various models for predicting anaphor referents, such as centering theory, and ideally referring-expression generation would be based on such models. However most NLG systems use much simpler algorithms, for example using a pronoun if the referent was mentioned in the previous sentence (or sentential clause), and no other entity of the same gender was mentioned in this sentence. === Definite noun phrases === There has been a considerable amount of research on generating definite noun phrases, such as the big red book. Much of this builds on the model proposed by Dale and Reiter. This has been extended in various ways, for example Krahmer et al. present a graph-theoretic model of definite NP generation with many nice properties. In recent years a shared-task event has compared different algorithms for definite NP generation, using the TUNA corpus. === Spatial and temporal reference === Recently there has been more research on generating referring expressions for time and space. Such references tend to be imprecise (what is the exact meaning of tonight?), and also to be interpreted in different ways by different people. Hence it may be necessary to explicitly reason about false positive vs false negative tradeoffs, and even calculate the utility of different possible referring expressions in a particular task context. === Criteria for good expressions === Ideally, a good referring expression should satisfy a number of criteria: Referential success: It should unambiguously identify the referent to the reader. Ease of comprehension: The reader should be able to quickly read and understand it. Computational complexity: The generation algorithm should be fast No false inferences: The expression should not confuse or mislead the reader by suggesting false implicatures or other pragmatic inferences. For example, a reader may be confused if he is told Sit by the brown wooden table in a context where there is only one table. == History == === Pre-2000 era === REG goes back to the early days of NLG. One of the first approaches was done by Winograd in 1972 who developed an "incremental" REG algorithm for his SHRDLU program. Afterwards researchers started to model the human abilities to create referring expressions in the 1980s. This new approach to the topic was influenced by the researchers Appelt and Kronfeld who created the programs KAMP and BERTRAND and considered referring expressions as parts of bigger speech acts. Some of their most interesting findings were the fact that referring expressions can be used to add information beyond the identification of the referent as well as the influence of communicative context and the Gricean maxims on referring expressions. Furthermore, its skepticism concerning the naturalness of minimal descriptions made Appelt and Kronfeld's research a foundation of later work on REG. The search for simple, well-defined problems changed the direction of research in the early 1990s. This new approach was led by Dale and Reiter who stressed the identification of the referent as the central goal. Like Appelt they discuss the connection between the Gricean maxims and referring expressions in their culminant paper in which they also propose a formal problem definition. Furthermore, Reiter and Dale discuss the Full Brevity and Greedy Heuristics algorithms as well as their Incremental Algorithm(IA) which became one of the most important algorithms in REG. === Later developments === After 2000 the research began to lift some of the simplifying assumptions, that had been made in early REG research in order to create more simple algorithms. Different research groups concentrated on different limitations creating several expanded algorithms. Often these extend the IA in a single perspective for example in relation to: Reference to Sets like "the t-shirt wearers" or "the green apples and the banana on the left" Relational Descriptions like "the cup on the table" or "the woman who has three children" Context Dependency, Vagueness and Gradeability include statements like "the older man" or "the car on the left" which are often unclear without a context Salience and Generation of Pronouns are highly discourse dependent making for example "she" a reference to "the (most salient) female person" Many simplifying assumptions are still in place or have just begun to be worked on. Also a combination of the different extensions has yet to be done and is called a "non-trivial enterprise" by Krahmer and van Deemter. Another important change after 2000 was the increasing use of empirical studies in order to evaluate algorithms. This development took place due to the emergence of transparent corpora. Although there are still discussions about what the best evaluation metrics are, the use of experimental evaluation has already led to a better comparability of algorithms, a discussion about the goals of REG and more task-oriented research. Furthermore, research has extended its range to related topics such as the choice of Knowledge Representation(KR) Frameworks. In this area the main question, which KR framework is most suitable for the use in REG remains open. The answer to this question depends on how well descriptions can be expressed or found. A lot of the potential of KR frameworks has been left unused so far. Some of the different approaches are the usage of: Graph search which treats relations between targets in the same way as properties. Constraint Satisfaction which allows for a separation between problem specification and the implementation. Modern Knowledge Representation which offers logical inference in for example Description Logic or Conceptual Graphs. == Problem definition == Dale and Reiter (1995) think about referring expressions as distinguishing descriptions. They define: The referent as the entity that should be described The context set as set of salient entities The contrast set or potential distractors as all elements of the context set except the referent A property as a reference to a single attribute–value pair Each entity in the domain can be characterised as a set of attribute–value pairs for example ⟨ {\displaystyle \langle } type, dog ⟩ {\displaystyle \rangle } , ⟨ {\displaystyle \langle } gender, female ⟩ {\displaystyle \rangle } or ⟨ {\displaystyle \langle } age, 10 years ⟩ {\displaystyle \rangle } . The problem then is defined as follows: Let r {\displaystyle r} be the intended referent, and C {\displaystyle C} be the contrast set. Then, a set L {\displaystyle L} of attribute–value pairs will represent a distinguishing description if the following two conditions hold: Every attribute–value pair in L {\displaystyle L} applies to r {\displaystyle r} : that is, every element of L {\displaystyle L} specifies an attribute–value that r {\displaystyle r} possesses. For every member c {\displaystyle c} of C {\displaystyle C} , there is at least one element l {\displaystyle l} of L {\displaystyle L} that does not apply to c {\displaystyle c} : that is, there is an l {\displaystyle l} in L {\displaystyle L} that specifies an attribute–value that c {\displaystyle c} does not possess. l {\displaystyle l} is said

    Read more →
  • Deterministic finite automaton

    Deterministic finite automaton

    In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. Deterministic refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943. The figure illustrates a deterministic finite automaton using a state diagram. In this example automaton, there are three states: S0, S1, and S2 (denoted graphically by circles). The automaton takes a finite sequence of 0s and 1s as input. For each state, there is a transition arrow leading out to a next state for both 0 and 1. Upon reading a symbol, a DFA jumps deterministically from one state to another by following the transition arrow. For example, if the automaton is currently in state S0 and the current input symbol is 1, then it deterministically jumps to state S1. A DFA has a start state (denoted graphically by an arrow coming in from nowhere) where computations begin, and a set of accept states (denoted graphically by a double circle) which help define when a computation is successful. A DFA is defined as an abstract mathematical concept, but is often implemented in hardware and software for solving various specific problems such as lexical analysis and pattern matching. For example, a DFA can model software that decides whether or not online user input such as email addresses are syntactically valid. DFAs have been generalized to nondeterministic finite automata (NFA) which may have several arrows of the same label starting from a state. Using the powerset construction method, every NFA can be translated to a DFA that recognizes the same language. DFAs, and NFAs as well, recognize exactly the set of regular languages. == Formal definition == A deterministic finite automaton M is a 5-tuple, (Q, Σ, δ, q0, F), consisting of a finite set of states Q a finite set of input symbols called the alphabet Σ a transition function δ : Q × Σ → Q an initial (or start) state q 0 ∈ Q {\displaystyle q_{0}\in Q} a set of accepting (or final) states F ⊆ Q {\displaystyle F\subseteq Q} Let w = a1a2...an be a string over the alphabet Σ. The automaton M accepts the string w if a sequence of states, r0, r1, ..., rn, exists in Q with the following conditions: r0 = q0 ri+1 = δ(ri, ai+1), for i = 0, ..., n − 1 r n ∈ F {\displaystyle r_{n}\in F} . In words, the first condition says that the machine starts in the start state q0. The second condition says that given each character of string w, the machine will transition from state to state according to the transition function δ. The last condition says that the machine accepts w if the last input of w causes the machine to halt in one of the accepting states. Otherwise, it is said that the automaton rejects the string. The set of strings that M accepts is the language recognized by M and this language is denoted by L(M). A deterministic finite automaton without accept states and without a starting state is known as a transition system or semiautomaton. For more comprehensive introduction of the formal definition see automata theory. == Example == The following example is of a DFA M, with a binary alphabet, which requires that the input contains an even number of 0s. M = (Q, Σ, δ, q0, F) where Q = {S1, S2} Σ = {0, 1} q0 = S1 F = {S1} and δ is defined by the following state transition table: The state S1 represents that there has been an even number of 0s in the input so far, while S2 signifies an odd number. A 1 in the input does not change the state of the automaton. When the input ends, the state will show whether the input contained an even number of 0s or not. If the input did contain an even number of 0s, M will finish in state S1, an accepting state, so the input string will be accepted. The language recognized by M is the regular language given by the regular expression (1) (0 (1) 0 (1)), where is the Kleene star, e.g., 1 denotes any number (possibly zero) of consecutive ones. == Variations == === Complete and incomplete === According to the above definition, deterministic finite automata are always complete: they define from each state a transition for each input symbol. While this is the most common definition, some authors use the term deterministic finite automaton for a slightly different notion: an automaton that defines at most one transition for each state and each input symbol; the transition function is allowed to be partial. When no transition is defined, such an automaton halts. === Local automata === A local automaton is a DFA, not necessarily complete, for which all edges with the same label lead to a single vertex. Local automata accept the class of local languages, those for which membership of a word in the language is determined by a "sliding window" of length two on the word. A Myhill graph over an alphabet A is a directed graph with vertex set A and subsets of vertices labelled "start" and "finish". The language accepted by a Myhill graph is the set of directed paths from a start vertex to a finish vertex: the graph thus acts as an automaton. The class of languages accepted by Myhill graphs is the class of local languages. === Randomness === When the start state and accept states are ignored, a DFA of n states and an alphabet of size k can be seen as a digraph of n vertices in which all vertices have k out-arcs labeled 1, ..., k (a k-out digraph). It is known that when k ≥ 2 is a fixed integer, with high probability, the largest strongly connected component (SCC) in such a k-out digraph chosen uniformly at random is of linear size and it can be reached by all vertices. It has also been proven that if k is allowed to increase as n increases, then the whole digraph has a phase transition for strong connectivity similar to Erdős–Rényi model for connectivity. In a random DFA, the maximum number of vertices reachable from one vertex is very close to the number of vertices in the largest SCC with high probability. This is also true for the largest induced sub-digraph of minimum in-degree one, which can be seen as a directed version of 1-core. == Closure properties == If DFAs recognize the languages that are obtained by applying an operation on the DFA recognizable languages then DFAs are said to be closed under the operation. The DFAs are closed under the following operations. For each operation, an optimal construction with respect to the number of states has been determined in state complexity research. Since DFAs are equivalent to nondeterministic finite automata (NFA), these closures may also be proved using closure properties of NFA. == As a transition monoid == A run of a given DFA can be seen as a sequence of compositions of a very general formulation of the transition function with itself. Here we construct that function. For a given input symbol a ∈ Σ {\displaystyle a\in \Sigma } , one may construct a transition function δ a : Q → Q {\displaystyle \delta _{a}:Q\rightarrow Q} by defining δ a ( q ) = δ ( q , a ) {\displaystyle \delta _{a}(q)=\delta (q,a)} for all q ∈ Q {\displaystyle q\in Q} . (This trick is called currying.) From this perspective, δ a {\displaystyle \delta _{a}} "acts" on a state in Q to yield another state. One may then consider the result of function composition repeatedly applied to the various functions δ a {\displaystyle \delta _{a}} , δ b {\displaystyle \delta _{b}} , and so on. Given a pair of letters a , b ∈ Σ {\displaystyle a,b\in \Sigma } , one may define a new function δ ^ a b = δ a ∘ δ b {\displaystyle {\widehat {\delta }}_{ab}=\delta _{a}\circ \delta _{b}} , where ∘ {\displaystyle \circ } denotes function composition. Clearly, this process may be recursively continued, giving the following recursive definition of δ ^ : Q × Σ ⋆ → Q {\displaystyle {\widehat {\delta }}:Q\times \Sigma ^{\star }\rightarrow Q} : δ ^ ( q , ϵ ) = q {\displaystyle {\widehat {\delta }}(q,\epsilon )=q} , where ϵ {\displaystyle \epsilon } is the empty string and δ ^ ( q , w a ) = δ a ( δ ^ ( q , w ) ) {\displaystyle {\widehat {\delta }}(q,wa)=\delta _{a}({\widehat {\delta }}(q,w))} , where w ∈ Σ ∗ , a ∈ Σ {\displaystyle w\in \Sigma ^{},a\in \Sigma } and q ∈ Q {\displaystyle q\in Q} . δ ^ {\displaystyle {\widehat {\delta }}} is defined for all words w ∈ Σ ∗ {\displaystyle w\in \Sigma ^{}} . A run of the DFA is a sequence of compositions of δ ^ {\displaystyle {\widehat {\delta }}} with itself. Repeated function composition forms a monoid. For the transition functions, this monoid is known as the transition monoid, or sometimes the transformation semigroup. The construction can also be reversed: given a δ ^ {\displaystyle {\wide

    Read more →
  • Maike Osborne

    Maike Osborne

    Maike Osborne (born Michael Osborne, 1982) is an Australian academic and scientist who serves as a professor of machine learning at University of Oxford in the Machine Learning Research Group in the Department of Engineering Science. In 2016 she co-founded Mind Foundry, an artificial intelligence company, along with fellow professor Stephen Roberts. == Education == She has a BEng in Mechanical Engineering and a BSc in both Pure Mathematics and Physics from the University of Western Australia. She has a PhD in Machine Learning from the University of Oxford. == Career == Osborne has contributed to over 100 publications, and her work has received over 24,000 citations with an h-index of 46 according to Google Scholar. and has acted as principal or co-investigator for £10.6M of research funding. Her career has focused in particular on Bayesian approaches to AI and machine learning, named after the famous British statistician Thomas Bayes. Osborne's work has contributed to Probabilistic numerics, with Osborne co-authoring the first textbook on the subject. In 2013, Osborne co-authored a paper alongside Swedish-German economist Carl Benedikt Frey called "The Future of Employment: How Susceptible are Jobs to Computerisation?". The paper has received over 13,000 citations and extensive media coverage. In 2023 Osborne gave oral evidence to the UK House of Commons Science and Technology Committee on the subject of the "Governance of Artificial Intelligence". Her testimony received significant coverage around her warnings of the threat of "rogue AI". == Honors == She is also an Official Fellow of Exeter College, and St Peter's College, Oxford, a Fellow of the ELLIS society, and a Faculty Member of the Oxford-Man Institute of Quantitative Finance. She joined the Oxford Martin School as Lead Researcher on the Oxford Martin Programme on Technology and Employment in 2015. She is a Director of the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems.

    Read more →
  • Jian Ma (computational biologist)

    Jian Ma (computational biologist)

    Jian Ma (Chinese: 马坚) is an American computer scientist and computational biologist. He is the Ray and Stephanie Lane Professor of Computational Biology in the School of Computer Science at Carnegie Mellon University. He is a faculty member in the Ray and Stephanie Lane Computational Biology Department. His lab develops AI/ML methods to study the structure and function of the human genome and cellular organization and their implications for health and disease. During his Ph.D. and postdoc training, he developed algorithms to reconstruct the ancestral mammalian genome and evolutionary history. His research group has recently pioneered a series of new machine learning solutions for 3D genome organization, single-cell epigenomics, spatial omics, and complex molecular interactions. His lab also explores large language models to uncover gene regulatory mechanisms and the intricate connections among cellular components, with the aim of driving discovery and guiding experimentation. He received an NSF CAREER award in 2011. In 2020, he was awarded a Guggenheim Fellowship in Computer Science. He received the Allen Newell Award for Research Excellence (2025). He is an elected Fellow of the American Association for the Advancement of Science, the American Institute for Medical and Biological Engineering, the International Society for Computational Biology, and the Association for Computing Machinery. He leads an NIH 4D Nucleome Center to develop machine learning algorithms to better understand the cell nucleus. He served as the Program Chair for RECOMB 2024. He is also a member of the Scientific Advisory Board of the Chan Zuckerberg Biohub Chicago (CZ Biohub Chicago) and the RECOMB Steering Committee. In 2024, he launched the Center for AI-Driven Biomedical Research (AI4BIO) at CMU, which will be a catalyst for innovations at the intersection of AI and biomedicine across the School of Computer Science and campus. == Selected Recent Publications == Chen V#, Yang M#, Cui W, Kim JS, Talwalkar A, and Ma J. Applying interpretable machine learning in computational biology - pitfalls, recommendations and opportunities for new developments. Nature Methods, 21(8):1454-1461, 2024. Xiong K#, Zhang R#, and Ma J. scGHOST: Identifying single-cell 3D genome subcompartments. Nature Methods, 21(5):814-822, 2024. Zhou T, Zhang R, Jia D, Doty RT, Munday AD, Gao D, Xin L, Abkowitz JL, Duan Z, and Ma J. GAGE-seq concurrently profiles multiscale 3D genome organization and gene expression in single cells. Nature Genetics, 56(8):1701-1711, 2024. Zhang Y, Boninsegna L, Yang M, Misteli T, Alber F, and Ma J. Computational methods for analysing multiscale 3D genome organization. Nature Reviews Genetics, 5(2):123-141, 2024. Chidester B#, Zhou T#, Alam S, and Ma J. SPICEMIX enables integrative single-cell spatial modeling of cell identity. Nature Genetics, 55(1):78-88, 2023. [Cover Article] Zhang R#, Zhou T#, and Ma J. Ultrafast and interpretable single-cell 3D genome analysis with Fast-Higashi. Cell Systems, 13(10):P798-807.E6, 2022. [Cover Article] Zhu X#, Zhang Y#, Wang Y, Tian D, Belmont AS, Swedlow JR, and Ma J. Nucleome Browser: An integrative and multimodal data navigation platform for 4D Nucleome. Nature Methods, 19(8):911-913, 2022. Zhang R, Zhou T, and Ma J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nature Biotechnology, 40:254–261, 2022.

    Read more →
  • Evolvability (computer science)

    Evolvability (computer science)

    The term evolvability is a framework of computational learning introduced by Leslie Valiant in his paper of the same name. The aim of this theory is to model biological evolution and categorize which types of mechanisms are evolvable. Evolution is an extension of PAC learning and learning from statistical queries. == General framework == Let F n {\displaystyle F_{n}\,} and R n {\displaystyle R_{n}\,} be collections of functions on n {\displaystyle n\,} variables. Given an ideal function f ∈ F n {\displaystyle f\in F_{n}} , the goal is to find by local search a representation r ∈ R n {\displaystyle r\in R_{n}} that closely approximates f {\displaystyle f\,} . This closeness is measured by the performance Perf ⁡ ( f , r ) {\displaystyle \operatorname {Perf} (f,r)} of r {\displaystyle r\,} with respect to f {\displaystyle f\,} . As is the case in the biological world, there is a difference between genotype and phenotype. In general, there can be multiple representations (genotypes) that correspond to the same function (phenotype). That is, for some r , r ′ ∈ R n {\displaystyle r,r'\in R_{n}} , with r ≠ r ′ {\displaystyle r\neq r'\,} , still r ( x ) = r ′ ( x ) {\displaystyle r(x)=r'(x)\,} for all x ∈ X n {\displaystyle x\in X_{n}} . However, this need not be the case. The goal then, is to find a representation that closely matches the phenotype of the ideal function, and the spirit of the local search is to allow only small changes in the genotype. Let the neighborhood N ( r ) {\displaystyle N(r)\,} of a representation r {\displaystyle r\,} be the set of possible mutations of r {\displaystyle r\,} . For simplicity, consider Boolean functions on X n = { − 1 , 1 } n {\displaystyle X_{n}=\{-1,1\}^{n}\,} , and let D n {\displaystyle D_{n}\,} be a probability distribution on X n {\displaystyle X_{n}\,} . Define the performance in terms of this. Specifically, Perf ⁡ ( f , r ) = ∑ x ∈ X n f ( x ) r ( x ) D n ( x ) . {\displaystyle \operatorname {Perf} (f,r)=\sum _{x\in X_{n}}f(x)r(x)D_{n}(x).} Note that Perf ⁡ ( f , r ) = Prob ⁡ ( f ( x ) = r ( x ) ) − Prob ⁡ ( f ( x ) ≠ r ( x ) ) . {\displaystyle \operatorname {Perf} (f,r)=\operatorname {Prob} (f(x)=r(x))-\operatorname {Prob} (f(x)\neq r(x)).} In general, for non-Boolean functions, the performance will not correspond directly to the probability that the functions agree, although it will have some relationship. Throughout an organism's life, it will only experience a limited number of environments, so its performance cannot be determined exactly. The empirical performance is defined by Perf s ⁡ ( f , r ) = 1 s ∑ x ∈ S f ( x ) r ( x ) , {\displaystyle \operatorname {Perf} _{s}(f,r)={\frac {1}{s}}\sum _{x\in S}f(x)r(x),} where S {\displaystyle S\,} is a multiset of s {\displaystyle s\,} independent selections from X n {\displaystyle X_{n}\,} according to D n {\displaystyle D_{n}\,} . If s {\displaystyle s\,} is large enough, evidently Perf s ⁡ ( f , r ) {\displaystyle \operatorname {Perf} _{s}(f,r)} will be close to the actual performance Perf ⁡ ( f , r ) {\displaystyle \operatorname {Perf} (f,r)} . Given an ideal function f ∈ F n {\displaystyle f\in F_{n}} , initial representation r ∈ R n {\displaystyle r\in R_{n}} , sample size s {\displaystyle s\,} , and tolerance t {\displaystyle t\,} , the mutator Mut ⁡ ( f , r , s , t ) {\displaystyle \operatorname {Mut} (f,r,s,t)} is a random variable defined as follows. Each r ′ ∈ N ( r ) {\displaystyle r'\in N(r)} is classified as beneficial, neutral, or deleterious, depending on its empirical performance. Specifically, r ′ {\displaystyle r'\,} is a beneficial mutation if Perf s ⁡ ( f , r ′ ) − Perf s ⁡ ( f , r ) ≥ t {\displaystyle \operatorname {Perf} _{s}(f,r')-\operatorname {Perf} _{s}(f,r)\geq t} ; r ′ {\displaystyle r'\,} is a neutral mutation if − t < Perf s ⁡ ( f , r ′ ) − Perf s ⁡ ( f , r ) < t {\displaystyle -t<\operatorname {Perf} _{s}(f,r')-\operatorname {Perf} _{s}(f,r) 0 {\displaystyle \epsilon >0\,} , for all ideal functions f ∈ F n {\displaystyle f\in F_{n}} and representations r 0 ∈ R n {\displaystyle r_{0}\in R_{n}} , with probability at least 1 − ϵ {\displaystyle 1-\epsilon \,} , Perf ⁡ ( f , r g ( n , 1 / ϵ ) ) ≥ 1 − ϵ , {\displaystyle \operatorname {Perf} (f,r_{g(n,1/\epsilon )})\geq 1-\epsilon ,} where the sizes of neighborhoods N ( r ) {\displaystyle N(r)\,} for r ∈ R n {\displaystyle r\in R_{n}\,} are at most p ( n , 1 / ϵ ) {\displaystyle p(n,1/\epsilon )\,} , the sample size is s ( n , 1 / ϵ ) {\displaystyle s(n,1/\epsilon )\,} , the tolerance is t ( 1 / n , ϵ ) {\displaystyle t(1/n,\epsilon )\,} , and the generation size is g ( n , 1 / ϵ ) {\displaystyle g(n,1/\epsilon )\,} . F {\displaystyle F\,} is evolvable over D {\displaystyle D\,} if it is evolvable by some R {\displaystyle R\,} over D {\displaystyle D\,} . F {\displaystyle F\,} is evolvable if it is evolvable over all distributions D {\displaystyle D\,} . == Results == The class of conjunctions and the class of disjunctions are evolvable over the uniform distribution for short conjunctions and disjunctions, respectively. The class of parity functions (which evaluate to the parity of the number of true literals in a given subset of literals) are not evolvable, even for the uniform distribution. Evolvability implies PAC learnability.

    Read more →
  • Tamara Berg

    Tamara Berg

    Tamara Lee Berg is a tenured associate professor at the University of North Carolina at Chapel Hill and a research scientist manager at Facebook AML/FAIR. == Education == Berg obtained her PhD in computer science from the University of California, Berkeley in 2007 as a member of the Berkeley Computer Vision Group. She was an assistant professor at Stony Brook University from 2008 to 2013 before joining University of North Carolina Chapel Hill in 2013. == Research == Berg's research interests are at the boundary of computer vision and natural language processing. In particular, she focuses on understanding the connections between vision and language, for example, to automatically identify people in news photographs, for generating natural language descriptions for images, or for recognising clothing and style. == Selected awards and honours == 2019 Mark Everingham Prize 2013 Marr Prize at the International Conference on Computer Vision 2011 National Science Foundation Career Award == Personal life == Berg is married to fellow computer vision researcher Alexander Berg.

    Read more →
  • Multiple sequence alignment

    Multiple sequence alignment

    Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations (single amino acid or nucleotide changes), insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides. Multiple sequence alignments require more sophisticated methodologies than pairwise alignments, as they are more computationally complex. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive. However, heuristic methods generally cannot guarantee high-quality solutions and have been shown to fail to yield near-optimal solutions on benchmark test cases. == Problem statement == Given m {\displaystyle m} sequences S i {\displaystyle S_{i}} , i = 1 , ⋯ , m {\displaystyle i=1,\cdots ,m} similar to the form below: S := { S 1 = ( S 11 , S 12 , … , S 1 n 1 ) S 2 = ( S 21 , S 22 , ⋯ , S 2 n 2 ) ⋮ S m = ( S m 1 , S m 2 , … , S m n m ) {\displaystyle S:={\begin{cases}S_{1}=(S_{11},S_{12},\ldots ,S_{1n_{1}})\\S_{2}=(S_{21},S_{22},\cdots ,S_{2n_{2}})\\\,\,\,\,\,\,\,\,\,\,\vdots \\S_{m}=(S_{m1},S_{m2},\ldots ,S_{mn_{m}})\end{cases}}} A multiple sequence alignment is taken of this set of sequences S {\displaystyle S} by inserting any amount of gaps needed into each of the S i {\displaystyle S_{i}} sequences of S {\displaystyle S} until the modified sequences, S i ′ {\displaystyle S'_{i}} , all conform to length L ≥ max { n i ∣ i = 1 , … , m } {\displaystyle L\geq \max\{n_{i}\mid i=1,\ldots ,m\}} and no values in the sequences of S {\displaystyle S} of the same column consists of only gaps. The mathematical form of an MSA of the above sequence set is shown below: S ′ := { S 1 ′ = ( S 11 ′ , S 12 ′ , … , S 1 L ′ ) S 2 ′ = ( S 21 ′ , S 22 ′ , … , S 2 L ′ ) ⋮ S m ′ = ( S m 1 ′ , S m 2 ′ , … , S m L ′ ) {\displaystyle S':={\begin{cases}S'_{1}=(S'_{11},S'_{12},\ldots ,S'_{1L})\\S'_{2}=(S'_{21},S'_{22},\ldots ,S'_{2L})\\\,\,\,\,\,\,\,\,\,\,\vdots \\S'_{m}=(S'_{m1},S'_{m2},\ldots ,S'_{mL})\end{cases}}} To return from each particular sequence S i ′ {\displaystyle S'_{i}} to S i {\displaystyle S_{i}} , remove all gaps. == Graphing approach == A general approach when calculating multiple sequence alignments is to use graphs to identify all of the different alignments. When finding alignments via graph, a complete alignment is created in a weighted graph that contains a set of vertices and a set of edges. Each of the graph edges has a weight based on a certain heuristic that helps to score each alignment or subset of the original graph. === Tracing alignments === When determining the best suited alignments for each MSA, a trace is usually generated. A trace is a set of realized, or corresponding and aligned, vertices that has a specific weight based on the edges that are selected between corresponding vertices. When choosing traces for a set of sequences it is necessary to choose a trace with a maximum weight to get the best alignment of the sequences. == Alignment methods == There are various alignment methods used within multiple sequence to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences. === Dynamic programming === A direct method for producing an MSA uses the dynamic programming technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a gap penalty and a substitution matrix assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment. For n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment. The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length. Expressed with the big O notation commonly used to measure computational complexity, a naïve MSA takes O(LengthNseqs) time to produce. To find the global optimum for n sequences this way has been shown to be an NP-complete problem. In 1989, based on Carrillo-Lipman Algorithm, Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space. In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called sum of pair score) and has been implemented in a software program for constructing multiple sequence alignments. In 2019, Hosseininasab and van Hoeve showed that by using decision diagrams, MSA may be modeled in polynomial space complexity. === Progressive alignment construction === The most widely used approach to multiple sequence alignments uses a heuristic search known as progressive technique (also known as the hierarchical or tree method) developed by Da-Fei Feng and Doolittle in 1987. Progressive alignment builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a phylogenetic tree, called a guide tree, and a second step in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial guide tree is determined by an efficient clustering method such as neighbor-joining or unweighted pair group method with arithmetic mean (UPGMA), and may use distances based on the number of identical two-letter sub-sequences (as in FASTA rather than a dynamic programming alignment). Progressive alignments are not guaranteed to be globally optimal. The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. Performance is also particularly bad when all of the sequences in the set are rather distantly related. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. This corrects for non-random selection of the sequences given to the alignment program. Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences. A popular progressive alignment method has been the Clustal family. ClustalW is used extensively for phylogenetic tree construction, in spite of the author's explicit warnings that unedited alignments should not be used in such studies and as input for protein structure prediction by homology modeling. European Bioinformatics Institute (EMBL-EBI) announced that CLustalW2 will expire in August 2015. They recommend Clustal Omega which performs based on seeded guide trees and HMM profile-profile techniques for protein alignments. An alternative tool for progressive DNA alignments is multiple alignment using fast Fourier transform (MAFFT). Another common progressive alignment method named T-Coffee is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that aligns each sequence of the pair to a third sequence. It uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate w

    Read more →