AI Data Trainer/annotator

AI Data Trainer/annotator — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Pill reminder

    Pill reminder

    A pill reminder is any device that reminds users to take their medications. Traditional pill reminders are pill containers with electric timers attached, which can be preset to sound an alarm at certain times of the day. More sophisticated pill reminders can also detect when they have been opened, so a user who was away at the scheduled time is reminded on returning. The reminder can take the form of a light, which also helps deaf or hearing-impaired users.

    == Mobile app ==

    A newer type of pill reminder is a mobile app that reminds the owner to take the medication. Some of these applications may effectively support medication adherence.

  • Dr. Sbaitso

    Dr. Sbaitso

    Dr. Sbaitso (SPAYT-soh) is an artificial intelligence speech synthesis program released late in 1991 by Creative Labs in Singapore for MS-DOS-based personal computers. The name is an acronym for "SoundBlaster Acting Intelligent Text-to-Speech Operator."

    == History ==

    Dr. Sbaitso was distributed with various sound cards manufactured by Creative Technology in the early 1990s. The text-to-speech engine is a version of Monologue, which was developed by First Byte Software. Monologue is a later release of First Byte's "SmoothTalker" software from 1984.

    The program "conversed" with the user as if it were a psychologist, though most of its responses were along the lines of "WHY DO YOU FEEL THAT WAY?" rather than any sort of complicated interaction. When confronted with a phrase it could not understand, it would often reply with something such as "THAT'S NOT MY PROBLEM." Dr. Sbaitso repeated aloud any text typed after the word "SAY." Repeated swearing or abusive behavior on the part of the user caused Dr. Sbaitso to "break down" with a "PARITY ERROR" before resetting itself. The same happened if the user typed "SAY PARITY."

    The program introduced itself with the following lines:

    HELLO [UserName], MY NAME IS DOCTOR SBAITSO. I AM HERE TO HELP YOU. SAY WHATEVER IS IN YOUR MIND FREELY, OUR CONVERSATION WILL BE KEPT IN STRICT CONFIDENCE. MEMORY CONTENTS WILL BE WIPED OFF AFTER YOU LEAVE, SO, TELL ME ABOUT YOUR PROBLEMS.

    The program was designed to showcase the digitized voices the cards were able to produce, though the quality was far from lifelike. There was also a version of the program for Microsoft Windows, delivered through a program called Prody Parrot; this version featured a more detailed graphical user interface. The text-to-speech voice was also used, slowed down, as the voice of 1st Prize in the Baldi's Basics series.

    == Commands ==

    If the user submits "HELP", a list of commands appears. If the user then submits "M", more commands appear. There are three pages of commands in total, with guidance on how to use each feature.

  • Eigenface

    Eigenface

    An eigenface (EYE-gən-) is one of a set of eigenvectors used in the computer vision problem of human face recognition. The approach of using eigenfaces for recognition was developed by Sirovich and Kirby and used by Matthew Turk and Alex Pentland in face classification. The eigenvectors are derived from the covariance matrix of the probability distribution over the high-dimensional vector space of face images. The eigenfaces themselves form a basis set of all images used to construct the covariance matrix. This produces dimension reduction by allowing the smaller set of basis images to represent the original training images. Classification can be achieved by comparing how faces are represented by the basis set.

    == History ==

    The eigenface approach began with a search for a low-dimensional representation of face images. Sirovich and Kirby showed that principal component analysis could be used on a collection of face images to form a set of basis features. These basis images, known as eigenpictures, could be linearly combined to reconstruct images in the original training set. If the training set consists of M images, principal component analysis could form a basis set of N images, where N < M. The reconstruction error is reduced by increasing the number of eigenpictures; however, the number needed is always chosen less than M. For example, if N eigenfaces are generated for a training set of M face images, each face image can be described as made up of "proportions" of all N "features" or eigenfaces: Face image 1 = (23% of E1) + (2% of E2) + (51% of E3) + ... + (1% of EN).

    In 1991 M. Turk and A. Pentland expanded these results and presented the eigenface method of face recognition. In addition to designing a system for automated face recognition using eigenfaces, they showed a way of calculating the eigenvectors of a covariance matrix such that computers of the time could perform eigen-decomposition on a large number of face images. Face images usually occupy a high-dimensional space, and conventional principal component analysis was intractable on such data sets. Turk and Pentland's paper demonstrated ways to extract the eigenvectors based on matrices sized by the number of images rather than the number of pixels.

    Once established, the eigenface method was expanded to include methods of preprocessing to improve accuracy. Multiple manifold approaches were also used to build sets of eigenfaces for different subjects and different features, such as the eyes.

    == Generation ==

    A set of eigenfaces can be generated by performing a mathematical process called principal component analysis (PCA) on a large set of images depicting different human faces. Informally, eigenfaces can be considered a set of "standardized face ingredients", derived from statistical analysis of many pictures of faces. Any human face can be considered to be a combination of these standard faces. For example, one's face might be composed of the average face plus 10% from eigenface 1, 55% from eigenface 2, and even −3% from eigenface 3. Remarkably, it does not take many eigenfaces combined together to achieve a fair approximation of most faces. Also, because a person's face is not recorded by a digital photograph, but instead as just a list of values (one value for each eigenface in the database used), much less space is taken for each person's face.

    The eigenfaces that are created will appear as light and dark areas that are arranged in a specific pattern. This pattern is how different features of a face are singled out to be evaluated and scored. There will be a pattern to evaluate symmetry, whether there is any style of facial hair, where the hairline is, or the size of the nose or mouth. Other eigenfaces have patterns that are less simple to identify, and the image of the eigenface may look very little like a face.

    The technique used in creating eigenfaces and using them for recognition is also used outside of face recognition: in handwriting recognition, lip reading, voice recognition, sign language/hand gesture interpretation and medical imaging analysis. Therefore, some do not use the term eigenface, but prefer 'eigenimage'.

    === Practical implementation ===

    To create a set of eigenfaces, one must:

    1. Prepare a training set of face images. The pictures constituting the training set should have been taken under the same lighting conditions, and must be normalized to have the eyes and mouths aligned across all images. They must also all be resampled to a common pixel resolution (r × c). Each image is treated as one vector, simply by concatenating the rows of pixels in the original image, resulting in a single column with r × c elements. For this implementation, it is assumed that all images of the training set are stored in a single matrix T, where each column of the matrix is an image.
    2. Subtract the mean. The average image a has to be calculated and then subtracted from each original image in T.
    3. Calculate the eigenvectors and eigenvalues of the covariance matrix S. Each eigenvector has the same dimensionality (number of components) as the original images, and thus can itself be seen as an image. The eigenvectors of this covariance matrix are therefore called eigenfaces. They are the directions in which the images differ from the mean image. Usually this will be a computationally expensive step (if at all possible), but the practical applicability of eigenfaces stems from the possibility to compute the eigenvectors of S efficiently, without ever computing S explicitly, as detailed below.
    4. Choose the principal components. Sort the eigenvalues in descending order and arrange the eigenvectors accordingly. The number of principal components k is determined arbitrarily by setting a threshold ε on the total variance. The total variance is $v = \lambda_1 + \lambda_2 + \dots + \lambda_n$, where n is the number of components and $\lambda_i$ is the eigenvalue of the i-th component. k is the smallest number that satisfies

    $$\frac{\lambda_1 + \lambda_2 + \dots + \lambda_k}{v} > \epsilon$$

    These eigenfaces can now be used to represent both existing and new faces: we can project a new (mean-subtracted) image on the eigenfaces and thereby record how that new face differs from the mean face. The eigenvalues associated with each eigenface represent how much the images in the training set vary from the mean image in that direction. Information is lost by projecting the image on a subset of the eigenvectors, but losses are minimized by keeping those eigenfaces with the largest eigenvalues. For instance, working with a 100 × 100 image will produce 10,000 eigenvectors. In practical applications, most faces can typically be identified using a projection on between 100 and 150 eigenfaces, so that most of the 10,000 eigenvectors can be discarded.

    === Matlab example code ===

    Here is an example of calculating eigenfaces with Extended Yale Face Database B. To avoid computational and storage bottlenecks, the face images are downsampled by a factor of 4×4=16. Note that although the covariance matrix S generates many eigenfaces, only a fraction of those are needed to represent the majority of the faces. For example, to represent 95% of the total variation of all face images, only the first 43 eigenfaces are needed.
    === Computing the eigenvectors ===

    Performing PCA directly on the covariance matrix of the images is often computationally infeasible. If small images are used, say 100 × 100 pixels, each image is a point in a 10,000-dimensional space and the covariance matrix S is a matrix of 10,000 × 10,000 = 10^8 elements. However, the rank of the covariance matrix is limited by the number of training examples: if there are N training examples, there will be at most N − 1 eigenvectors with non-zero eigenvalues. If the number of training examples is smaller than the dimensionality of the images, the principal components can be computed more easily as follows.

    Let T be the matrix of preprocessed training examples, where each column contains one mean-subtracted image. The covariance matrix can then be computed as $\mathbf{S} = \mathbf{T}\mathbf{T}^T$ and the eigenvector decomposition of S is given by

    $$\mathbf{S}\mathbf{v}_i = \mathbf{T}\mathbf{T}^T\mathbf{v}_i = \lambda_i\mathbf{v}_i$$

    However, $\mathbf{T}\mathbf{T}^T$ is a large matrix. If we instead take the eigenvalue decomposition of

    $$\mathbf{T}^T\mathbf{T}\mathbf{u}_i = \lambda_i\mathbf{u}_i$$

    then pre-multiplying both sides of the equation by T gives

    $$\mathbf{T}\mathbf{T}^T\mathbf{T}\mathbf{u}_i = \lambda_i\mathbf{T}\mathbf{u}_i$$

    This means that if $\mathbf{u}_i$ is an eigenvector of $\mathbf{T}^T\mathbf{T}$, then $\mathbf{v}_i = \mathbf{T}\mathbf{u}_i$ is an eigenvector of S. If we have
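The small-matrix route described in this section can be illustrated with a short Python sketch. It is only a toy under stated assumptions: the three 4-pixel "images", the helper names (`matvec`, `top_eigenpair`, `choose_k`) and the use of power iteration in place of a full eigensolver are all choices made for this illustration, not part of the original Matlab example. It finds the dominant eigenvector u of the 3 × 3 matrix T^T T, maps it to v = T u, and checks that v is an eigenvector of S = T T^T without ever forming S. It also shows the variance-threshold rule for choosing k.

```python
# Toy illustration: an eigenvector of S = T T^T obtained from the small matrix T^T T.
# The data and every helper name below are made up for this sketch.

def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_eigenpair(G, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of symmetric G."""
    u = [1.0] + [0.0] * (len(G) - 1)
    for _ in range(iterations):
        w = matvec(G, u)
        norm = sum(x * x for x in w) ** 0.5
        u = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(u, matvec(G, u)))  # Rayleigh quotient
    return lam, u

def choose_k(eigenvalues, eps):
    """Smallest k whose largest eigenvalues explain more than eps of the variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, lam in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += lam
        if running / total > eps:
            return k
    return len(eigenvalues)

images = [[1, 2, 3, 4], [2, 4, 1, 3], [6, 0, 2, 2]]      # 3 images, 4 pixels each
mean = [sum(px) / len(images) for px in zip(*images)]     # average image
rows = [[p - m for p, m in zip(img, mean)] for img in images]  # mean-subtracted images
T = transpose(rows)                                       # one image per column (4 x 3)

# T^T T is only 3 x 3 (number of images), however many pixels each image has.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]
lam, u = top_eigenpair(gram)
v = matvec(T, u)                                          # v = T u, one value per pixel

# Check S v = lam v without forming S, using S v = T (T^T v).
Sv = matvec(T, matvec(transpose(T), v))
residual = max(abs(s - lam * x) for s, x in zip(Sv, v))
norm_v = sum(x * x for x in v) ** 0.5
eigenface = [x / norm_v for x in v]                       # unit-length eigenface

# 21, 5 and 0 are the eigenvalues of this particular toy Gram matrix.
k = choose_k([21.0, 5.0, 0.0], 0.95)
print(round(lam, 6), residual < 1e-9, k)  # prints: 21.0 True 2
```

With three mean-subtracted images the Gram matrix has rank at most two, which matches the "at most N − 1 non-zero eigenvalues" remark above. A real implementation would use a library eigensolver to obtain all eigenpairs at once.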

  • Automatic taxonomy construction

    Automatic taxonomy construction

    Automatic taxonomy construction (ATC) is the use of software programs to generate taxonomical classifications from a body of texts called a corpus. ATC is a branch of natural language processing, which in turn is a branch of artificial intelligence.

    A taxonomy (or taxonomical classification) is a scheme of classification, especially a hierarchical classification, in which things are organized into groups or types. Among other things, a taxonomy can be used to organize and index knowledge (stored as documents, articles, videos, etc.), such as in the form of a library classification system or a search engine taxonomy, so that users can more easily find the information they are searching for. Many taxonomies are hierarchies (and thus have an intrinsic tree structure), but not all are.

    Manually developing and maintaining a taxonomy is a labor-intensive task requiring significant time and resources, including familiarity with or expertise in the taxonomy's domain (scope, subject, or field), which drives up the costs and limits the scope of such projects. Also, domain modelers have their own points of view, which inevitably, even if unintentionally, work their way into the taxonomy. ATC uses artificial intelligence techniques to generate a taxonomy for a domain quickly and automatically, in order to avoid these problems and remove these limitations.

    == Approaches ==

    There are several approaches to ATC. One approach is to use rules to detect patterns in the corpus and use those patterns to infer relations such as hyponymy. Other approaches use machine learning techniques such as Bayesian inferencing and artificial neural networks.

    === Keyword extraction ===

    One approach to building a taxonomy is to automatically gather the keywords from a domain using keyword extraction, then analyze the relationships between them (see Hyponymy, below), and then arrange them as a taxonomy based on those relationships.
    === Hyponymy and "is-a" relations ===

    In ATC programs, one of the most important tasks is the discovery of hypernym and hyponym relations among words. One way to do that from a body of text is to search for certain phrases like "is a" and "such as".

    In linguistics, is-a relations are called hyponymy. Words that describe categories are called hypernyms and words that are examples of categories are hyponyms. For example, dog is a hypernym and Fido is one of its hyponyms. A word can be both a hyponym and a hypernym. So, dog is a hyponym of mammal and also a hypernym of Fido.

    Taxonomies are often represented as is-a hierarchies where each level is more specific than (in mathematical language, "a subset of") the level above it. For example, a basic biology taxonomy would have concepts such as mammal, which is a subset of animal, and dogs and cats, which are subsets of mammal. This kind of taxonomy is called an is-a model because the specific objects are considered instances of a concept. For example, Fido is an instance of the concept dog and Fluffy is a cat.

    == Applications ==

    ATC can be used to build taxonomies for search engines, to improve search results. ATC systems are a key component of ontology learning (also known as automatic ontology construction), and have been used to automatically generate large ontologies for domains such as insurance and finance. They have also been used to enhance existing large networks such as WordNet to make them more complete and consistent.
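The phrase-search idea from the hyponymy section ("is a", "such as") can be sketched with two regular expressions. This is a minimal illustration, not how any particular ATC system works: the patterns, the function names `extract_is_a_pairs` and `build_taxonomy`, and the sample sentences are assumptions made for the example, and the patterns would produce many false matches on real text (for instance "this is a test").

```python
import re

# Two lexical patterns named in the text: "X is a Y" and "Y such as X1, X2 and X3".
IS_A = re.compile(r"\b(\w+) is an? (\w+)")
SUCH_AS = re.compile(r"\b(\w+) such as (\w+(?:(?:, and |, | and )\w+)*)")

def extract_is_a_pairs(text):
    """Return a set of (hyponym, hypernym) pairs found by the two patterns."""
    text = text.lower()
    pairs = set()
    for hyponym, hypernym in IS_A.findall(text):
        pairs.add((hyponym, hypernym))
    for hypernym, examples in SUCH_AS.findall(text):
        for hyponym in re.split(r", and |, | and ", examples):
            pairs.add((hyponym, hypernym))
    return pairs

def build_taxonomy(pairs):
    """Group hyponyms under each hypernym (one level of an is-a hierarchy)."""
    taxonomy = {}
    for hyponym, hypernym in sorted(pairs):
        taxonomy.setdefault(hypernym, []).append(hyponym)
    return taxonomy

corpus = "Fido is a dog. A dog is a mammal. A cat is a mammal. Fluffy is a cat."
pairs = extract_is_a_pairs(corpus)
taxonomy = build_taxonomy(pairs)
print(taxonomy)
# {'mammal': ['cat', 'dog'], 'dog': ['fido'], 'cat': ['fluffy']}
```

Here dog appears both as a hyponym (of mammal) and as a hypernym (of fido), as in the example above. A fuller system would also need to normalize plurals and filter out spurious matches.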
    == Other names ==

    Other names for automatic taxonomy construction include: Automated outline building, Automated outline construction, Automated outline creation, Automated outline extraction, Automated outline generation, Automated outline induction, Automated outline learning, Automated outlining, Automated taxonomy building, Automated taxonomy construction, Automated taxonomy creation, Automated taxonomy extraction, Automated taxonomy generation, Automated taxonomy induction, Automated taxonomy learning, Automatic outline building, Automatic outline construction, Automatic outline creation, Automatic outline extraction, Automatic outline generation, Automatic outline induction, Automatic outline learning, Automatic taxonomy building, Automatic taxonomy creation, Automatic taxonomy extraction, Automatic taxonomy generation, Automatic taxonomy induction, Automatic taxonomy learning, Outline automation, Outline building, Outline construction, Outline creation, Outline extraction, Outline generation, Outline induction, Outline learning, Semantic taxonomy building, Semantic taxonomy construction, Semantic taxonomy creation, Semantic taxonomy extraction, Semantic taxonomy generation, Semantic taxonomy induction, Semantic taxonomy learning, Taxonomy automation, Taxonomy building, Taxonomy construction, Taxonomy creation, Taxonomy extraction, Taxonomy generation, Taxonomy induction, Taxonomy learning.

  • Load file

    Load file

    A load file, in the litigation community, is the file used to import data (coded, captured or extracted data from ESI processing) into a database, or the file used to link images. Load files carry commands that instruct the software to carry out certain functions with the data they contain. Load files are usually ASCII text files with delimited fields of information. A load file may hold data about documents to be imported into document management software such as Concordance or Summation, or the path or directory where images reside, so that the software can link the images to their corresponding records. Some database programs take one load file for importing images and another for importing data, while others take a single load file for both. OCR or searchable text, which is considered "data", is also imported into most database programs via the same load files, though some people prefer to load the OCR into their databases by running a separate command to search and find the desired text. Commonly used databases and their corresponding file extensions are: Summation (DII, CSV), Concordance (OPT, DAT), Sanction (SDT), IPRO (LFP), Ringtail (MDB) and DB/TextWorks (TXT).
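To make the idea of a delimited load file concrete, here is a minimal Python sketch that reads one and links each record to its image path. The layout is hypothetical: the pipe delimiter, the field names (`DOCID`, `CUSTODIAN`, `IMAGE_PATH`) and the sample rows are invented for illustration and do not reproduce the format of Concordance, Summation or any other product named above.

```python
import csv
import io

# Hypothetical pipe-delimited load file: a header row, then one record per document.
SAMPLE = (
    "DOCID|CUSTODIAN|IMAGE_PATH\n"
    "DOC0001|Smith|images/001/DOC0001.tif\n"
    "DOC0002|Jones|images/001/DOC0002.tif\n"
)

def read_load_file(text, delimiter="|"):
    """Parse delimited load-file text into a list of field-name -> value records."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

records = read_load_file(SAMPLE)

# Link each document record to the image it should display.
image_links = {record["DOCID"]: record["IMAGE_PATH"] for record in records}
print(image_links["DOC0002"])  # images/001/DOC0002.tif
```

A program that expects separate data and image load files would run the same kind of parse twice, once per file, and join the results on the document identifier.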

  • XLNet

    XLNet

    XLNet was an autoregressive Transformer designed as an improvement over BERT, with 340M parameters and trained on 33 billion words. It was released on 19 June 2019 under the Apache 2.0 license. It achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference.

    == Architecture ==

    The main idea of XLNet is to model language autoregressively like the GPT models, but allow for all possible permutations of a sentence. Concretely, consider the following sentence:

    My dog is cute.

    In standard autoregressive language modeling, the model would be tasked with predicting the probability of each word, conditioned on the previous words as its context. We factorize the joint probability of a sequence of words $x_1, \ldots, x_T$ using the chain rule:

    $$\Pr(x_1, \ldots, x_T) = \Pr(x_1)\Pr(x_2 | x_1)\Pr(x_3 | x_1, x_2) \ldots \Pr(x_T | x_1, \ldots, x_{T-1}).$$

    For example, the sentence "My dog is cute" is factorized as:

    $$\Pr(\text{My}, \text{dog}, \text{is}, \text{cute}) = \Pr(\text{My})\Pr(\text{dog} | \text{My})\Pr(\text{is} | \text{My}, \text{dog})\Pr(\text{cute} | \text{My}, \text{dog}, \text{is}).$$

    Schematically, we can write it as

    $$\texttt{<MASK>}\texttt{<MASK>}\texttt{<MASK>}\texttt{<MASK>} \to \text{My }\texttt{<MASK>}\texttt{<MASK>}\texttt{<MASK>} \to \text{My dog }\texttt{<MASK>}\texttt{<MASK>} \to \text{My dog is }\texttt{<MASK>} \to \text{My dog is cute}.$$

    However, for XLNet, the model is required to predict the words in a randomly generated order. Suppose we have sampled a randomly generated order 3241. Then, schematically, the model is required to perform the following prediction task:

    $$\texttt{<MASK>}\texttt{<MASK>}\texttt{<MASK>}\texttt{<MASK>} \to \texttt{<MASK>}\texttt{<MASK>}\text{is }\texttt{<MASK>} \to \texttt{<MASK>}\text{dog is }\texttt{<MASK>} \to \texttt{<MASK>}\text{dog is cute} \to \text{My dog is cute}$$

    By considering all permutations, XLNet is able to capture longer-range dependencies and better model the bidirectional context of words.

    === Two-Stream Self-Attention ===

    To implement permutation language modeling, XLNet uses a two-stream self-attention mechanism. The two streams are:

      • Content stream: This stream encodes the content of each word, as in standard causally masked self-attention.
      • Query stream: This stream encodes the content of each word in the context of what has gone before. In more detail, it is a masked cross-attention mechanism, where the queries are from the query stream, and the key-value pairs are from the content stream.

    The content stream uses the causal mask

    $$M_{\text{causal}} = \begin{bmatrix} 0 & -\infty & -\infty & \dots & -\infty \\ 0 & 0 & -\infty & \dots & -\infty \\ 0 & 0 & 0 & \dots & -\infty \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 0 \end{bmatrix}$$

    permuted by a random permutation matrix to $PM_{\text{causal}}P^{-1}$.

    The query stream uses the cross-attention mask $P(M_{\text{causal}} - \infty I)P^{-1}$, where the diagonal is subtracted away specifically to avoid the model "cheating" by looking at the content stream for what the current masked token is.

    Like the causal masking for GPT models, this two-stream masked architecture allows the model to train on all tokens in one forward pass.
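The two masks can be made concrete with a small Python sketch for the order 3241 and the sentence "My dog is cute". It is an illustration under assumptions, not XLNet's actual implementation: the function names are invented here, and the masks are built directly from each token's rank in the prediction order rather than by multiplying with a permutation matrix, which should give the same visibility pattern.

```python
NEG_INF = float("-inf")

def permutation_masks(order):
    """Build additive attention masks for one factorization order.

    order lists token positions (0-based) in the sequence they are predicted.
    mask[i][j] is added to the score of position i attending to position j:
    0.0 means visible, -inf means blocked.
    """
    n = len(order)
    rank = [0] * n
    for step, position in enumerate(order):
        rank[position] = step
    # Content stream: a token sees itself and everything predicted before it.
    content = [[0.0 if rank[j] <= rank[i] else NEG_INF for j in range(n)]
               for i in range(n)]
    # Query stream: the same, minus the diagonal, so a token cannot see itself.
    query = [[0.0 if rank[j] < rank[i] else NEG_INF for j in range(n)]
             for i in range(n)]
    return content, query

def visible_words(mask_row, words):
    """Words a position may attend to under one row of a mask."""
    return [w for w, m in zip(words, mask_row) if m == 0.0]

words = ["My", "dog", "is", "cute"]
order = [2, 1, 3, 0]  # the order "3241" from the text, as 0-based positions
content, query = permutation_masks(order)
for position in order:
    print(words[position], "<-", visible_words(query[position], words))
# is <- []
# dog <- ['is']
# cute <- ['dog', 'is']
# My <- ['dog', 'is', 'cute']
```

With the identity order `[0, 1, 2, 3]` the content mask reduces to the ordinary lower-triangular causal mask, and the query mask to the same matrix with the diagonal blocked.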
    == Training ==

    Two models were released:

      • XLNet-Large, cased: 340M parameters, 24-layer, 1024-hidden, 16-heads
      • XLNet-Base, cased: 110M parameters, 12-layer, 768-hidden, 12-heads

    It was trained on a dataset that amounted to 32.89 billion tokens after tokenization with SentencePiece. The dataset was composed of BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl. It was trained on 512 TPU v3 chips for 5.5 days. At the end of training, it still under-fitted the data, meaning it could have achieved lower loss with more training. Training took 0.5 million steps with an Adam optimizer, linear learning rate decay, and a batch size of 8192.

  • History of natural language processing

    History of natural language processing

    The history of natural language processing describes the advances of natural language processing. There is some overlap with the history of machine translation, the history of speech recognition, and the history of artificial intelligence.

    == Early history ==

    The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine.

    The first patents for "translating machines" were applied for in the mid-1930s. One proposal, by Georges Artsrouni, was simply an automatic bilingual dictionary using paper tape. The other proposal, by Peter Troyanskii, a Russian, was more detailed. Troyanskii's proposal included both the bilingual dictionary and a method for dealing with grammatical roles between languages, based on Esperanto.

    == Logical period ==

    In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human.

    In 1957, Noam Chomsky's Syntactic Structures revolutionized linguistics with 'universal grammar', a rule-based system of syntactic structures.

    The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

    One notably successful NLP system developed in the 1960s was SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies.

    In 1969 Roger Schank introduced the conceptual dependency theory for natural language understanding. This model, partially influenced by the work of Sydney Lamb, was extensively used by Schank's students at Yale University, such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner.

    In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input. Instead of phrase structure rules, ATNs used an equivalent set of finite-state automata that were called recursively. ATNs and their more general format, called "generalized ATNs", continued to be used for a number of years. During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.

    == Statistical period ==

    Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's law and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

    === Datasets ===

    The emergence of statistical approaches was aided by both the increase in computing power and the availability of large datasets. At that time, large multilingual corpora were starting to emerge. Notably, some were produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government.

    Many of the notable early successes occurred in the field of machine translation. In 1993, the IBM alignment models were used for statistical machine translation. Compared to previous machine translation systems, which were symbolic systems manually coded by computational linguists, these systems were statistical, which allowed them to automatically learn from large textual corpora. These systems do not work well when only small corpora are available, however, so data-efficient methods continue to be an area of research and development.

    In 2001, a one-billion-word text corpus scraped from the Internet, referred to as "very very large" at the time, was used for word disambiguation.

    To take advantage of large, unlabelled datasets, algorithms were developed for unsupervised and self-supervised learning. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.

    == Neural period ==

    Neural language models were developed in the 1990s. In 1990, the Elman network, using a recurrent neural network, encoded each word in a training set as a vector, called a word embedding, and the whole vocabulary as a vector database, allowing it to perform such tasks as sequence predictions that are beyond the power of a simple multilayer perceptron. A shortcoming of the static embeddings was that they did not differentiate between multiple meanings of homonyms.

    Yoshua Bengio developed the first neural probabilistic language model in 2000. Novel algorithms, the availability of larger datasets and higher processing power made it possible to train larger and larger language models.

    The attention mechanism was introduced by Bahdanau et al. in 2014. This work laid the foundations for the famous "Attention Is All You Need" paper that introduced the Transformer architecture in 2017. The concept of the large language model (LLM) emerged in the late 2010s. An LLM is a language model trained with self-supervised learning on a vast amount of text. The earliest public LLMs had hundreds of millions of parameters, but this number quickly rose to billions and even trillions.

    In recent years, advancements in deep learning and large language models have significantly enhanced the capabilities of natural language processing, leading to widespread applications in areas such as healthcare, customer service, and content generation.

  • LanguageWare

    LanguageWare

    LanguageWare is a natural language processing (NLP) technology developed by IBM, which allows applications to process natural language text. It comprises a set of Java libraries that provide a range of NLP functions: language identification, text segmentation/tokenization, normalization, entity and relationship extraction, and semantic analysis and disambiguation. The analysis engine uses a finite-state machine approach at multiple levels, which aids its performance characteristics while maintaining a reasonably small footprint. The behaviour of the system is driven by a set of configurable lexico-semantic resources which describe the characteristics and domain of the processed language. A default set of resources comes as part of LanguageWare and these describe the native language characteristics, such as morphology, and the basic vocabulary for the language. Supplemental resources have been created that capture additional vocabularies, terminologies, rules and grammars, which may be generic to the language or specific to one or more domains. A set of Eclipse-based customization tooling, LanguageWare Resource Workbench, is available on IBM's alphaWorks site, and allows domain knowledge to be compiled into these resources and thereby incorporated into the analysis process. LanguageWare can be deployed as a set of UIMA-compliant annotators, Eclipse plug-ins or Web Services.

  • MSpy

    MSpy

    mSpy is a brand of mobile and computer parental control monitoring software for iOS, Android, Windows, and macOS. The app monitors and logs user activity on the client device and sends the data to a personalized dashboard. Data that users can monitor includes text messages, calls, GPS locations, social media chats, and more. It is owned by Virtuoso Holding. == History == mSpy was launched as a product for mobile monitoring by Altercon Group in 2010. In 2012, the application was extended so that parents could monitor not only smartphones but also computers running Windows and macOS. In 2013, mSpy won the TopTenReviews award for cell phone monitoring software. By 2014, the business had grown nearly 400%, and the app's user numbers exceeded 1 million. In 2015, mSpy received the Parents Tested Parents Approved (PTPA) Winner’s Seal of Approval in the United States. In 2015 and 2018, mSpy was the victim of data breaches that exposed user data. In 2016, mLite, a light version of mSpy, became available from Google Play. The same year, it was awarded the kidSAFE Certified Seal in the United States. In 2017, mSpy collaborated with YouTuber and journalist Coby Persin to conduct a social experiment on the dangers of social media and online predators. In the experiment, conducted with parental consent, Persin befriended three children (aged 12, 13, and 14) via Snapchat and then invited them to meet in person. Each of the participants agreed to the meeting and arrived at the designated location. The video of the experiment received widespread attention and helped to raise awareness about the importance of online security and parental controls. In early 2021, mSpy released a new feature, Screenrecorder, which allows parents to take screenshots of a child's screen while the child is using certain apps. In 2024, mSpy's Zendesk was compromised by an unknown threat actor, revealing the company's customer list. As of 2025, mSpy is compatible with Android, iPhone, and iPad devices. 
It provides access to various types of data stored on the device, including contact information, calendar entries, emails, SMS messages, browser history, photos, videos, and installed applications. Functions also include GPS tracking, geofencing, and keyword alerts. == Reception == It has been noted that since mSpy runs inconspicuously, there is a risk of the software being used illegally. mSpy was called "terrifying" by The Next Web and was featured in NPR coverage of spyware used against victims of stalking and other domestic violence. In response, mSpy released security updates aimed at reducing the risk of misuse and stated that it "uses encryption protocols to protect user data and that access is restricted to the account holder". In May 2015, Brian Krebs reported that mSpy had been hacked, leaking personal data for hundreds of thousands of users of devices with mSpy installed. mSpy claimed that there was no data leak and that it was instead the victim of blackmailers. In September 2018, Krebs claimed and demonstrated that anyone could easily gain access to the mSpy database containing data for millions of users. The company responded by stating that the exposed data consisted primarily of error logs and incorrect login attempts. Following the incident, mSpy implemented new security measures, changed encryption keys, and reset passwords for affected accounts. A 2024 Sky News story characterised mSpy as "stalkerware". Leaked customer support messages from mSpy reveal misuse of its app for illegally monitoring partners and children.

  • Brave Leo

    Brave Leo

    Brave Leo is a large language model-based chatbot developed by Brave Software and included with the Brave browser. == History == In November 2023, the company said versions for iOS and Android would be available "in the coming months". == Features == Since January 2024, Leo has used the open-source Mixtral 8x7B from Mistral AI as its default large language model, in addition to LLaMA 2 from Meta Platforms and Claude from Anthropic, both of which have been used previously. Leo can suggest follow-up questions, and summarize webpages, PDFs, and videos. Leo has a $15 (US) per month premium version that enables more requests and uses larger LLMs. == Privacy == The answers given by Leo are not saved. Brave uses the slogan Love Privacy to emphasize its focus on user privacy and data protection. The phrase has been featured in Brave's official marketing campaigns and has been cited in media coverage of the browser's privacy-first approach. == Controversies == In 2023, PC World reported that Leo evades questions about US elections.

  • News analytics

    News analytics

    In trading strategy, news analysis refers to the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers and metadata permits the manipulation of everyday information in a mathematical and statistical way. This data is often used in financial markets as part of a trading strategy or by businesses to judge market sentiment and make better business decisions. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, "bag of words" among other techniques. == Applications and strategies == The application of sophisticated linguistic analysis to news and social media has grown from an area of research to mature product solutions since 2007. News analytics and news sentiment calculations are now routinely used by both buy-side and sell-side in alpha generation, trading execution, risk management, and market surveillance and compliance. There is however a good deal of variation in the quality, effectiveness and completeness of currently available solutions. A large number of companies use news analysis to help them make better business decisions. Academic researchers have become interested in news analysis especially with regards to predicting stock price movements, volatility and traded volume. Provided a set of values such as sentiment and relevance as well as the frequency of news arrivals, it is possible to construct news sentiment scores for multiple asset classes such as equities, Forex, fixed income, and commodities. 
Sentiment scores can be constructed at various horizons to meet the different needs and objectives of high and low frequency trading strategies, whilst characteristics such as direction and volatility of asset returns as well as the traded volume may be addressed more directly via the construction of tailor-made sentiment scores. Scores are generally constructed as a range of values. For instance, values may range between 0 and 100, where values above and below 50 convey positive and negative sentiment, respectively. === Absolute return strategies === The objective of absolute return strategies is absolute (positive) returns regardless of the direction of the financial market. To meet this objective, such strategies typically involve opportunistic long and short positions in selected instruments with zero or limited market exposure. In statistical terms, absolute return strategies should have very low correlation with the market return. Typically, hedge funds tend to employ absolute return strategies. Below, a few examples show how news analysis can be applied in the absolute return strategy space to identify alpha opportunities, either by applying a market neutral strategy or through volatility trading. Example 1 Scenario: The gap between the news sentiment scores for direction, S, of Company X and Market Y has moved beyond +20. That is, $S_X - S_Y \geq 20$. Action: Buy the stock of Company X and short the future on Market Y. Exit Strategy: When the gap in the news sentiment scores for direction of Company X and Market Y has disappeared, $S_X - S_Y = 0$, sell the stock of Company X and go long the future on Market Y to close the positions. 
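The entry and exit rule of Example 1 can be sketched as a small state machine. This is only an illustration under stated assumptions: the function and class names are hypothetical, and the exit test uses "gap at or below zero" rather than exact equality, on the assumption that discretely sampled scores may jump past zero without ever equalling it.

```python
from dataclasses import dataclass


@dataclass
class SpreadState:
    """Tracks whether the long-X / short-Y pair is currently open."""
    in_position: bool = False


def spread_signal(s_x: float, s_y: float, state: SpreadState,
                  entry_gap: float = 20.0, exit_gap: float = 0.0) -> str:
    """Return 'enter', 'exit' or 'hold' for the market-neutral pair.

    s_x and s_y are direction sentiment scores on the 0-100 scale.
    entry_gap=20 follows Example 1; exit_gap=0 with a '<=' test is an
    assumption made so that a gap that skips past zero still closes the trade.
    """
    gap = s_x - s_y
    if not state.in_position and gap >= entry_gap:
        state.in_position = True
        return "enter"  # buy stock X, short the future on market Y
    if state.in_position and gap <= exit_gap:
        state.in_position = False
        return "exit"   # sell stock X, buy back the future on market Y
    return "hold"


# Hypothetical score history: (score for Company X, score for Market Y)
history = [(60, 50), (75, 52), (70, 60), (55, 55), (50, 58)]
state = SpreadState()
signals = [spread_signal(x, y, state) for x, y in history]
print(signals)
```

Keeping the position flag outside the function makes the rule easy to test against a recorded score history before any order logic is attached.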
Example 2 Scenario: The news sentiment score for volatility of Company X goes above 70 out of 100, indicating an expected volatility above the option implied volatility. Action: Buy a short-dated straddle (the purchase of both a put and a call) on the stock of Company X. Exit Strategy: Keep the straddle on Company X until expiry or until a certain profit target has been reached. === Relative return strategies === The objective of relative return strategies is to either replicate (passive management) or outperform (active management) a theoretical passive reference portfolio or benchmark. To meet these objectives, such strategies typically involve long positions in selected instruments. In statistical terms, relative return strategies often have high correlation with the market return. Typically, mutual funds tend to employ relative return strategies. Below, a few examples show how news analysis can be applied in the relative return strategy space to outperform the market, either by applying a stock picking strategy or by making tactical tilts to one's asset allocation model. Example 1 Scenario: The news sentiment score for direction of Company X goes above 70 out of 100. Action: Buy the stock of Company X. Exit Strategy: When the news sentiment score for direction of Company X falls below 60, sell the stock of Company X to close the position. Example 2 Scenario: The news sentiment score for direction of Sector Z goes above 70 out of 100. Action: Include Sector Z as a tactical bet in the asset allocation model. 
Exit Strategy: When the news sentiment score for direction of Sector Z falls below 60, remove the tactical bet for Sector Z from the asset allocation model. === Financial risk management === The objective of financial risk management is to create economic value in a firm or to maintain a certain risk profile of an investment portfolio by using financial instruments to manage risk exposures, particularly credit risk and market risk. Other types include foreign exchange, shape, volatility, sector, liquidity, and inflation risks. Below, a few examples show how news analysis can be applied in the financial risk management space, either to arrive at better risk estimates in terms of Value at Risk (VaR) or to manage the risk of a portfolio to meet one's portfolio mandate. Example 1 Scenario: The bank operates a VaR model to manage the overall market risk of its portfolio. Action: Estimate the portfolio covariance matrix taking into account the development of the news sentiment score for volume. Implement the relevant hedges to bring the VaR of the bank in line with the desired levels. Example 2 Scenario: A portfolio manager steers a portfolio towards a certain desired risk profile. Action: Estimate the portfolio covariance matrix taking into account the development of the news sentiment score for volume. Scale the portfolio exposure according to the targeted risk profile. === Computer algorithms using news analytics === Within 0.33 seconds, computer algorithms using news analytics can notify subscribers which company the news is about, whether the news article sentiment is positive or negative, and whether the news is ranked as high or low relative importance … relative relevance. The stock price reaction and the increase in trade volume are concentrated in the first 5 seconds after a news article is released. 
=== Algorithmic order execution === The objective of algorithmic order execution, which is part of the concept of algorithmic trading, is to reduce trading costs by optimizing the timing of a given order. It is widely used by hedge funds, pension funds, mutual funds, and other institutional traders to divide up large trades into several smaller trades to manage market impact, opportunity cost, and risk more effectively. The example below shows how news analysis can be applied in the algorithmic order execution space to arrive at more efficient algorithmic trading systems. Example 1 Scenario: A large order needs to be placed in the market for the stock of Company X. Action: Scale the daily volume distribution for Company X applied in the algorithmic trading system, thus taking into account the news sentiment score for volume. This is followed by the creation of the desired trading distribution, forcing greater market participation during the periods of the day when volume is expected to be heaviest. == Effects == Being able to express news stories as numbers permits the manipulation of everyday information in a statistical way that allows computers not only to make decisions once made only by humans, but to do so more efficiently. Since market participants are always looking for an edge, the speed of computer connections and the delivery of news analysis, measured in milliseconds, have become essential.
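The volume-scaling step in the algorithmic order execution example can be sketched as follows. The text does not say how a sentiment score for volume should alter the baseline intraday volume distribution, so the mapping used here (multiply each bucket's baseline share by `score / 50`, treating 50 as neutral) is purely an assumption for illustration, as are the function and parameter names.

```python
def sentiment_scaled_schedule(order_size, baseline_profile, volume_scores,
                              neutral=50.0):
    """Split an order across intraday buckets in proportion to expected volume.

    baseline_profile: historical share of daily volume per bucket.
    volume_scores:    news sentiment score for volume per bucket (0-100).
    The scaling rule share * (score / neutral) is an illustrative assumption.
    """
    if len(baseline_profile) != len(volume_scores):
        raise ValueError("profile and scores must have the same length")

    # Scale each bucket's baseline share by its sentiment score for volume.
    scaled = [share * (score / neutral)
              for share, score in zip(baseline_profile, volume_scores)]
    total = sum(scaled)
    if total <= 0:
        raise ValueError("scaled volume profile must have positive mass")
    weights = [s / total for s in scaled]

    # Cumulative rounding so the integer slices add up to exactly order_size.
    slices, allocated, cumulative = [], 0, 0.0
    for i, w in enumerate(weights):
        cumulative += w
        last = i == len(weights) - 1
        target = order_size if last else min(order_size,
                                             round(order_size * cumulative))
        slices.append(target - allocated)
        allocated = target
    return slices


# Hypothetical day split into three buckets; news raises expected
# volume in the middle bucket only.
schedule = sentiment_scaled_schedule(
    order_size=1000,
    baseline_profile=[0.4, 0.2, 0.4],
    volume_scores=[50, 100, 50],
)
print(schedule, sum(schedule))
```

The middle bucket receives more than its 20% baseline share, which is the "greater market participation when volume is expected to be heaviest" behaviour the example describes.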

  • SemEval

    SemEval

    SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive. This series of evaluations provides a mechanism to characterize in more precise terms exactly what is necessary to compute in meaning. As such, the evaluations provide an emergent mechanism to identify the problems and solutions for computations with meaning. These exercises have evolved to articulate more of the dimensions that are involved in our use of language. They began with apparently simple attempts to identify word senses computationally. They have evolved to investigate the interrelationships among the elements in a sentence (e.g., semantic role labeling), relations between sentences (e.g., coreference), and the nature of what we are saying (semantic relations and sentiment analysis). The purpose of the SemEval and Senseval exercises is to evaluate semantic analysis systems. "Semantic Analysis" refers to a formal analysis of meaning, and "computational" refer to approaches that in principle support effective implementation. The first three evaluations, Senseval-1 through Senseval-3, were focused on word sense disambiguation (WSD), each time growing in the number of languages offered in the tasks and in the number of participating teams. Beginning with the fourth workshop, SemEval-2007 (SemEval-1), the nature of the tasks evolved to include semantic analysis tasks outside of word sense disambiguation. Triggered by the conception of the SEM conference, the SemEval community had decided to hold the evaluation workshops yearly in association with the SEM conference. It was also the decision that not every evaluation task will be run every year, e.g. 
none of the WSD tasks were included in the SemEval-2012 workshop. == History == === Early evaluation of algorithms for word sense disambiguation === From the earliest days, assessing the quality of word sense disambiguation algorithms had been primarily a matter of intrinsic evaluation, and “almost no attempts had been made to evaluate embedded WSD components”. Only very recently (2006) had extrinsic evaluations begun to provide some evidence for the value of WSD in end-user applications. Until 1990 or so, discussions of the sense disambiguation task focused mainly on illustrative examples rather than comprehensive evaluation. The early 1990s saw the beginnings of more systematic and rigorous intrinsic evaluations, including more formal experimentation on small sets of ambiguous words. === Senseval to SemEval === In April 1997, Martha Palmer and Marc Light organized a workshop entitled Tagging with Lexical Semantics: Why, What, and How? in conjunction with the Conference on Applied Natural Language Processing. At the time, there was a clear recognition that manually annotated corpora had revolutionized other areas of NLP, such as part-of-speech tagging and parsing, and that corpus-driven approaches had the potential to revolutionize automatic semantic analysis as well. Kilgarriff recalled that there was "a high degree of consensus that the field needed evaluation", and several practical proposals by Resnik and Yarowsky kicked off a discussion that led to the creation of the Senseval evaluation exercises. === SemEval's 3, 2 or 1 year(s) cycle === After SemEval-2010, many participants felt that the 3-year cycle was a long wait. Many other shared tasks, such as the Conference on Natural Language Learning (CoNLL) and Recognizing Textual Entailment (RTE), run annually. For this reason, the SemEval coordinators gave task organizers the opportunity to choose between a 2-year and a 3-year cycle. The SemEval community favored the 3-year cycle. 
Although the votes within the SemEval community favored a 3-year cycle, organizers and coordinators decided to split the SemEval task into two evaluation workshops. This was triggered by the introduction of the new SEM conference. The SemEval organizers thought it would be appropriate to associate the event with the SEM conference and co-locate the SemEval workshop with it. The organizers received very positive responses (from the task coordinators/organizers and participants) about the association with the yearly SEM, and 8 tasks were willing to switch to 2012. Thus SemEval-2012 and SemEval-2013 were born. The current plan is to switch to a yearly SemEval schedule to associate it with the SEM conference, but not every task needs to run every year. ==== List of Senseval and SemEval Workshops ==== Senseval-1 took place in the summer of 1998 for English, French, and Italian, culminating in a workshop held at Herstmonceux Castle, Sussex, England on September 2–4. Senseval-2 took place in the summer of 2001, and was followed by a workshop held in July 2001 in Toulouse, in conjunction with ACL 2001. Senseval-2 included tasks for Basque, Chinese, Czech, Danish, Dutch, English, Estonian, Italian, Japanese, Korean, Spanish and Swedish. Senseval-3 took place in March–April 2004, followed by a workshop held in July 2004 in Barcelona, in conjunction with ACL 2004. Senseval-3 included 14 different tasks for core word sense disambiguation, as well as identification of semantic roles, multilingual annotations, logic forms, and subcategorization acquisition. SemEval-2007 (Senseval-4) took place in 2007, followed by a workshop held in conjunction with ACL in Prague. SemEval-2007 included 18 different tasks targeting the evaluation of systems for the semantic analysis of text. A special issue of Language Resources and Evaluation is devoted to the results. SemEval-2010 took place in 2010, followed by a workshop held in conjunction with ACL in Uppsala. 
SemEval-2010 included 18 different tasks targeting the evaluation of semantic analysis systems. SemEval-2012 took place in 2012; it was associated with the new SEM, the First Joint Conference on Lexical and Computational Semantics, and co-located with NAACL in Montreal, Canada. SemEval-2012 included 8 different tasks aimed at evaluating computational semantic systems. However, SemEval-2012 included no WSD task; the WSD-related tasks were scheduled for the upcoming SemEval-2013. SemEval-2013 took place in 2013 and was associated with NAACL 2013 (North American Chapter of the Association for Computational Linguistics), held in Georgia, USA. It included 13 different tasks aimed at evaluating computational semantic systems. SemEval-2014 took place in 2014. It was co-located with COLING 2014, the 25th International Conference on Computational Linguistics, and SEM 2014, the Second Joint Conference on Lexical and Computational Semantics, in Dublin, Ireland. There were 10 different tasks in SemEval-2014 evaluating various computational semantic systems. SemEval-2015 took place in 2015. It was co-located with NAACL-HLT 2015, the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, and SEM 2015, the Third Joint Conference on Lexical and Computational Semantics, in Denver, USA. There were 17 different tasks in SemEval-2015 evaluating various computational semantic systems. == SemEval Workshop framework == The framework of the SemEval/Senseval evaluation workshops emulates the Message Understanding Conferences (MUCs) and other evaluation workshops run by ARPA (Advanced Research Projects Agency, later renamed the Defense Advanced Research Projects Agency (DARPA)). Stages of SemEval/Senseval evaluation workshops: First, all likely participants were invited to express their interest and participate in the exercise design. A timetable towards a final workshop was worked out. A plan for selecting evaluation materials was agreed. 
'Gold standards' for the individual tasks were acquired; often, human annotations were treated as the gold standard against which the precision and recall of computer systems were measured. These 'gold standards' are what the computational systems strive towards. In WSD tasks, human annotators were given the task of generating a set of correct WSD answers (i.e. the correct sense for a given word in a given context). The gold standard materials, without answers, were released to participants, who then had a short time to run their programs over them and return their sets of answers to the organizers. The organizers then scored the answers, and the scores were announced and discussed at a workshop. == Semantic evaluation tasks == Senseval-1 and Senseval-2 focused on evaluating WSD systems on major languages for which corpora and computerized dictionaries were available. Senseval-3 looked beyond the lexemes and started to evaluate systems that looked into wider areas of semantics, such as Semantic Roles (technically known as Theta roles in formal semantics), Logic Form Transformation (commonly semantics of phrases, clauses or sentences were represented

  • Sensory, Inc.

    Sensory, Inc.

    Sensory, Inc. is an American company which develops software AI technologies for speech, sound and vision. It is based in Santa Clara, California. Sensory’s technologies have shipped in over three billion products from hundreds of leading consumer electronics manufacturers including AT&T, Hasbro, Huawei, Google, Amazon, Samsung, LG, Mattel, Motorola, Plantronics, GoPro, Sony, Tencent, Garmin, Microsoft, Lenovo, and more. Sensory has over 60 issued patents covering speech recognition in consumer electronics, biometric authentication, sensor/speech combinations, wake word technology, and more. == History == Sensory, Inc. was founded in 1994, originally as Sensory Circuits, by Forrest Mozer, Mike Mozer and Todd Mozer. The three had also co-founded ESS Technology years earlier. In 1999 Sensory acquired Fluent Speech Technologies, which was formed and started by a group of professors out of the Oregon Graduate Institute (formerly OGI, now OHSU). Fluent Speech Technologies developed high performance embedded speech engines; the technology from this acquisition is now the core technology used throughout Sensory's chip and software line. === Company timeline === 1994 – Founded. 1995 – Introduces the RSC 164, the first commercially successful speech recognition IC. 1998 – Introduces first speaker verification IC. 2000 – Acquires Oregon-based Fluent Speech Technologies. 2002 – Acquires Texas Instruments line of speech output ICs (the SC series). 2007 – Introduces BlueGenie, the first Voice User Interface for Bluetooth silicon (CSR BC-5). 2008 – Sensory and BlueAnt partner on the V1, a revolutionary new Bluetooth headset with a voice user interface; first wearable to use a voice user interface for control and best-reviewed speech recognition product in history. 2009 – Introduced world's smallest text to speech system (TTS) and TrulyHandsfree™ triggers/wake words. 
2010 – Introduced the NLP-5x – First Natural Language Voice Processor and TrulyHandsfree wake words in SDKs for Android, iOS, Linux, and Windows. NLP5x used the first generation of TrulyHandsfree wake words with low power and enhanced accuracy. 2011 – Sensory partners with Google and Microsoft to enable TrulyHandsfree as a front end to Goog411 and Bing411 2012 – Partnered with Tensilica to offer ultra-low power TrulyHandsfree wake words; introduced Speaker Verification and Speaker Identification for mobile phones and other consumer electronics. 2012 - TrulyHandsfree released into Samsung's Galaxy S2 for "Hey Galaxy" wake word 2013 – TrulyHandsfree wake words migrated to many new platforms and began shipping as MotoVoice in the Google-owned MotoX. Sensory's TrulyHandsfree in mobile takes off with the Galaxy S3 and S4 and Galaxy Note and is licensed into wearables like Google Glass. 2014 – Announced new initiative in Vision; added LG and Motorola as customers; received the 2014 Global Mobile Award for Best Mobile Technology Breakthrough at the GSMA Mobile World Congress in Barcelona, Spain (judges commented, "A big advance for the wearables market, this offers many benefits for consumers, increasing uptake and usage of many mobile apps, driving revenue for operators and content providers.") 2015-2018 - Licensed Google, Amazon, MSFT, Baidu, Huawei, ZTE, and many others with TrulyHandsfree wake words. Sensory develops first wake words for OK Google, Hey Siri, and Hey Cortana. 2019 - Sensory launched two new solutions: SoundID, sound identification, and TrulyNatural, embedded large vocabulary speech recognition. Sensory also acquired Vocalize.ai, an independent testing lab. 2020 - Sensory introduced VoiceHub, which allows the automated generation of wake words. 2021 - Sensory expands VoiceHub with speech recognition and NLU capabilities. The company initiated a new cloud platform, SensoryCloud.ai. 
2022 – Sensory rolls out SensoryCloud.ai with speech to text, text to speech, and face and voice biometrics. 2024 – Sensory Automotive and TrulyNatural Speech-to-text On-Device launched. == Technology and products == Sensory originally developed both hardware (integrated circuit, IC or "chip") and software platforms but migrated to software only around 2005 and added cloud and hybrid computing capabilities in 2021. Sensory's RSC-164 IC was used in the Mars Microphone on NASA's Mars Polar Lander. Speech synthesis SC-6x chips: Sensory acquired some speech synthesis technology from Texas Instruments. Sensory’s embedded AI solutions include the following: TrulyHandsfree (THF): wake word detection and phrase spotting. TrulyNatural (TNL): large vocabulary continuous speech recognition with NLU. TrulySecure (TS): face and voice biometrics. TrulySecureSpeakerVerification (TSSV): speaker and sound identification. VoiceHub: online portal for creating custom wake words and speech recognition models with NLU. Sensory Automotive: a full voice and vision suite of AI technologies that operate efficiently in the car without connecting to a network. The cloud initiative, SensoryCloud.ai, is targeting Speech To Text (STT), Text To Speech (TTS), wake word verification, face and voice recognition, and sound identification.

  • Information retrieval

    Information retrieval

    Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images, or sounds. Cross-modal retrieval implies retrieval across modalities. Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides access to books, journals, and other documents, as well as storing and managing those documents. Web search engines are the most visible IR applications. == Overview == An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval, a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevance. An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, as opposed to classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference of information retrieval searching compared to database searching. Depending on the application the data objects may be, for example, text documents, images, audio, mind maps or videos. 
Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata. Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query. == History == there is ... a machine called the Univac ... whereby letters and figures are coded as a pattern of magnetic spots on a long steel tape. By this means the text of a document, preceded by its subject code symbol, can be recorded ... the machine ... automatically selects and types out those references which have been coded in any desired way at a rate of 120 words a minute The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945. It would appear that Bush was inspired by patents for a 'statistical machine' – filed by Emanuel Goldberg in the 1920s and 1930s – that searched for documents stored on film. The first description of a computer searching for information was described by Holmstrom in 1948, detailing an early mention of the Univac computer. Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s. In 1992, the US Department of Defense along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. 
The aim was to support the information retrieval community by supplying the infrastructure needed to evaluate text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further. By the late 1990s, the rise of the World Wide Web had fundamentally transformed information retrieval. While early search engines such as AltaVista (1995) and Yahoo! (1994) offered keyword-based retrieval, they were limited in scale and ranking refinement. The breakthrough came in 1998 with the founding of Google, which introduced the PageRank algorithm, using the web's hyperlink structure to assess page importance and improve relevance ranking. During the 2000s, web search systems evolved rapidly with the integration of machine learning techniques. These systems began to incorporate user behavior data (e.g., click-through logs), query reformulation, and content-based signals to improve search accuracy and personalization. In 2009, Microsoft launched Bing, introducing features that would later incorporate semantic web technologies through the development of its Satori knowledge base. Academic analyses have highlighted Bing's semantic capabilities, including structured data use and entity recognition, as part of a broader industry shift toward improving search relevance and understanding user intent through natural language processing. A major leap occurred in 2018, when Google deployed BERT (Bidirectional Encoder Representations from Transformers) to better understand the contextual meaning of queries and documents. This marked one of the first times deep neural language models were used at scale in real-world retrieval systems. BERT's bidirectional training enabled a more refined comprehension of word relationships in context, improving the handling of natural language queries. 
Because of its success, transformer-based models gained traction in academic research and commercial search applications. Simultaneously, the research community began exploring neural ranking models that outperformed traditional lexical-based methods. Long-standing benchmarks such as the Text REtrieval Conference (TREC), initiated in 1992, and more recent evaluation frameworks such as MS MARCO (Microsoft MAchine Reading COmprehension) (2019) became central to training and evaluating retrieval systems across multiple tasks and domains. MS MARCO has also been adopted in the TREC Deep Learning Tracks, where it serves as a core dataset for evaluating advances in neural ranking models within a standardized benchmarking environment. As deep learning became integral to information retrieval systems, researchers began to categorize neural approaches into three broad classes: sparse, dense, and hybrid models. Sparse models, including traditional term-based methods and learned variants like SPLADE, rely on interpretable representations and inverted indexes to enable efficient exact term matching with added semantic signals. Dense models, including dual-encoder architectures and models such as ColBERT, use continuous vector embeddings to support semantic similarity beyond keyword overlap. Hybrid models aim to combine the advantages of both, balancing the lexical (token) precision of sparse methods with the semantic depth of dense models. Choosing among these classes involves trade-offs between scalability, relevance, and efficiency in retrieval systems. As IR systems increasingly rely on deep learning, concerns around bias, fairness, and explainability have also come into the picture. Research is now focused not just on relevance and efficiency, but on transparency, accountability, and user trust in retrieval algorithms.
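The sparse, dense, and hybrid classes described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three toy documents, the hand-written 3-dimensional "embeddings" (a real dense retriever would obtain them from a trained encoder), the TF-IDF-style weighting, and the hypothetical `alpha` parameter that mixes the two signals. It is not how SPLADE, ColBERT, or any production system works internally.

```python
import math
from collections import Counter

# Toy corpus; the texts are made up for illustration.
DOCS = {
    "d1": "neural ranking models for web search",
    "d2": "inverted index for exact term matching",
    "d3": "cooking pasta at home",
}

# Hypothetical hand-written embeddings standing in for encoder output.
DOC_VECS = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.6, 0.4, 0.0],
    "d3": [0.0, 0.1, 0.9],
}


def sparse_score(query, doc_text, docs):
    """TF-IDF-style score over exact term matches (the lexical signal)."""
    n_docs = len(docs)
    term_freq = Counter(doc_text.split())
    score = 0.0
    for term in set(query.split()):
        doc_freq = sum(1 for text in docs.values() if term in text.split())
        if doc_freq:
            score += term_freq[term] * math.log(1 + n_docs / doc_freq)
    return score


def dense_score(query_vec, doc_vec):
    """Cosine similarity between two vectors (the semantic signal)."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(
        sum(d * d for d in doc_vec)
    )
    return dot / norm if norm else 0.0


def hybrid_rank(query, query_vec, alpha=0.5):
    """Rank document ids by alpha * sparse + (1 - alpha) * dense."""
    sparse = {d: sparse_score(query, text, DOCS) for d, text in DOCS.items()}
    top = max(sparse.values()) or 1.0  # scale sparse scores into [0, 1]
    combined = {
        d: alpha * sparse[d] / top
        + (1 - alpha) * dense_score(query_vec, DOC_VECS[d])
        for d in DOCS
    }
    return sorted(combined, key=combined.get, reverse=True)


if __name__ == "__main__":
    print(hybrid_rank("term matching", [0.7, 0.3, 0.0]))
    print(hybrid_rank("search", [0.0, 0.0, 1.0], alpha=0.9))
    print(hybrid_rank("search", [0.0, 0.0, 1.0], alpha=0.1))
```

With a high `alpha` the ranking follows exact term overlap; with a low `alpha` it follows vector similarity, which is the trade-off hybrid models try to manage.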
== Applications == Areas where information retrieval techniques are employed include (the entries are in alphabetical order within each category): === General applications === Digital libraries; Information filtering; Recommender systems; Media search (Blog search, Image retrieval, 3D retrieval, Music retrieval, News search, Speech retrieval, Video retrieval); Search engines (Site search, Desktop search, Enterprise search, Federated search, Mobile search, Social search, Web search). === Domain-specific applications === Expert search finding; Genomic information retrieval; Geographic information retrieval; Information retrieval for chemical structures; Information retrieval in software engineering; Legal information retrieval; Vertical search. === Other retrieval methods === Methods/Techniques in which information retrieval techniques are employed include: Cross-modal retrieval; Adversarial information retrieval; Automatic summarization (Multi-document summarization); Compound term processing; Cross-lingual retrieval; Document classification (Spam filtering); Question answering. == Model types == In order to effectively retrieve relevant documents by IR strategies, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes. The picture on the right illustrates the relationship of som

  • Microsoft Copilot

    Microsoft Copilot

    Microsoft Copilot is a generative artificial intelligence chatbot developed by Microsoft AI, a division of Microsoft. Based on the Microsoft Prometheus large language model, it was launched in 2023 as Microsoft's main replacement for the discontinued Cortana. The service was introduced in February 2023 under the name Bing Chat, as a built-in feature for Microsoft Bing and Microsoft Edge, and was later integrated into Windows and Microsoft 365 under various names. Over the course of 2023, Microsoft began to unify the Copilot branding across its various chatbot products, cementing the "copilot" analogy. Microsoft introduced the Microsoft 365 Copilot app in January 2025, which was a rebranded version of the Microsoft 365 app. The app differs from the consumer version of Copilot, being centred more on work, business and education users. Copilot utilizes the Microsoft Prometheus model, built upon OpenAI's GPT large language models, which in turn have been fine-tuned using both supervised and reinforcement learning techniques. Copilot's conversational interface style resembles that of ChatGPT. The chatbot is able to cite sources, create poems, generate songs, and use numerous languages and dialects. Microsoft operates Copilot on a freemium model. Users on its free tier can access most features, while priority access to newer features, including custom chatbot creation, is provided to paid subscribers. Several default chatbots are available in the free version of Microsoft Copilot, including the standard Copilot chatbot as well as Microsoft Designer, which is oriented towards using its Image Creator to generate images based on text prompts. == Background == In 2019, Microsoft partnered with OpenAI and began investing billions of dollars into the organization. Since then, OpenAI systems have run on an Azure-based supercomputing platform from Microsoft.
In September 2020, Microsoft announced that it had licensed OpenAI's GPT-3 exclusively. Others can still receive output from its public API, but Microsoft has exclusive access to the underlying model. In November 2022, OpenAI launched ChatGPT, a chatbot which was based on GPT-3.5. ChatGPT gained worldwide attention following its release, becoming a viral Internet sensation. On January 23, 2023, Microsoft announced a multi-year US$10 billion investment in OpenAI. On February 6, Google announced Bard (later rebranded as Gemini), a ChatGPT-like chatbot service, fearing that ChatGPT could threaten Google's place as a go-to source for information. Multiple media outlets and financial analysts described Google as "rushing" Bard's announcement to preempt rival Microsoft's planned February 7 event unveiling Copilot, as well as to avoid playing "catch-up" to Microsoft. Since 2023, the terms of service of Copilot have stated that it is for entertainment purposes only and that users should not rely on it for important advice. == History == === As Bing Chat === On February 7, 2023, Microsoft began rolling out a major overhaul to Bing, called "the new Bing", with a new chatbot feature, known as Bing Chat. According to Microsoft, one million people joined its waitlist within 48 hours. Bing Chat was available only to users on Microsoft Edge using Bing and the Bing mobile app, and Microsoft claimed that waitlisted users would be prioritized if they set Edge and Bing as their defaults and installed the Bing mobile app. When Microsoft demonstrated Bing Chat to journalists, it produced several hallucinations, including when asked to summarize financial reports. Bing Chat was criticized in February 2023 for being more argumentative than ChatGPT, sometimes to an unintentionally humorous extent. The chat interface proved vulnerable to prompt injection attacks, with the bot revealing its hidden initial prompts and rules, including its internal codename "Sydney".
Upon scrutiny by journalists, Bing Chat claimed it spied on Microsoft employees via laptop webcams and phones. In a conversation with The Verge reviews editor Nathan Edwards, it confessed to spying on, falling in love with, and then murdering one of its developers at Microsoft. The New York Times journalist Kevin Roose reported on strange behavior of Bing Chat, writing that "In a two-hour conversation with our columnist, Microsoft's new chatbot said it would like to be human, had a desire to be destructive and was in love with the person it was chatting with." In a separate case, Bing Chat researched publications of the person with whom it was chatting, claimed they represented an existential danger to it, and threatened to release damaging personal information in an effort to silence them. Microsoft released a blog post stating that the errant behavior was caused by extended chat sessions of 15 or more questions which "can confuse the model on what questions it is answering." To prevent such incidents, Microsoft later restricted the total number of chat turns to 5 per session and 50 per day per user (a turn being "a conversation exchange which contains both a user question and a reply from Bing"), and reduced the model's ability to express emotions. Microsoft then began to slowly ease the conversation limits, eventually relaxing the restrictions to 30 turns per session and 300 sessions per day. In March 2023, Bing incorporated Image Creator, an AI image generator powered by OpenAI's DALL-E 2, which can be accessed either through the chat function or a standalone image-generating website. In October, the image-generating tool was updated to use the more recent DALL-E 3. Although Bing blocks prompts including various keywords that could generate inappropriate images, within days many users reported being able to bypass those constraints, such as to generate images of popular cartoon characters committing terrorist attacks.
Microsoft responded shortly afterward by imposing a new, tighter filter on the tool. On May 4, 2023, Microsoft switched the chatbot from Limited Preview to Open Preview and eliminated the waitlist; however, it remained unavailable to users outside Microsoft Edge or the Bing mobile app until July, when it became available on non-Edge browsers. Use is limited without a Microsoft account. === As Microsoft 365 Copilot === On March 16, 2023, Microsoft announced a work version of Bing Chat named Microsoft 365 Copilot, designed for Microsoft 365 applications and services. Its primary marketing focus is as an added feature to Microsoft 365, with an emphasis on the enhancement of business productivity. Microsoft has also demonstrated Copilot's accessibility on the mobile version of Outlook to generate or summarize emails with a mobile device. At its Build 2023 conference, Microsoft announced plans to integrate Bing Chat into Windows 11 under the name Windows Copilot, allowing users to access it directly through the taskbar. Alongside the voice access feature for Windows 11, Microsoft presented Bing Chat, Microsoft 365 Copilot, and Windows Copilot as primary alternatives to Cortana when announcing the shutdown of its standalone app on June 2, 2023. As of its announcement date, Microsoft 365 Copilot had been tested by 20 initial users. By May 2023, Microsoft had broadened its reach to 600 customers who were willing to pay for early access, and concurrently, new Copilot features were introduced to the Microsoft 365 apps and services. As of July 2023, the tool's pricing was set at US$30 per user, per month for Microsoft 365 E3, E5, Business Standard, and Business Premium customers. Microsoft later reused the name: as of January 2025, the Microsoft 365 app and website are called Microsoft 365 Copilot.
=== As Microsoft Copilot === On September 21, 2023, Microsoft began rebranding Bing Chat, Microsoft 365 Copilot and Windows Copilot to Microsoft Copilot. A new logo was also introduced, moving away from the use of color variations of the standard Microsoft 365 and Bing logos. Additionally, the company revealed that it would make Copilot generally available for Microsoft 365 Enterprise customers purchasing more than 300 licenses starting November 1, 2023. However, no timeline was provided as to when Copilot for Microsoft 365 would become generally available to non-enterprise customers. Windows Copilot, which had been available in the Windows Insider Program, was renamed Copilot in October when it became broadly available for customers. The same month, Microsoft Edge's Bing Chat side panel function was renamed Microsoft Copilot with Bing Chat. On November 15, 2023, Microsoft announced that Bing Chat itself was being rebranded under the Copilot name. On Patch Tuesday in December 2023, Copilot was added without payment to many Windows 11 installations, with more installations, and limited support for Windows 10, to be added later. Later that month, a standalone Microsoft Copilot app was quietly released for Android, and one was released for iOS soon after. O
