AI News and Guides

Explore the best AI News and Guides — independent reviews, comparisons, pricing and step-by-step how-to guides, curated by Aizhi.

  • List of speech recognition software

    List of speech recognition software

    Speech recognition software is available for many computing platforms, operating systems, use models, and software licenses. Here is a listing of such, grouped in various useful ways. == Acoustic models and speech corpus (compilation) == The following list presents notable speech recognition software engines with a brief synopsis of characteristics. == Macintosh == == Cross-platform web apps based on Chrome == The following list presents notable speech recognition software that operate in a Chrome browser as web apps. They make use of HTML5 Web-Speech-API. == Mobile devices and smartphones == Many mobile phone handsets, including feature phones and smartphones such as iPhones and BlackBerrys, have basic dial-by-voice features built in. Many third-party apps have implemented natural-language speech recognition support, including: == Windows == === Windows built-in speech recognition === The Windows Speech Recognition version 8.0 by Microsoft comes built into Windows Vista, Windows 7, Windows 8 and Windows 10. Speech Recognition is available only in English, French, Spanish, German, Japanese, Simplified Chinese, and Traditional Chinese and only in the corresponding version of Windows; meaning you cannot use the speech recognition engine in one language if you use a version of Windows in another language. Windows 7 Ultimate and Windows 8 Pro allow you to change the system language, and therefore change which speech engine is available. Windows Speech Recognition evolved into Cortana (software), a personal assistant included in Windows 10. === Windows 7, 8, 10, 11 third-party speech recognition === Braina – Dictate into third party software and websites, fill web forms and execute vocal commands. Dragon NaturallySpeaking from Nuance Communications – Successor to the older DragonDictate product. Focus on dictation. 64-bit Windows support since version 10.1. Tazti – Create speech command profiles to play PC games and control applications – programs. Create speech commands to open files, folders, webpages, applications. Windows 7, Windows 8 and Windows 8.1 versions. Voice Finger – software that improves the Windows speech recognition system by adding several extensions to it. The software enables controlling the mouse and the keyboard by only using the voice. It is especially useful for aiding users to overcome disabilities or to heal from computer injuries. === Microsoft Speech API === The first version of the Microsoft Speech API was released for Windows NT 3.51 and Windows 95 in 1995, it was then part of Windows up to Windows Vista. This initial version already contained Direct Speech Recognition and Direct Text To Speech APIs which applications could use to directly control engines, as well as simplified 'higher-level' Voice Command and Voice Talk APIs. Speech recognition functionality included as part of Microsoft Office and on Tablet PCs running Microsoft Windows XP Tablet PC Edition. It can also be downloaded as part of the Speech SDK 5.1 for Windows applications, but since that is aimed at developers building speech applications, the pure SDK form lacks any user interface (numerous applications were available), and thus is unsuitable for end users. == Built-in software == Microsoft Kinect includes built-in software which allows speech recognition of commands. Older generations of Nokia phones like Nokia N Series (before using Windows 7 mobile technology) used speech-recognition with family names from contact list and a few commands. Siri, originally implemented in the iPhone 4S, Apple's personal assistant for iOS, which uses technology from Nuance Communications. Cortana (software), Microsoft's personal assistant built into Windows Phone and Windows 10. == Interactive voice response == The following are interactive voice response (IVR) systems: CSLU Toolkit Genesys HTK – copyrighted by Microsoft, but allows altering software for licensee's internal use LumenVox ASR Tellme Networks; acquired by Microsoft == Unix-like x86 and x86-64 speech transcription software == Janus Recognition Toolkit (JRTk) Mozilla DeepSpeech was developing an open-source Speech-To-Text engine based on Baidu's deep speech research paper. Weesper Neon Flow – professional voice-dictation software that provides offline speech-to-text processing on macOS and Windows using local AI models. It is not open source and offers a paid subscription after a 15‑day free trial. Vocalinux – open-source speech transcription software for Linux. == Discontinued software == IBM VoiceType (formerly IBM Personal Dictation System) IBM ViaVoice – Embedded version still maintained by IBM. No longer supported for versions above Windows Vista. Untested above macOS 10.4 or on Macintoshes with an Intel chipset. Quack.com; acquired by AOL; the name has now been reused for an iPad search app. SpeechWorks from Nuance Communications. Yap Speech Cloud – Speech-to-text platform acquired by Amazon.com.

    Read more →
  • Point-set registration

    Point-set registration

    In computer vision, pattern recognition, and robotics, point-set registration, also known as point-cloud registration or scan matching, is the process of finding a spatial transformation (e.g., scaling, rotation and translation) that aligns two point clouds. The purpose of finding such a transformation includes merging multiple data sets into a globally consistent model (or coordinate frame), and mapping a new measurement to a known data set to identify features or to estimate its pose. Raw 3D point cloud data are typically obtained from Lidars and RGB-D cameras. 3D point clouds can also be generated from computer vision algorithms such as triangulation, bundle adjustment, and more recently, monocular image depth estimation using deep learning. For 2D point set registration used in image processing and feature-based image registration, a point set may be 2D pixel coordinates obtained by feature extraction from an image, for example corner detection. Point cloud registration has extensive applications in autonomous driving, motion estimation and 3D reconstruction, object detection and pose estimation, robotic manipulation, simultaneous localization and mapping (SLAM), panorama stitching, virtual and augmented reality, and medical imaging. As a special case, registration of two point sets that only differ by a 3D rotation (i.e., there is no scaling and translation), is called the Wahba Problem and also related to the orthogonal procrustes problem. == Formulation == The problem may be summarized as follows: Let { M , S } {\displaystyle \lbrace {\mathcal {M}},{\mathcal {S}}\rbrace } be two finite size point sets in a finite-dimensional real vector space R d {\displaystyle \mathbb {R} ^{d}} , which contain M {\displaystyle M} and N {\displaystyle N} points respectively (e.g., d = 3 {\displaystyle d=3} recovers the typical case of when M {\displaystyle {\mathcal {M}}} and S {\displaystyle {\mathcal {S}}} are 3D point sets). The problem is to find a transformation to be applied to the moving "model" point set M {\displaystyle {\mathcal {M}}} such that the difference (typically defined in the sense of point-wise Euclidean distance) between M {\displaystyle {\mathcal {M}}} and the static "scene" set S {\displaystyle {\mathcal {S}}} is minimized. In other words, a mapping from R d {\displaystyle \mathbb {R} ^{d}} to R d {\displaystyle \mathbb {R} ^{d}} is desired which yields the best alignment between the transformed "model" set and the "scene" set. The mapping may consist of a rigid or non-rigid transformation. The transformation model may be written as T {\displaystyle T} , using which the transformed, registered model point set is: The output of a point set registration algorithm is therefore the optimal transformation T ⋆ {\displaystyle T^{\star }} such that M {\displaystyle {\mathcal {M}}} is best aligned to S {\displaystyle {\mathcal {S}}} , according to some defined notion of distance function dist ⁡ ( ⋅ , ⋅ ) {\displaystyle \operatorname {dist} (\cdot ,\cdot )} : where T {\displaystyle {\mathcal {T}}} is used to denote the set of all possible transformations that the optimization tries to search for. The most popular choice of the distance function is to take the square of the Euclidean distance for every pair of points: where ‖ ⋅ ‖ 2 {\displaystyle \|\cdot \|_{2}} denotes the vector 2-norm, s m {\displaystyle s_{m}} is the corresponding point in set S {\displaystyle {\mathcal {S}}} that attains the shortest distance to a given point m {\displaystyle m} in set M {\displaystyle {\mathcal {M}}} after transformation. Minimizing such a function in rigid registration is equivalent to solving a least squares problem. == Types of algorithms == When the correspondences (i.e., s m ↔ m {\displaystyle s_{m}\leftrightarrow m} ) are given before the optimization, for example, using feature matching techniques, then the optimization only needs to estimate the transformation. This type of registration is called correspondence-based registration. On the other hand, if the correspondences are unknown, then the optimization is required to jointly find out the correspondences and transformation together. This type of registration is called simultaneous pose and correspondence registration. === Rigid registration === Given two point sets, rigid registration yields a rigid transformation which maps one point set to the other. A rigid transformation is defined as a transformation that does not change the distance between any two points. Typically such a transformation consists of translation and rotation. In rare cases, the point set may also be mirrored. In robotics and computer vision, rigid registration has the most applications. === Non-rigid registration === Given two point sets, non-rigid registration yields a non-rigid transformation which maps one point set to the other. Non-rigid transformations include affine transformations such as scaling and shear mapping. However, in the context of point set registration, non-rigid registration typically involves nonlinear transformation. If the eigenmodes of variation of the point set are known, the nonlinear transformation may be parametrized by the eigenvalues. A nonlinear transformation may also be parametrized as a thin plate spline. === Other types === Some approaches to point set registration use algorithms that solve the more general graph matching problem. However, the computational complexity of such methods tend to be high and they are limited to rigid registrations. In this article, we will only consider algorithms for rigid registration, where the transformation is assumed to contain 3D rotations and translations (possibly also including a uniform scaling). The PCL (Point Cloud Library) is an open-source framework for n-dimensional point cloud and 3D geometry processing. It includes several point registration algorithms. == Correspondence-based registration == Correspondence-based methods assume the putative correspondences m ↔ s m {\displaystyle m\leftrightarrow s_{m}} are given for every point m ∈ M {\displaystyle m\in {\mathcal {M}}} . Therefore, we arrive at a setting where both point sets M {\displaystyle {\mathcal {M}}} and S {\displaystyle {\mathcal {S}}} have N {\displaystyle N} points and the correspondences m i ↔ s i , i = 1 , … , N {\displaystyle m_{i}\leftrightarrow s_{i},i=1,\dots ,N} are given. === Outlier-free registration === In the simplest case, one can assume that all the correspondences are correct, meaning that the points m i , s i ∈ R 3 {\displaystyle m_{i},s_{i}\in \mathbb {R} ^{3}} are generated as follows:where l > 0 {\displaystyle l>0} is a uniform scaling factor (in many cases l = 1 {\displaystyle l=1} is assumed), R ∈ SO ( 3 ) {\displaystyle R\in {\text{SO}}(3)} is a proper 3D rotation matrix ( SO ( d ) {\displaystyle {\text{SO}}(d)} is the special orthogonal group of degree d {\displaystyle d} ), t ∈ R 3 {\displaystyle t\in \mathbb {R} ^{3}} is a 3D translation vector and ϵ i ∈ R 3 {\displaystyle \epsilon _{i}\in \mathbb {R} ^{3}} models the unknown additive noise (e.g., Gaussian noise). Specifically, if the noise ϵ i {\displaystyle \epsilon _{i}} is assumed to follow a zero-mean isotropic Gaussian distribution with standard deviation σ i {\displaystyle \sigma _{i}} , i.e., ϵ i ∼ N ( 0 , σ i 2 I 3 ) {\displaystyle \epsilon _{i}\sim {\mathcal {N}}(0,\sigma _{i}^{2}I_{3})} , then the following optimization can be shown to yield the maximum likelihood estimate for the unknown scale, rotation and translation:Note that when the scaling factor is 1 and the translation vector is zero, then the optimization recovers the formulation of the Wahba problem. Despite the non-convexity of the optimization (cb.2) due to non-convexity of the set SO ( 3 ) {\displaystyle {\text{SO}}(3)} , seminal work by Berthold K.P. Horn showed that (cb.2) actually admits a closed-form solution, by decoupling the estimation of scale, rotation and translation. Similar results were discovered by Arun et al. In addition, in order to find a unique transformation ( l , R , t ) {\displaystyle (l,R,t)} , at least N = 3 {\displaystyle N=3} non-collinear points in each point set are required. More recently, Briales and Gonzalez-Jimenez have developed a semidefinite relaxation using Lagrangian duality, for the case where the model set M {\displaystyle {\mathcal {M}}} contains different 3D primitives such as points, lines and planes (which is the case when the model M {\displaystyle {\mathcal {M}}} is a 3D mesh). Interestingly, the semidefinite relaxation is empirically tight, i.e., a certifiably globally optimal solution can be extracted from the solution of the semidefinite relaxation. === Robust registration === The least squares formulation (cb.2) is known to perform arbitrarily badly in the presence of outliers. An outlier correspondence is a pair of measurements s i ↔ m i {\displaystyle s_{i}\leftrightarrow m_{i}} that departs from the generative model (cb.1). In this case, one can consider a differen

    Read more →
  • Geometric hashing

    Geometric hashing

    In computer science, geometric hashing is a method for efficiently finding two-dimensional objects represented by discrete points that have undergone an affine transformation, though extensions exist to other object representations and transformations. In an off-line step, the objects are encoded by treating each pair of points as a geometric basis. The remaining points can be represented in an invariant fashion with respect to this basis using two parameters. For each point, its quantized transformed coordinates are stored in the hash table as a key, and indices of the basis points as a value. Then a new pair of basis points is selected, and the process is repeated. In the on-line (recognition) step, randomly selected pairs of data points are considered as candidate bases. For each candidate basis, the remaining data points are encoded according to the basis and possible correspondences from the object are found in the previously constructed table. The candidate basis is accepted if a sufficiently large number of the data points index a consistent object basis. Geometric hashing was originally suggested in computer vision for object recognition in 2D and 3D, but later was applied to different problems such as structural alignment of proteins. == Geometric hashing in computer vision == Geometric hashing is a method used for object recognition. Let’s say that we want to check if a model image can be seen in an input image. This can be accomplished with geometric hashing. The method could be used to recognize one of the multiple objects in a base, in this case the hash table should store not only the pose information but also the index of object model in the base. === Example === For simplicity, this example will not use too many point features and assume that their descriptors are given by their coordinates only (in practice local descriptors such as SIFT could be used for indexing). ==== Training Phase ==== Find the model's feature points. Assume that 5 feature points are found in the model image with the coordinates ( 12 , 17 ) ; {\displaystyle (12,17);} ( 45 , 13 ) ; {\displaystyle (45,13);} ( 40 , 46 ) ; {\displaystyle (40,46);} ( 20 , 35 ) ; {\displaystyle (20,35);} ( 35 , 25 ) {\displaystyle (35,25)} , see the picture. Introduce a basis to describe the locations of the feature points. For 2D space and similarity transformation the basis is defined by a pair of points. The point of origin is placed in the middle of the segment connecting the two points (P2, P4 in our example), the x ′ {\displaystyle x'} axis is directed towards one of them, the y ′ {\displaystyle y'} is orthogonal and goes through the origin. The scale is selected such that absolute value of x ′ {\displaystyle x'} for both basis points is 1. Describe feature locations with respect to that basis, i.e. compute the projections to the new coordinate axes. The coordinates should be discretised to make recognition robust to noise, we take the bin size 0.25. We thus get the coordinates ( − 0.75 , − 1.25 ) ; {\displaystyle (-0.75,-1.25);} ( 1.00 , 0.00 ) ; {\displaystyle (1.00,0.00);} ( − 0.50 , 1.25 ) ; {\displaystyle (-0.50,1.25);} ( − 1.00 , 0.00 ) ; {\displaystyle (-1.00,0.00);} ( 0.00 , 0.25 ) {\displaystyle (0.00,0.25)} Store the basis in a hash table indexed by the features (only transformed coordinates in this case). If there were more objects to match with, we should also store the object number along with the basis pair. Repeat the process for a different basis pair (Step 2). It is needed to handle occlusions. Ideally, all the non-colinear pairs should be enumerated. We provide the hash table after two iterations, the pair (P1, P3) is selected for the second one. Hash Table: Most hash tables cannot have identical keys mapped to different values. So in real life one won’t encode basis keys (1.0, 0.0) and (-1.0, 0.0) in a hash table. ==== Recognition Phase ==== Find interesting feature points in the input image. Choose an arbitrary basis. If there isn't a suitable arbitrary basis, then it is likely that the input image does not contain the target object. Describe coordinates of the feature points in the new basis. Quantize obtained coordinates as it was done before. Compare all the transformed point features in the input image with the hash table. If the point features are identical or similar, then increase the count for the corresponding basis (and the type of object, if any). For each basis such that the count exceeds a certain threshold, verify the hypothesis that it corresponds to an image basis chosen in Step 2. Transfer the image coordinate system to the model one (for the supposed object) and try to match them. If successful, the object is found. Otherwise, go back to Step 2. === Finding mirrored pattern === It seems that this method is only capable of handling scaling, translation, and rotation. However, the input image may contain the object in mirror transform. Therefore, geometric hashing should be able to find the object, too. There are two ways to detect mirrored objects. For the vector graph, make the left side positive, and the right side negative. Multiplying the x position by -1 will give the same result. Use 3 points for the basis. This allows detecting mirror images (or objects). Actually, using 3 points for the basis is another approach for geometric hashing. === Geometric hashing in higher-dimensions === Similar to the example above, hashing applies to higher-dimensional data. For three-dimensional data points, three points are also needed for the basis. The first two points define the x-axis, and the third point defines the y-axis (with the first point). The z-axis is perpendicular to the created axis using the right-hand rule. Notice that the order of the points affects the resulting basis

    Read more →
  • IT operations analytics

    IT operations analytics

    In the fields of information technology (IT) and systems management, IT operations analytics (ITOA) is an approach or method to retrieve, analyze, and report data for IT operations. ITOA may apply big data analytics to large datasets to produce business insights. In 2014, Gartner predicted its use might increase revenue or reduce costs. By 2017, it predicted that 15% of enterprises will use IT operations analytics technologies. == Definition == IT operations analytics (ITOA) (also known as advanced operational analytics, or IT data analytics) technologies are primarily used to discover complex patterns in high volumes of often "noisy" IT system availability and performance data. Forrester Research defined IT analytics as "The use of mathematical algorithms and other innovations to extract meaningful information from the sea of raw data collected by management and monitoring technologies." Note, ITOA is different than AIOps, which focuses on applying artificial intelligence and machine learning to the applications of ITOA. == History == Operations research as a discipline emerged from the Second World War to improve military efficiency and decision-making on the battlefield. However, only with the emergence of machine learning tech in the early 2000s could an artificially intelligent operational analytics platform actually begin to engage in the high-level pattern recognition that could adequately serve business needs. A critical catalyst towards ITOA development was the rise of Google, which pioneered a predictive analytics model that represented the first attempt to read into patterns of human behavior on the Internet. IT specialists then applied predictive analytics to the IT Industry, coming forward with platforms that can sift through data to generate insights without the need for human intervention. Due to the mainstream embrace of cloud computing and the increasing desire for businesses to adopt more big data practices, the ITOA industry has grown significantly since 2010. A 2016 ExtraHop survey of large and mid-size corporations indicates that 65 percent of the businesses surveyed will seek to integrate their data silos either this year or the next. The current goals of ITOA platforms are to improve the accuracy of their APM services, facilitate better integration with the data, and to enhance their predictive analytics capabilities. == Applications == ITOA systems tend to be used by IT operations teams, and Gartner describes seven applications of ITOA systems: Root cause analysis: The models, structures and pattern descriptions of IT infrastructure or application stack being monitored can help users pinpoint fine-grained and previously unknown root causes of overall system behavior pathologies. Proactive control of service performance and availability: Predicts future system states and the impact of those states on performance. Problem assignment: Determines how problems may be resolved or, at least, direct the results of inferences to the most appropriate individuals, or communities in the enterprise for problem resolution. Service impact analysis: When multiple root causes are known, the analytics system's output is used to determine and rank the relative impact, so that resources can be devoted to correcting the fault in the most timely and cost-effective way possible. Complement best-of-breed technology: The models, structures and pattern descriptions of IT infrastructure or application stack being monitored are used to correct or extend the outputs of other discovery-oriented tools to improve the fidelity of information used in operational tasks (e.g., service dependency maps, application runtime architecture topologies, network topologies). Real time application behavior learning: Learns & correlates the behavior of Application based on user pattern and underlying Infrastructure on various application patterns, create metrics of such correlated patterns and store it for further analysis. Dynamically baselines threshold: Learns behavior of Infrastructure on various application user patterns and determines the Optimal behavior of the Infra and technological components, bench marks and baselines the low and high water mark for the specific environments and dynamically changes the bench mark baselines with the changing infra and user patterns without any manual intervention. == Types == In their Data Growth Demands a Single, Architected IT Operations Analytics Platform, Gartner Research describes five types of analytics technologies: Log analysis Unstructured text indexing, search and inference (UTISI) Topological analysis (TA) Multidimensional database search and analysis (MDSA) Complex operations event processing (COEP) Statistical pattern discovery and recognition (SPDR) == Tools and ITOA platforms == A number of vendors operate in the ITOA space:

    Read more →
  • Robinson compass mask

    Robinson compass mask

    In image processing, a Robinson compass mask is a type of compass mask used for edge detection. It has eight major compass orientations, each will extract the edges in respect to its direction. A combined use of compass masks of different directions could detect the edges from different angles. == Technical explanation == The Robinson compass mask is defined by taking a single mask and rotating it to form eight orientations: North: [ − 1 0 1 − 2 0 2 − 1 0 1 ] {\displaystyle {\text{North:}}{\begin{bmatrix}-1&0&1\\-2&0&2\\-1&0&1\end{bmatrix}}} North West: [ 0 1 2 − 1 0 1 − 2 − 1 0 ] {\displaystyle {\text{North West:}}{\begin{bmatrix}0&1&2\\-1&0&1\\-2&-1&0\end{bmatrix}}} West: [ 1 2 1 0 0 0 − 1 − 2 − 1 ] {\displaystyle {\text{West:}}{\begin{bmatrix}1&2&1\\0&0&0\\-1&-2&-1\end{bmatrix}}} South West: [ 2 1 0 1 0 − 1 0 − 1 − 2 ] {\displaystyle {\text{South West:}}{\begin{bmatrix}2&1&0\\1&0&-1\\0&-1&-2\end{bmatrix}}} South: [ 1 0 − 1 2 0 − 2 1 0 − 1 ] {\displaystyle {\text{South:}}{\begin{bmatrix}1&0&-1\\2&0&-2\\1&0&-1\end{bmatrix}}} South East: [ 0 − 1 − 2 1 0 − 1 2 1 0 ] {\displaystyle {\text{South East:}}{\begin{bmatrix}0&-1&-2\\1&0&-1\\2&1&0\end{bmatrix}}} East: [ − 1 − 2 − 1 0 0 0 1 2 1 ] {\displaystyle {\text{East:}}{\begin{bmatrix}-1&-2&-1\\0&0&0\\1&2&1\end{bmatrix}}} North East: [ − 2 − 1 0 − 1 0 1 0 1 2 ] {\displaystyle {\text{North East:}}{\begin{bmatrix}-2&-1&0\\-1&0&1\\0&1&2\end{bmatrix}}} The direction axis is the line of zeros in the matrix. Robinson compass mask is similar to kirsch compass masks, but is simpler to implement. Since the matrix coefficients only contains 0, 1, 2, and are symmetrical, only the results of four masks need to be calculated, the other four results are the negation of the first four results. An edge, or contour is an tiny area with neighboring distinct pixel values. The convolution of each mask with the image would create a high value output where there is a rapid change of pixel value, thus an edge point is found. All the detected edge points would line up as edges. == Example == An example of Robinson compass masks applied to the original image. Obviously, the edges in the direction of the mask is enhanced.

    Read more →
  • PARRY

    PARRY

    PARRY was an early example of a chatbot, implemented in 1972 by psychiatrist Kenneth Colby. == History == PARRY was written in 1972 by psychiatrist Kenneth Colby, then at Stanford University. While ELIZA was a simulation of a Rogerian therapist, PARRY attempted to simulate a person with paranoid schizophrenia. The program implemented a crude model of the behavior of a person with paranoid schizophrenia based on concepts, conceptualizations, and beliefs (judgements about conceptualizations: accept, reject, neutral). It also embodied a conversational strategy, and as such was a much more serious and advanced program than ELIZA. It was described as "ELIZA with attitude". PARRY was tested in the early 1970s using a variation of the Turing Test. A group of experienced psychiatrists analysed a combination of real patients and computers running PARRY through teleprinters. Another group of 33 psychiatrists were shown transcripts of the conversations. The two groups were then asked to identify which of the "patients" were human and which were computer programs. The psychiatrists were able to make the correct identification only 48 percent of the time — a figure consistent with random guessing. PARRY and ELIZA (also known as "the Doctor") interacted several times. The most famous of these exchanges occurred at the ICCC 1972, where PARRY and ELIZA were hooked up over ARPANET and responded to each other.

    Read more →
  • Language model benchmark

    Language model benchmark

    A language model benchmark is a standardized test designed to evaluate the performance of language models on various natural language processing tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding, generation, and reasoning. Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like answering questions, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field. In addition to accuracy, the metrics can include throughput, energy efficiency, bias, trust, and sustainability. == Overview == === Types === Benchmarks may be described by the following adjectives, not mutually exclusive: Classical: These tasks are studied in natural language processing, even before the advent of deep learning. Examples include the Penn Treebank for testing syntactic and semantic parsing, as well as bilingual translation benchmarked by BLEU scores. Question answering: These tasks have a text question and a text answer, often multiple-choice. They can be open-book or closed-book. Open-book QA resembles reading comprehension questions, with relevant passages included as annotation in the question, in which the answer appears. Closed-book QA includes no relevant passages. Closed-book QA is also called open-domain question-answering. Before the era of large language models, open-book QA was more common, and understood as testing information retrieval methods. Closed-book QA became common since GPT-2 as a method to measure knowledge stored within model parameters. Omnibus: An omnibus benchmark combines many benchmarks, often previously published. It is intended as an all-in-one benchmarking solution. Reasoning: These tasks are usually in the question-answering format, but are intended to be more difficult than standard question answering. Multimodal: These tasks require processing not only text, but also other modalities, such as images and sound. Examples include OCR and transcription. Agency: These tasks are for a language-model–based software agent that operates a computer for a user, such as editing images, browsing the web, etc. Adversarial: A benchmark is "adversarial" if the items in the benchmark are picked specifically so that certain models do badly on them. Adversarial benchmarks are often constructed after state of the art (SOTA) models have saturated (achieved 100% performance) a benchmark, to renew the benchmark. A benchmark is "adversarial" only at a certain moment in time, since what is adversarial may cease to be adversarial as newer SOTA models appear. Public/Private: A benchmark might be partly or entirely private, meaning that some or all of the questions are not publicly available. The idea is that if a question is publicly available, then it might be used for training, which would be "training on the test set" and invalidate the result of the benchmark. Usually, only the guardians of the benchmark have access to the private subsets, and to score a model on such a benchmark, one must send the model weights, or provide API access, to the guardians. The boundary between a benchmark and a dataset is not sharp. Generally, a dataset contains three "splits": training, test, and validation. Both the test and validation splits are essentially benchmarks. In general, a benchmark is distinguished from a test/validation dataset in that a benchmark is typically intended to be used to measure the performance of many different models that are not trained specifically for doing well on the benchmark, while a test/validation set is intended to be used to measure the performance of models trained specifically on the corresponding training set. In other words, a benchmark may be thought of as a test/validation set without a corresponding training set. Conversely, certain benchmarks may be used as a training set, such as the English Gigaword or the One Billion Word Benchmark, which in modern language is just the negative log-likelihood loss on a pretraining set with 1 billion words. Indeed, the distinction between benchmark and dataset in language models became sharper after the rise of the pretraining paradigm, whereby a model is first trained on massive, unlabeled datasets to learn general language patterns, syntax, and knowledge (pretraining), and the base model is then adapted to specific, downstream tasks using smaller, labeled datasets (fine-tuning). === Lifecycle === Generally, the life cycle of a benchmark consists of the following steps: Inception: A benchmark is published. It can be simply given as a demonstration of the power of a new model (implicitly) that others then picked up as a benchmark, or as a benchmark that others are encouraged to use (explicitly). Growth: More papers and models use the benchmark, and the performance on the benchmark grows. Maturity, degeneration or deprecation: A benchmark may be saturated, after which researchers move on to other benchmarks. Progress on the benchmark may also be neglected as the field moves to focus on other benchmarks. Renewal: A saturated benchmark can be upgraded to make it no longer saturated, allowing further progress. === Construction === Like datasets, benchmarks are typically constructed by several methods, individually or in combination: Web scraping: Ready-made question-answer pairs may be scraped online, such as from websites that teach mathematics and programming. Conversion: Items may be constructed programmatically from scraped web content, such as by blanking out named entities from sentences, and asking the model to fill in the blank. This was used for making the CNN/Daily Mail Reading Comprehension Task. Crowd sourcing: Items may be constructed by paying people to write them, such as on Amazon Mechanical Turk. This was used for making the MCTest. === Evaluation === Generally, benchmarks are fully automated. This limits the questions that can be asked. For example, with mathematical questions, "proving a claim" would be difficult to automatically check, while "calculate an answer with a unique integer answer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime. The benchmark scores are of the following kinds: For multiple choice or cloze questions, common scores are accuracy (frequency of correct answer), precision, recall, F1 score, etc. pass@n: The model is given n {\displaystyle n} attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems. k@n: The model makes n {\displaystyle n} attempts to solve each problem, but only k {\displaystyle k} attempts out of them are selected for submission. If any submission is correct, the model earns a point. The k@n score is the model's average score over all problems. cons@n: The model is given n {\displaystyle n} attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting". The pass@n score can be estimated more accurately by making N > n {\displaystyle N>n} attempts, and use the unbiased estimator 1 − ( N − c n ) ( N n ) {\displaystyle 1-{\frac {\binom {N-c}{n}}{\binom {N}{n}}}} , where c {\displaystyle c} is the number of correct attempts. For less well-formed tasks, where the output can be any sentence, there are the following commonly used scores including BLEU ROUGE, METEOR, NIST, word error rate, LEPOR, CIDEr, and SPICE. === Issues === error: Some benchmark answers may be wrong. ambiguity: Some benchmark questions may be ambiguously worded. subjective: Some benchmark questions may not have an objective answer at all. This problem generally prevents creative writing benchmarks. Similarly, this prevents benchmarking writing proofs in natural language, though benchmarking proofs in a formal language is possible. open-ended: Some benchmark questions may not have a single answer of a fixed size. This problem generally prevents programming benchmarks from using more natural tasks such as "write a program for X", and instead uses tasks such as "write a function that implements specification X". inter-annotator agreement: Some benchmark questions may be not fully objective, such that even people would not agree with 100% on what the answer should be. This is common in natural language processing tasks, such as syntactic annotation. shortcut: Some benchmark questions may be easily solved by an "unintended" shortcut. For example, in the SNLI benchmark, having a negative word like "not" in the second sentence is a strong signal for the "Contradiction" category, regardless of what the se

    Read more →
  • XLNet

    XLNet

    The XLNet was an autoregressive Transformer designed as an improvement over BERT, with 340M parameters and trained on 33 billion words. It was released on 19 June 2019, under the Apache 2.0 license. It achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference. == Architecture == The main idea of XLNet is to model language autoregressively like the GPT models, but allow for all possible permutations of a sentence. Concretely, consider the following sentence:My dog is cute.In standard autoregressive language modeling, the model would be tasked with predicting the probability of each word, conditioned on the previous words as its context: We factorize the joint probability of a sequence of words x 1 , … , x T {\displaystyle x_{1},\ldots ,x_{T}} using the chain rule: Pr ( x 1 , … , x T ) = Pr ( x 1 ) Pr ( x 2 | x 1 ) Pr ( x 3 | x 1 , x 2 ) … Pr ( x T | x 1 , … , x T − 1 ) . {\displaystyle \Pr(x_{1},\ldots ,x_{T})=\Pr(x_{1})\Pr(x_{2}|x_{1})\Pr(x_{3}|x_{1},x_{2})\ldots \Pr(x_{T}|x_{1},\ldots ,x_{T-1}).} For example, the sentence "My dog is cute" is factorized as: Pr ( My , dog , is , cute ) = Pr ( My ) Pr ( dog | My ) Pr ( is | My , dog ) Pr ( cute | My , dog , is ) . {\displaystyle \Pr({\text{My}},{\text{dog}},{\text{is}},{\text{cute}})=\Pr({\text{My}})\Pr({\text{dog}}|{\text{My}})\Pr({\text{is}}|{\text{My}},{\text{dog}})\Pr({\text{cute}}|{\text{My}},{\text{dog}},{\text{is}}).} Schematically, we can write it as → My → My dog → My dog is → My dog is cute . {\displaystyle {\texttt {}}{\texttt {}}{\texttt {}}{\texttt {}}\to {\text{My }}{\texttt {}}{\texttt {}}{\texttt {}}\to {\text{My dog }}{\texttt {}}{\texttt {}}\to {\text{My dog is }}{\texttt {}}\to {\text{My dog is cute}}.} However, for XLNet, the model is required to predict the words in a randomly generated order. Suppose we have sampled a randomly generated order 3241, then schematically, the model is required to perform the following prediction task: is dog is dog is cute → My dog is cute {\displaystyle {\texttt {}}{\texttt {}}{\texttt {}}{\texttt {}}\to {\texttt {}}{\texttt {}}{\text{is }}{\texttt {}}\to {\texttt {}}{\text{dog is }}{\texttt {}}\to {\texttt {}}{\text{dog is cute}}\to {\text{My dog is cute}}} By considering all permutations, XLNet is able to capture longer-range dependencies and better model the bidirectional context of words. === Two-Stream Self-Attention === To implement permutation language modeling, XLNet uses a two-stream self-attention mechanism. The two streams are: Content stream: This stream encodes the content of each word, as in standard causally masked self-attention. Query stream: This stream encodes the content of each word in the context of what has gone before. In more detail, it is a masked cross-attention mechanism, where the queries are from the query stream, and the key-value pairs are from the content stream. The content stream uses the causal mask M causal = [ 0 − ∞ − ∞ … − ∞ 0 0 − ∞ … − ∞ 0 0 0 … − ∞ ⋮ ⋮ ⋮ ⋱ ⋮ 0 0 0 … 0 ] {\displaystyle M_{\text{causal}}={\begin{bmatrix}0&-\infty &-\infty &\dots &-\infty \\0&0&-\infty &\dots &-\infty \\0&0&0&\dots &-\infty \\\vdots &\vdots &\vdots &\ddots &\vdots \\0&0&0&\dots &0\end{bmatrix}}} permuted by a random permutation matrix to P M causal P − 1 {\displaystyle PM_{\text{causal}}P^{-1}} . The query stream uses the cross-attention mask P ( M causal − ∞ I ) P − 1 {\displaystyle P(M_{\text{causal}}-\infty I)P^{-1}} , where the diagonal is subtracted away specifically to avoid the model "cheating" by looking at the content stream for what the current masked token is. Like the causal masking for GPT models, this two-stream masked architecture allows the model to train on all tokens in one forward pass. == Training == Two models were released: XLNet-Large, cased: 110M parameters, 24-layer, 1024-hidden, 16-heads XLNet-Base, cased: 340M parameters, 12-layer, 768-hidden, 12-heads. It was trained on a dataset that amounted to 32.89 billion tokens after tokenization with SentencePiece. The dataset was composed of BooksCorpus, and English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl. It was trained on 512 TPU v3 chips, for 5.5 days. At the end of training, it still under-fitted the data, meaning it could have achieved lower loss with more training. It took 0.5 million steps with an Adam optimizer, linear learning rate decay, and a batch size of 8192.

    Read more →
  • Reasoning model

    Reasoning model

    A reasoning model, also known as a reasoning language model (RLM) or large reasoning model (LRM), is a type of large language model (LLM) that has been specifically trained to solve complex tasks requiring multiple steps of logical reasoning. These models demonstrate superior performance on logic, mathematics, and programming tasks compared to standard LLMs. They possess the ability to revisit and revise earlier reasoning steps and utilize additional computation during inference as a method to scale performance, complementing traditional scaling approaches based on training data size, model parameters, and training compute. == Overview == Unlike traditional language models that generate responses immediately, reasoning models allocate additional compute, or thinking, time before producing an answer to solve multi-step problems. OpenAI introduced this terminology in September 2024 when it released the o1 series, describing the models as designed to "spend more time thinking" before responding. The company framed o1 as a reset in model naming that targets complex tasks in science, coding, and mathematics, and it contrasted o1's performance with GPT-4o on benchmarks such as AIME and Codeforces. Independent reporting the same week summarized the launch and highlighted OpenAI's claim that o1 automates chain-of-thought style reasoning to achieve large gains on difficult exams. In operation, reasoning models generate internal chains of intermediate steps, then select and refine a final answer. OpenAI reported that o1's accuracy improves as the model is given more reinforcement learning during training and more test-time compute at inference. The company initially chose to hide raw chains and instead return a model-written summary, stating that it "decided not to show" the underlying thoughts so researchers could monitor them without exposing unaligned content to end users. Commercial deployments document separate "reasoning tokens" that meter hidden thinking and a control for "reasoning effort" that tunes how much compute the model uses. These features make the models slower than ordinary chat systems while enabling stronger performance on difficult problems. == History == The research trajectory toward reasoning models combined advances in supervision, prompting, and search-style inference. Early alignment work on reinforcement learning from human feedback showed that models can be fine-tuned to follow instructions with "human feedback" and preference-based rewards. In 2022, Google Research scientists Jason Wei and Denny Zhou showed that chain-of-thought prompting "significantly improves the ability" of large models on complex reasoning tasks. Input → Step 1 → Step 2 → ⋯ → Step n ⏟ Reasoning chain → Answer {\displaystyle {\text{Input}}\rightarrow \underbrace {{\text{Step}}_{1}\rightarrow {\text{Step}}_{2}\rightarrow \cdots \rightarrow {\text{Step}}_{n}} _{\text{Reasoning chain}}\rightarrow {\text{Answer}}} A companion result demonstrated that the simple instruction "Let's think step by step" can elicit zero-shot reasoning. Follow-up work introduced self-consistency decoding, which "boosts the performance" of chain-of-thought by sampling diverse solution paths and choosing the consensus, and tool-augmented methods such as ReAct, a portmanteau of Reason and Act, that prompt models to "generate both reasoning traces" and actions. Research then generalized chain-of-thought into search over multiple candidate plans. The Tree-of-Thoughts framework from Princeton computer scientist Shunyu Yao proposes that models "perform deliberate decision making" by exploring and backtracking over a tree of intermediate thoughts. OpenAI's reported breakthrough focused on supervising reasoning processes rather than only outcomes, with Lightman et al.'s "Let's Verify Step by Step" reporting that rewarding each correct step "significantly outperforms outcome supervision" on challenging math problems and improves interpretability by aligning the chain-of-thought with human judgment. OpenAI's o1 announcement ties these strands together with a large-scale reinforcement learning algorithm that trains the model to refine its own chain of thought, and it reports that accuracy rises with more training compute and more time spent thinking at inference. Together, these developments define the core of reasoning models. They use supervision signals that evaluate the quality of intermediate steps, they exploit inference-time exploration such as consensus or tree search, and they expose controls for how much internal thinking compute to allocate. OpenAI's o1 family made this approach available at scale in September 2024 and popularized the label "reasoning model" for LLMs that deliberately think before they answer. The development of reasoning models illustrates Richard S. Sutton's "bitter lesson" that scaling compute typically outperforms methods based on human-designed insights. This principle was demonstrated by researchers at the Generative AI Research Lab (GAIR), who initially attempted to replicate o1's capabilities using sophisticated methods including tree search and reinforcement learning in late 2024. Their findings, published in the "o1 Replication Journey" series, revealed that knowledge distillation, a comparatively straightforward technique that trains a smaller model to mimic o1's outputs, produced unexpectedly strong performance. This outcome illustrated how direct scaling approaches can, at times, outperform more complex engineering solutions. === Drawbacks === Reasoning models require significantly more computational resources during inference compared to non-reasoning models. Research on the American Invitational Mathematics Examination (AIME) benchmark found that reasoning models were 10 to 74 times more expensive to operate than their non-reasoning counterparts. The extended inference time is attributed to the detailed, step-by-step reasoning outputs that these models generate, which are typically much longer than responses from standard large language models that provide direct answers without showing their reasoning process. One researcher in early 2025 argued that these models may face potential additional denial-of-service concerns with "overthinking attacks." === Releases === ==== 2024 ==== In September 2024, OpenAI released o1-preview, a large language model with enhanced reasoning capabilities. The full version, o1, was released in December 2024. OpenAI initially shared preliminary results on its successor model, o3, in December 2024, with the full o3 model becoming available in 2025. Alibaba released reasoning versions of its Qwen large language models in November 2024. In December 2024, the company introduced QvQ-72B-Preview, an experimental visual reasoning model. In December 2024, Google introduced Deep Research in Gemini, a feature designed to conduct multi-step research tasks. On December 16, 2024, researchers demonstrated that by scaling test-time compute, a relatively small Llama 3B model could outperform a much larger Llama 70B model on challenging reasoning tasks. This experiment suggested that improved inference strategies can unlock reasoning capabilities even in smaller models. ==== 2025 ==== In January 2025, DeepSeek released R1, a reasoning model that achieved performance comparable to OpenAI's o1 at significantly lower computational cost. The release demonstrated the effectiveness of Group Relative Policy Optimization (GRPO), a reinforcement learning technique used to train the model. On January 25, 2025, DeepSeek enhanced R1 with web search capabilities, allowing the model to retrieve information from the internet while performing reasoning tasks. Research during this period further validated the effectiveness of knowledge distillation for creating reasoning models. The s1-32B model achieved strong performance through budget forcing and scaling methods, reinforcing findings that simpler training approaches can be highly effective for reasoning capabilities. On February 2, 2025, OpenAI released Deep Research, a feature powered by their o3 model that enables users to conduct comprehensive research tasks. The system generates detailed reports by automatically gathering and synthesizing information from multiple web sources. OpenAI called GPT-4.5 its "last non-chain-of-thought model", and implemented with GPT-5 a router model that selects a model based on the difficulty of the task. ==== 2026 ==== In January 2026, Moonshot AI released Kimi K2.5, an open-source 1 trillion parameter MoE model with 32 billion active parameters. It uses an “Agent Swarm” system that dynamically decomposes tasks into sub-agents for reasoning and execution, enabling more scalable multi-step problem solving than a single sequential reasoning chain. == Training == Reasoning models follow the familiar large-scale pretraining used for frontier language models, then diverge in the post-training and optimization. OpenAI reports that o1 is trained with a large-

    Read more →
  • Intrinsic dimension

    Intrinsic dimension

    In mathematics, the intrinsic dimension of a subset can be thought of as the minimal number of variables needed to represent the subset. The concept has widespread applications in geometry, dynamical systems, signal processing, statistics, and other fields. Due to its widespread applications and vague conceptualization, there are many different ways to define it rigorously. Consequently, the same set might have different intrinsic dimensions according to different definitions. The intrinsic dimension can be used as a lower bound of what dimension it is possible to compress a data set into through dimension reduction, but it can also be used as a measure of the complexity of the data set or signal. For a data set or signal of N variables, its intrinsic dimension M satisfies 0 ≤ M ≤ N, although estimators may yield higher values. == Exact dimension == === Differential === In differential geometry, given a differentiable manifold N and a submanifold M, the intrinsic dimension of M is its dimension. Suppose N has n dimensions and M has m dimensions, then that means around any point in M, there exists a local coordinate system ( x 1 , … , x m , x m + 1 , … , x n ) {\displaystyle (x_{1},\dots ,x_{m},x_{m+1},\dots ,x_{n})} of N, such that the manifold M is simply the subset of N defined by x m + 1 = 0 , … , x n = 0 {\displaystyle x_{m+1}=0,\dots ,x_{n}=0} . === Metric === Given a mere metric space, we can still define its intrinsic dimension. The most general case is the Hausdorff dimension, though for metric spaces occurring in practice, the box-counting dimension and the packing dimension often are identical to the Hausdorff dimension. Let X , d {\textstyle X,d} be a metric space and A ⊂ X {\textstyle A\subset X} be totally bounded. Define the covering number N ( A , ε ) = min { k : A ⊂ ⋃ i = 1 k B ( x i , ε ) } . {\displaystyle N(A,\varepsilon )=\min \left\{k:A\subset \bigcup _{i=1}^{k}B\left(x_{i},\varepsilon \right)\right\}.} The metric entropy is H ( A , ε ) = log ⁡ N ( A , ε ) {\textstyle H(A,\varepsilon )=\log N(A,\varepsilon )} (any log base). The upper and lower metric entropy dimensions are dim ¯ E A = lim sup ε ↓ 0 H ( A , ε ) log ⁡ ( 1 / ε ) , dim _ E A = lim inf ε ↓ 0 H ( A , ε ) log ⁡ ( 1 / ε ) . {\displaystyle {\overline {\dim }}_{E}A=\limsup _{\varepsilon \downarrow 0}{\frac {H(A,\varepsilon )}{\log(1/\varepsilon )}},\quad {\underline {\dim }}_{E}A=\liminf _{\varepsilon \downarrow 0}{\frac {H(A,\varepsilon )}{\log(1/\varepsilon )}}.} If they are equal, then dim E ⁡ A {\textstyle \operatorname {dim} _{E}A} is that common value, called the metric entropy dimension. The entropy dimensions are usually used in information theory, and especially coding theory, since entropy is involved in its definition. === Topological === If X {\displaystyle X} is merely a topological space, then we can still define its intrinsic dimension, using the topological dimension or Lebesgue covering dimension. An open cover of a topological space X is a family of open sets Uα such that their union is the whole space, ∪ α {\displaystyle \cup _{\alpha }} Uα = X. The order or ply of an open cover A {\displaystyle {\mathfrak {A}}} = {Uα} is the smallest number m (if it exists) for which each point of the space belongs to at most m open sets in the cover: in other words Uα1 ∩ ⋅⋅⋅ ∩ Uαm+1 = ∅ {\displaystyle \emptyset } for α1, ..., αm+1 distinct. A refinement of an open cover A {\displaystyle {\mathfrak {A}}} = {Uα} is another open cover B {\displaystyle {\mathfrak {B}}} = {Vβ}, such that each Vβ is contained in some Uα. The covering dimension of a topological space X is defined to be the minimum value of n such that every finite open cover A {\displaystyle {\mathfrak {A}}} of X has an open refinement B {\displaystyle {\mathfrak {B}}} with order n + 1. The refinement B {\displaystyle {\mathfrak {B}}} can always be chosen to be finite. Thus, if n is finite, Vβ1 ∩ ⋅⋅⋅ ∩ Vβn+2 = ∅ {\displaystyle \emptyset } for β1, ..., βn+2 distinct. If no such minimal n exists, the space is said to have infinite covering dimension. == Introductory example == Let f ( x 1 , x 2 ) {\textstyle f(x_{1},x_{2})} be a two-variable function (or signal) which is of the form f ( x 1 , x 2 ) = g ( x 1 ) {\textstyle f(x_{1},x_{2})=g(x_{1})} for some one-variable function g which is not constant. This means that f varies, in accordance to g, with the first variable or along the first coordinate. On the other hand, f is constant with respect to the second variable or along the second coordinate. It is only necessary to know the value of one, namely the first, variable in order to determine the value of f. Hence, it is a two-variable function but its intrinsic dimension is one. A slightly more complicated example is f ( x 1 , x 2 ) = g ( x 1 + x 2 ) {\textstyle f(x_{1},x_{2})=g(x_{1}+x_{2})} . f is still intrinsic one-dimensional, which can be seen by making a variable transformation y 1 = x 1 + x 2 {\textstyle y_{1}=x_{1}+x_{2}} and y 2 = x 1 − x 2 {\textstyle y_{2}=x_{1}-x_{2}} which gives f ( y 1 + y 2 2 , y 1 − y 2 2 ) = g ( y 1 ) {\textstyle f\left({\frac {y_{1}+y_{2}}{2}},{\frac {y_{1}-y_{2}}{2}}\right)=g\left(y_{1}\right)} . Since the variation in f can be described by the single variable y1 its intrinsic dimension is one. For the case that f is constant, its intrinsic dimension is zero since no variable is needed to describe variation. For the general case, when the intrinsic dimension of the two-variable function f is neither zero or one, it is two. In the literature, functions which are of intrinsic dimension zero, one, or two are sometimes referred to as i0D, i1D or i2D, respectively. == Signal processing == In signal processing of multidimensional signals, the intrinsic dimension of the signal describes how many variables are needed to generate a good approximation of the signal. For an N-variable function f, the set of variables can be represented as an N-dimensional vector x: f = f ( x ) where x = ( x 1 , … , x N ) {\textstyle f=f\left(\mathbf {x} \right){\text{ where }}\mathbf {x} =\left(x_{1},\dots ,x_{N}\right)} . If for some M-variable function g and M × N matrix A it is the case that for all x; f ( x ) = g ( A x ) , {\textstyle f(\mathbf {x} )=g(\mathbf {Ax} ),} M is the smallest number for which the above relation between f and g can be found, then the intrinsic dimension of f is M. The intrinsic dimension is a characterization of f, it is not an unambiguous characterization of g nor of A. That is, if the above relation is satisfied for some f, g, and A, it must also be satisfied for the same f and g′ and A′ given by g ′ ( y ) = g ( B y ) {\textstyle g'\left(\mathbf {y} \right)=g\left(\mathbf {By} \right)} and A ′ = B − 1 A {\textstyle \mathbf {A'} =\mathbf {B} ^{-1}\mathbf {A} } where B is a non-singular M × M matrix, since f ( x ) = g ′ ( A ′ x ) = g ( B A ′ x ) = g ( A x ) {\textstyle f\left(\mathbf {x} \right)=g'\left(\mathbf {A'x} \right)=g\left(\mathbf {BA'x} \right)=g\left(\mathbf {Ax} \right)} . == The Fourier transform of signals of low intrinsic dimension == An N variable function which has intrinsic dimension M < N has a characteristic Fourier transform. Intuitively, since this type of function is constant along one or several dimensions its Fourier transform must appear like an impulse (the Fourier transform of a constant) along the same dimension in the frequency domain. === A simple example === Let f be a two-variable function which is i1D. This means that there exists a normalized vector n ∈ R 2 {\textstyle \mathbf {n} \in \mathbb {R} ^{2}} and a one-variable function g such that f ( x ) = g ( n T x ) {\textstyle f(\mathbf {x} )=g(\mathbf {n} ^{\operatorname {T} }\mathbf {x} )} for all x ∈ R 2 {\textstyle \mathbf {x} \in \mathbb {R} ^{2}} . If F is the Fourier transform of f (both are two-variable functions) it must be the case that F ( u ) = G ( n T u ) ⋅ δ ( m T u ) {\textstyle F\left(\mathbf {u} \right)=G\left(\mathbf {n} ^{\mathrm {T} }\mathbf {u} \right)\cdot \delta \left(\mathbf {m} ^{\mathrm {T} }\mathbf {u} \right)} . Here G is the Fourier transform of g (both are one-variable functions), δ is the Dirac impulse function and m is a normalized vector in R 2 {\textstyle \mathbb {R} ^{2}} perpendicular to n. This means that F vanishes everywhere except on a line which passes through the origin of the frequency domain and is parallel to m. Along this line F varies according to G. === The general case === Let f be an N-variable function which has intrinsic dimension M, that is, there exists an M-variable function g and M × N matrix A such that f ( x ) = g ( A x ) ∀ x {\textstyle f(\mathbf {x} )=g(\mathbf {Ax} )\quad \forall \mathbf {x} } . Its Fourier transform F can then be described as follows: F vanishes everywhere except for a subspace of dimension M The subspace M is spanned by the rows of the matrix A In the subspace, F varies according to G the Fourier transform of g == Generalizations == The type of intrinsic dimension described above assume

    Read more →
  • Rule-based machine translation

    Rule-based machine translation

    Rule-based machine translation (RBMT) is a classical approach of machine translation systems based on linguistic information about source and target languages. Such information is retrieved from (unilingual, bilingual or multilingual) dictionaries and grammars covering the main semantic, morphological, and syntactic regularities of each language. Having input sentences, an RBMT system generates output sentences on the basis of analysis of both the source and the target languages involved. RBMT has been progressively superseded by more efficient methods, particularly neural machine translation. == History == The first RBMT systems were developed in the early 1970s. The most important steps of this evolution were the emergence of the following RBMT systems: Systran Japanese MT systems Today, other common RBMT systems include: Apertium GramTrans == Types of RBMT == There are three different types of rule-based machine translation systems: Direct Systems (Dictionary Based Machine Translation) map input to output with basic rules. Transfer RBMT Systems (Transfer Based Machine Translation) employ morphological and syntactical analysis. Interlingual RBMT Systems (Interlingua) use an abstract meaning. RBMT systems can also be characterized as the systems opposite to Example-based Systems of Machine Translation (Example Based Machine Translation), whereas Hybrid Machine Translations Systems make use of many principles derived from RBMT. == Basic principles == The main approach of RBMT systems is based on linking the structure of the given input sentence with the structure of the demanded output sentence, necessarily preserving their unique meaning. The following example can illustrate the general frame of RBMT: A girl eats an apple. Source Language = English; Demanded Target Language = German Minimally, to get a German translation of this English sentence one needs: A dictionary that will map each English word to an appropriate German word. Rules representing regular English sentence structure. Rules representing regular German sentence structure. And finally, we need rules according to which one can relate these two structures together. Accordingly, we can state the following stages of translation: 1st: getting basic part-of-speech information of each source word: a = indef.article; girl = noun; eats = verb; an = indef.article; apple = noun 2nd: getting syntactic information about the verb "to eat": NP-eat-NP; here: eat – Present Simple, 3rd Person Singular, Active Voice 3rd: parsing the source sentence: (NP an apple) = the object of eat Often only partial parsing is sufficient to get to the syntactic structure of the source sentence and to map it onto the structure of the target sentence. 4th: translate English words into German a (category = indef.article) => ein (category = indef.article) girl (category = noun) => Mädchen (category = noun) eat (category = verb) => essen (category = verb) an (category = indef. article) => ein (category = indef.article) apple (category = noun) => Apfel (category = noun) 5th: Mapping dictionary entries into appropriate inflected forms (final generation): A girl eats an apple. => Ein Mädchen isst einen Apfel. == Ontologies == An ontology is a formal representation of knowledge that includes the concepts (such as objects, processes etc.) in a domain and some relations between them. If the stored information is of linguistic nature, one can speak of a lexicon. In NLP, ontologies can be used as a source of knowledge for machine translation systems. With access to a large knowledge base, rule-based systems can be enabled to resolve many (especially lexical) ambiguities on their own. In the following classic examples, as humans, we are able to interpret the prepositional phrase according to the context because we use our world knowledge, stored in our lexicons:I saw a man/star/molecule with a microscope/telescope/binoculars.Since the syntax does not change, a traditional rule-based machine translation system may not be able to differentiate between the meanings. With a large enough ontology as a source of knowledge however, the possible interpretations of ambiguous words in a specific context can be reduced. === Building ontologies === The ontology generated for the PANGLOSS knowledge-based machine translation system in 1993 may serve as an example of how an ontology for NLP purposes can be compiled: A large-scale ontology is necessary to help parsing in the active modules of the machine translation system. In the PANGLOSS example, about 50,000 nodes were intended to be subsumed under the smaller, manually-built upper (abstract) region of the ontology. Because of its size, it had to be created automatically. The goal was to merge the two resources LDOCE online and WordNet to combine the benefits of both: concise definitions from Longman, and semantic relations allowing for semi-automatic taxonomization to the ontology from WordNet. A definition match algorithm was created to automatically merge the correct meanings of ambiguous words between the two online resources, based on the words that the definitions of those meanings have in common in LDOCE and WordNet. Using a similarity matrix, the algorithm delivered matches between meanings including a confidence factor. This algorithm alone, however, did not match all meanings correctly on its own. A second hierarchy match algorithm was therefore created which uses the taxonomic hierarchies found in WordNet (deep hierarchies) and partially in LDOCE (flat hierarchies). This works by first matching unambiguous meanings, then limiting the search space to only the respective ancestors and descendants of those matched meanings. Thus, the algorithm matched locally unambiguous meanings (for instance, while the word seal as such is ambiguous, there is only one meaning of seal in the animal subhierarchy). Both algorithms complemented each other and helped constructing a large-scale ontology for the machine translation system. The WordNet hierarchies, coupled with the matching definitions of LDOCE, were subordinated to the ontology's upper region. As a result, the PANGLOSS MT system was able to make use of this knowledge base, mainly in its generation element. == Components == The RBMT system contains: a SL morphological analyser - analyses a source language word and provides the morphological information; a SL parser - is a syntax analyser which analyses source language sentences; a translator - used to translate a source language word into the target language; a TL morphological generator - works as a generator of appropriate target language words for the given grammatica information; a TL parser - works as a composer of suitable target language sentences; Several dictionaries - more specifically a minimum of three dictionaries: a SL dictionary - needed by the source language morphological analyser for morphological analysis, a bilingual dictionary - used by the translator to translate source language words into target language words, a TL dictionary - needed by the target language morphological generator to generate target language words. The RBMT system makes use of the following: a Source Grammar for the input language which builds syntactic constructions from input sentences; a Source Lexicon which captures all of the allowable vocabulary in the domain; Source Mapping Rules which indicate how syntactic heads and grammatical functions in the source language are mapped onto domain concepts and semantic roles in the interlingua; a Domain Model/Ontology which defines the classes of domain concepts and restricts the fillers of semantic roles for each class; Target Mapping Rules which indicate how domain concepts and semantic roles in the interlingua are mapped onto syntactic heads and grammatical functions in the target language; a Target Lexicon which contains appropriate target lexemes for each domain concept; a Target Grammar for the target language which realizes target syntactic constructions as linearized output sentences. == Advantages == No bilingual texts are required. This makes it possible to create translation systems for languages that have no texts in common, or even no digitized data whatsoever. Domain independent. Rules are usually written in a domain independent manner, so the vast majority of rules will "just work" in every domain, and only a few specific cases per domain may need rules written for them. No quality ceiling. Every error can be corrected with a targeted rule, even if the trigger case is extremely rare. This is in contrast to statistical systems where infrequent forms will be washed away by default. Total control. Because all rules are hand-written, you can easily debug a rule-based system to see exactly where a given error enters the system, and why. Reusability. Because RBMT systems are generally built from a strong source language analysis that is fed to a transfer step and target language generator, the source language analysis and targe

    Read more →
  • LanguageWare

    LanguageWare

    LanguageWare is a natural language processing (NLP) technology developed by IBM, which allows applications to process natural language text. It comprises a set of Java libraries that provide a range of NLP functions: language identification, text segmentation/tokenization, normalization, entity and relationship extraction, and semantic analysis and disambiguation. The analysis engine uses a finite-state machine approach at multiple levels, which aids its performance characteristics while maintaining a reasonably small footprint. The behaviour of the system is driven by a set of configurable lexico-semantic resources which describe the characteristics and domain of the processed language. A default set of resources comes as part of LanguageWare and these describe the native language characteristics, such as morphology, and the basic vocabulary for the language. Supplemental resources have been created that capture additional vocabularies, terminologies, rules and grammars, which may be generic to the language or specific to one or more domains. A set of Eclipse-based customization tooling, LanguageWare Resource Workbench, is available on IBM's alphaWorks site, and allows domain knowledge to be compiled into these resources and thereby incorporated into the analysis process. LanguageWare can be deployed as a set of UIMA-compliant annotators, Eclipse plug-ins or Web Services.

    Read more →
  • Microsoft Forms

    Microsoft Forms

    Microsoft Forms (formerly Office 365 Forms) is an online survey creator, part of Microsoft 365. == Usage == Forms allows users to create surveys and quizzes with automatic marking. The data can be exported to Microsoft Excel, Power BI dashboards and viewed live using the Present feature. == Phishing and fraud == Due to a wave of phishing attacks utilizing Microsoft 365 in early 2021, Microsoft uses algorithms to automatically detect and block phishing attempts with Microsoft Forms. Also, Microsoft advises Forms users not to submit personal information, such as passwords, in a form or survey. It also place a similar advisory underneath the “Submit” button in every form created with Forms, warning users not to give out their password.

    Read more →
  • Huawei Member Center

    Huawei Member Center

    Huawei Member Center is a benefits app which runs using Huawei Mobile Services. Originally launched in China, Huawei Member Center is now being developed primarily around devices such as P40 Pro and the Nova 7. == Membership Levels == The Huawei Member Center provides rewards in two primary ways, 1) device-specific & promotions and 2) via frequent use of Huawei products and apps, using points to redeem additional benefits. In China, Huawei members are already classified into three levels, the highest being “elite”. Membership level determines the level of perks received, from priority access to the service hotline, new device events & proprietary early-access opportunities. Huawei ran a number of member events in 2019 called "Huawei Member Day" to promote the Member Center including providing tips for the Mate 30 Pro and offering a 50Gb cloud storage upgrade to users. == HMC in China == Huawei Member Center Has seen significant adoption in China and the east, the rewards for use on the app have ranged from free book coupons, discounted travel and exclusive gifts of new devices, such as the Huawei Enjoy Z.

    Read more →
  • LRE Map

    LRE Map

    The LRE Map (Language Resources and Evaluation) is a freely accessible large database on resources dedicated to Natural language processing. The original feature of LRE Map is that the records are collected during the submission of different major Natural language processing conferences. The records are then cleaned and gathered into a global database called "LRE Map". The LRE Map is intended to be an instrument for collecting information about language resources and to become, at the same time, a community for users, a place to share and discover resources, discuss opinions, provide feedback, discover new trends, etc. It is an instrument for discovering, searching and documenting language resources, here intended in a broad sense, as both data and tools. The large amount of information contained in the Map can be analyzed in many different ways. For instance, the LRE Map can provide information about the most frequent type of resource, the most represented language, the applications for which resources are used or are being developed, the proportion of new resources vs. already existing ones, or the way in which resources are distributed to the community. == Context == Several institutions worldwide maintain catalogues of language resources (ELRA, LDC, NICT Universal Catalogue, ACL Data and Code Repository, OLAC, LT World, etc.) However, it has been estimated that only 10% of existing resources are known, either through distribution catalogues or via direct publicity by providers (web sites and the like). The rest remains hidden, the only occasions where it briefly emerges being when a resource is presented in the context of a research paper or report at some conference. Even in this case, nevertheless, it might be that a resource remains in the background simply because the focus of the research is not on the resource per se. == History == The LRE Map originated under the name "LREC Map" during the preparation of LREC 2010 conference. More specifically, the idea was discussed within the FlaReNet project, and in collaboration with ELRA and the Institute of Computational Linguistics of CNR in Pisa, the Map was put in place at LREC 2010. The LREC organizers asked the authors to provide some basic information about all the resources (in a broad sense, i.e. including tools, standards and evaluation packages), either used or created, described in their papers. All these descriptors were then gathered in a global matrix called the LREC Map. The same methodology and requirements from the authors has been then applied and extended to other conferences, namely COLING-2010, EMNLP-2010, RANLP-2011, LREC 2012, LREC 2014 and LREC 2016. After this generalization to other conferences, the LREC Map has been renamed as the LRE Map. == Size and content == The size of the database increases over time. The data collected amount to 4776 entries. Each resource is described according to the following attributes: Resource type, e.g. lexicon, annotation tool, tagger/parser. Resource production status, e.g. newly created finished, existing-updated. Resource availability, e.g. freely available, from data center. Resource modality, e.g. speech, written, sign language. Resource use, e.g. named entity recognition, language identification, machine translation. Resource language, e.g. English, 23 European Union languages, official languages of India. == Uses == The LRE map is a very important tool to chart the NLP field. Compared to other studied based on subjective scorings, the LRE map is made of real facts. The map has a great potential for many uses, in addition to being an information gathering tool: It is a great instrument for monitoring the evolution of the field (useful for funders), if applied in different contexts and times. It can be seen as a huge joint effort, the beginning of an even larger cooperative action not just among few leaders but among all the researchers. It is also an "educational" means towards the broad acknowledgment of the need of meta-research activities with the active involvement of many. It is also instrumental in introducing the new notion of "citation of resources" that could provide an award and a means of scholarly recognition for researchers engaged in resource creation. It is used to help the organization of the conferences of the field like LREC. == Derived matrices == The data were then cleaned and sorted by Joseph Mariani (CNRS-LIMSI IMMI) and Gil Francopoulo (CNRS-LIMSI IMMI + Tagmatica) in order to compute the various matrices of the final FLaReNet reports. One of them, the matrix for written data at LREC 2010 is as follows: English is the most studied language. Secondly, come French and German languages and then Italian and Spanish. == Future == The LRE Map has been extended to Language Resources and Evaluation Journal and other conferences.

    Read more →