AI Data Facility

AI Data Facility — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Graphics

    Graphics

    Graphics (from Ancient Greek γραφικός (graphikós) 'pertaining to drawing, painting, writing, etc.') are visual images or designs on some surface, such as a wall, canvas, screen, paper, or stone, to inform, illustrate, or entertain. In contemporary usage, it includes a pictorial representation of data, as in design and manufacture, in typesetting and the graphic arts, and in educational and recreational software. Images that are generated by a computer are called computer graphics. Examples are photographs, drawings, line art, mathematical graphs, line graphs, charts, diagrams, typography, numbers, symbols, geometric designs, maps, engineering drawings, or other images. Graphics often combine text, illustration, and color. Graphic design may consist of the deliberate selection, creation, or arrangement of typography alone, as in a brochure, flyer, poster, web site, or book without any other element. The objective can be clarity or effective communication, association with other cultural elements, or merely the creation of a distinctive style. Graphics can be functional or artistic. The latter can be a recorded version, such as a photograph, or an interpretation by a scientist to highlight essential features, or an artist, in which case the distinction with imaginary graphics may become blurred. It can also be used for architecture. == History == The earliest graphics known to anthropologists studying prehistoric periods are cave paintings and markings on boulders, bone, ivory, and antlers, which were created during the Upper Palaeolithic period from 40,000 to 10,000 B.C. or earlier. Many of these were found to record astronomical, seasonal, and chronological details. Some of the earliest graphics and drawings are known to the modern world, from almost 6,000 years ago, are that of engraved stone tablets and ceramic cylinder seals, marking the beginning of the historical periods and the keeping of records for accounting and inventory purposes. Records from Egypt predate these and papyrus was used by the Egyptians as a material on which to plan the building of pyramids; they also used slabs of limestone and wood. From 600 to 250 BC, the Greeks played a major role in geometry. They used graphics to represent their mathematical theories such as the Circle Theorem and the Pythagorean theorem. In art, "graphics" is often used to distinguish work in a monotone and made up of lines, as opposed to painting. === Drawing === Drawing generally involves making marks on a surface by applying pressure from a tool or moving a tool across a surface. In which a tool is always used as if there were no tools it would be art. Graphical drawing is an instrumental guided drawing. === Printmaking === Woodblock printing, including images is first seen in China after paper was invented (about A.D. 105). In the West, the main techniques have been woodcut, engraving and etching, but there are many others. ==== Etching ==== Etching is an intaglio method of printmaking in which the image is incised into the surface of a metal plate using an acid. The acid eats the metal, leaving behind roughened areas, or, if the surface exposed to the acid is very thin, burning a line into the plate. The use of the process in printmaking is believed to have been invented by Daniel Hopfer (c. 1470–1536) of Augsburg, Germany, who decorated armour in this way. Etching is also used in the manufacturing of printed circuit boards and semiconductor devices. === Line art === Line art is a rather non-specific term sometimes used for any image that consists of distinct straight and curved lines placed against a (usually plain) background, without gradations in shade (darkness) or hue (color) to represent two-dimensional or three-dimensional objects. Line art is usually monochromatic, although lines may be of different colors. === Illustration === An illustration is a visual representation such as a drawing, painting, photograph or other work of art that stresses the subject more than form. The aim of an illustration is to elucidate or decorate a story, poem or piece of textual information (such as a newspaper article), traditionally by providing a visual representation of something described in the text. The editorial cartoon, also known as a political cartoon, is an illustration containing a political or social message. Illustrations can be used to display a wide range of subject matter and serve a variety of functions, such as: giving faces to characters in a story displaying a number of examples of an item described in an academic textbook (e.g. A Typology) visualizing step-wise sets of instructions in a technical manual communicating subtle thematic tone in a narrative linking brands to the ideas of human expression, individuality, and creativity making a reader laugh or smile for fun (to make laugh) funny === Graphs === A graph or chart is a graphic that represents tabular or numeric data. Charts are often used to make it easier to understand large quantities of data and the relationships between different parts of the data. === Diagrams === A diagram is a simplified and structured visual representation of concepts, ideas, constructions, relations, statistical data, etc., used to visualize and clarify the topic. === Symbols === A symbol, in its basic sense, is a representation of a concept or quantity; i.e., an idea, object, concept, quality, etc. In more psychological and philosophical terms, all concepts are symbolic in nature, and representations for these concepts are simply token artifacts that are allegorical to (but do not directly codify) a symbolic meaning, or symbolism. === Maps === A map is a simplified depiction of a space, a navigational aid which highlights relations between objects within that space. Usually, a map is a two-dimensional, geometrically accurate representation of a three-dimensional space. One of the first 'modern' maps was made by Waldseemüller. === Photography === One difference between photography and other forms of graphics is that a photographer, in principle, just records a single moment in reality, with seemingly no interpretation. However, a photographer can choose the field of view and angle, and may also use other techniques, such as various lenses to choose the view or filters to change the colors. In recent times, digital photography has opened the way to an infinite number of fast, but strong, manipulations. Even in the early days of photography, there was controversy over photographs of enacted scenes that were presented as 'real life' (especially in war photography, where it can be very difficult to record the original events). Shifting the viewer's eyes ever so slightly with simple pinpricks in the negative could have a dramatic effect. The choice of the field of view can have a strong effect, effectively 'censoring out' other parts of the scene, accomplished by cropping them out or simply not including them in the photograph. This even touches on the philosophical question of what reality is. The human brain processes information based on previous experience, making us see what we want to see or what we were taught to see. Photography does the same, although the photographer interprets the scene for their viewer. === Engineering drawings === An engineering drawing is a type of drawing and is technical in nature, used to fully and clearly define requirements for engineered items. It is usually created in accordance with standardized conventions for layout, nomenclature, interpretation, appearance (such as typefaces and line styles), size, etc. === Computer graphics === There are two types of computer graphics: raster graphics, where each pixel is separately defined (as in a digital photograph), and vector graphics, where mathematical formulas are used to draw lines and shapes, which are then interpreted at the viewer's end to produce the graphic. Using vectors results in infinitely sharp graphics and often smaller files, but, when complex, like vectors take time to render and may have larger file sizes than a raster equivalent. In 1950, the first computer-driven display was attached to MIT's Whirlwind I computer to generate simple pictures. This was followed by MIT's TX-0 and TX-2, interactive computing which increased interest in computer graphics during the late 1950s. In 1962, Ivan Sutherland invented Sketchpad, an innovative program that influenced alternative forms of interaction with computers. In the mid-1960s, large computer graphics research projects were begun at MIT, General Motors, Bell Labs, and Lockheed Corporation. Douglas T. Ross of MIT developed an advanced compiler language for graphics programming. S.A.Coons, also at MIT, and J. C. Ferguson at Boeing, began work in sculptured surfaces. GM developed their DAC-1 system, and other companies, such as Douglas, Lockheed, and McDonnell, also made significant developments. In 1968, ray tracing was first described by Arthur Appel of the IBM Research Center, Yorktown Heights, N

    Read more →
  • Recursive neural network

    Recursive neural network

    A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input, to produce a structured prediction over variable-size input structures, or a scalar prediction on it, by traversing a given structure in topological order. These networks were first introduced to learn distributed representations of structure (such as logical terms), but have been successful in multiple applications, for instance in learning sequence and tree structures in natural language processing (mainly continuous representations of phrases and sentences based on word embeddings). == Architectures == === Basic === In the simplest architecture, nodes are combined into parents using a weight matrix (which is shared across the whole network) and a non-linearity such as the tanh {\displaystyle \tanh } hyperbolic function. If c 1 {\displaystyle c_{1}} and c 2 {\displaystyle c_{2}} are n {\displaystyle n} -dimensional vector representations of nodes, their parent will also be an n {\displaystyle n} -dimensional vector, defined as: p 1 , 2 = tanh ⁡ ( W [ c 1 ; c 2 ] ) {\displaystyle p_{1,2}=\tanh(W[c_{1};c_{2}])} where W {\displaystyle W} is a learned n × 2 n {\displaystyle n\times 2n} weight matrix. This architecture, with a few improvements, has been used for successfully parsing natural scenes, syntactic parsing of natural language sentences, and recursive autoencoding and generative modeling of 3D shape structures in the form of cuboid abstractions. === Recursive cascade correlation (RecCC) === RecCC is a constructive neural network approach to deal with tree domains with pioneering applications to chemistry and extension to directed acyclic graphs. === Unsupervised RNN === A framework for unsupervised RNN has been introduced in 2004. === Tensor === Recursive neural tensor networks use a single tensor-based composition function for all nodes in the tree. == Training == === Stochastic gradient descent === Typically, stochastic gradient descent (SGD) is used to train the network. The gradient is computed using backpropagation through structure (BPTS), a variant of backpropagation through time used for recurrent neural networks. == Properties == The universal approximation capability of RNNs over trees has been proved in literature. == Related models == === Recurrent neural networks === Recurrent neural networks are recursive artificial neural networks with a certain structure: that of a linear chain. Whereas recursive neural networks operate on any hierarchical structure, combining child representations into parent representations, recurrent neural networks operate on the linear progression of time, combining the previous time step and a hidden representation into the representation for the current time step. === Tree Echo State Networks === An efficient approach to implement recursive neural networks is given by the Tree Echo State Network within the reservoir computing paradigm. === Extension to graphs === Extensions to graphs include graph neural network (GNN), Neural Network for Graphs (NN4G), and more recently convolutional neural networks for graphs.

    Read more →
  • Linear classifier

    Linear classifier

    In machine learning, a linear classifier makes a classification decision for each object based on a linear combination of its features. A simpler definition is to say that a linear classifier is one whose decision boundaries are linear. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use. == Definition == If the input feature vector to the classifier is a real vector x → {\displaystyle {\vec {x}}} , then the output score is y = f ( w → ⋅ x → ) = f ( ∑ j w j x j ) , {\displaystyle y=f({\vec {w}}\cdot {\vec {x}})=f\left(\sum _{j}w_{j}x_{j}\right),} where w → {\displaystyle {\vec {w}}} is a real vector of weights and f is a function that converts the dot product of the two vectors into the desired output. (In other words, w → {\displaystyle {\vec {w}}} is a one-form or linear functional mapping x → {\displaystyle {\vec {x}}} onto R.) The weight vector w → {\displaystyle {\vec {w}}} is learned from a set of labeled training samples. Often f is a threshold function, which maps all values of w → ⋅ x → {\displaystyle {\vec {w}}\cdot {\vec {x}}} above a certain threshold to the first class and all other values to the second class; e.g., f ( x ) = { 1 if w T ⋅ x > θ , 0 otherwise {\displaystyle f(\mathbf {x} )={\begin{cases}1&{\text{if }}\ \mathbf {w} ^{T}\cdot \mathbf {x} >\theta ,\\0&{\text{otherwise}}\end{cases}}} The superscript T indicates the transpose and θ {\displaystyle \theta } is a scalar threshold. A more complex f might give the probability that an item belongs to a certain class. For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no". A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when x → {\displaystyle {\vec {x}}} is sparse. Also, linear classifiers often work very well when the number of dimensions in x → {\displaystyle {\vec {x}}} is large, as in document classification, where each element in x → {\displaystyle {\vec {x}}} is typically the number of occurrences of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized. == Generative models vs. discriminative models == There are two broad classes of methods for determining the parameters of a linear classifier w → {\displaystyle {\vec {w}}} . They can be generative and discriminative models. Methods of the former model joint probability distribution, whereas methods of the latter model conditional density functions P ( c l a s s | x → ) {\displaystyle P({\rm {class}}|{\vec {x}})} . Examples of such algorithms include: Linear Discriminant Analysis (LDA)—assumes Gaussian conditional density models Naive Bayes classifier with multinomial or multivariate Bernoulli event models. The second set of methods includes discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include: Logistic regression—maximum likelihood estimation of w → {\displaystyle {\vec {w}}} assuming that the observed training set was generated by a binomial model that depends on the output of the classifier. Perceptron—an algorithm that attempts to fix all errors encountered in the training set Fisher's Linear Discriminant Analysis—an algorithm (different than "LDA") that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification. Support vector machine—an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set. Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm: principal components analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact. Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models. All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space φ ( x → ) {\displaystyle \varphi ({\vec {x}})} , using the kernel trick. === Discriminative training === Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form arg ⁡ min w R ( w ) + C ∑ i = 1 N L ( y i , w T x i ) {\displaystyle {\underset {\mathbf {w} }{\arg \min }}\;R(\mathbf {w} )+C\sum _{i=1}^{N}L(y_{i},\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i})} where w is a vector of classifier parameters, L(yi, wTxi) is a loss function that measures the discrepancy between the classifier's prediction and the true output yi for the i'th training example, R(w) is a regularization function that prevents the parameters from getting too large (causing overfitting), and C is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization and the loss function. Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function R is convex, then the above is a convex problem. Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.

    Read more →
  • ARKA descriptors in QSAR

    ARKA descriptors in QSAR

    In computational chemistry and cheminformatics, ARKA descriptors in QSAR are a class of molecular descriptors used in quantitative structure–activity relationship (QSAR) modeling (or related approaches such as QSPR and QSTR), a computational method for predicting the biological activity or toxicity of chemical compounds based on their molecular structure. Molecular descriptors are numerical values that summarize information about a molecule's structure, topology, geometry, or physicochemical properties in a form suitable for machine learning or statistical modeling. ARKA (Arithmetic Residuals in K-Groups Analysis) descriptors differ from traditional descriptors by encoding atomic-level information through recursive autoregression techniques, which aim to capture subtle structural patterns and improve predictive accuracy. They are designed to be both interpretable and well-suited to modeling nonlinear relationships in QSAR studies. == Comparisons == While QSAR is essentially a similarity-based approach, the occurrence of activity/property cliffs may greatly reduce the predictive accuracy of the developed models. The novel Arithmetic Residuals in K-groups Analysis (ARKA) approach is a supervised dimensionality reduction technique developed by the DTC Laboratory, Jadavpur University that can easily identify activity cliffs in a data set. Activity cliffs are similar in their structures but differ considerably in their activity. The basic idea of the ARKA descriptors is to group the conventional QSAR descriptors based on a predefined criterion and then assign weightage to each descriptor in each group. ARKA descriptors have also been used to develop classification-based and regression-based QSAR models with acceptable quality statistics. The ARKA descriptors have been used for the identification of activity cliffs in QSAR studies and/or model development by multiple researchers. A tutorial presentation on the ARKA descriptors is available. Recently a multi-class ARKA framework has been proposed for improved q-RASAR model generation.

    Read more →
  • ROCm

    ROCm

    ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains, including general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), and heterogeneous computing. It offers several programming models: HIP (GPU-kernel-based programming), OpenMP (directive-based programming), and OpenCL. ROCm is free, libre and open-source software (except the GPU firmware blobs), and it is distributed under various licenses. The name initially stood for Radeon Open Compute platform; however, due to Open Compute being a registered trademark, the name no longer functions as an acronym. == Background == The first GPGPU software stack from ATI/AMD was Close to Metal, which became Stream. ROCm was launched around 2016 with the Boltzmann Initiative. ROCm stack builds upon previous AMD GPU stacks; some tools trace back to GPUOpen and others to the Heterogeneous System Architecture (HSA). === Heterogeneous System Architecture Intermediate Language === HSAIL was aimed at producing a middle-level, hardware-agnostic intermediate representation that could be JIT-compiled to the eventual hardware (GPU, FPGA...) using the appropriate finalizer. This approach was dropped for ROCm: now it builds only GPU code, using LLVM, and its AMDGPU backend that was upstreamed, although there is still research on such enhanced modularity with LLVM MLIR. == Programming abilities == ROCm as a stack ranges from the kernel driver to the end-user applications. AMD has introductory videos about AMD GCN hardware, and ROCm programming via its learning portal. One of the best technical introductions about the stack and ROCm/HIP programming, remains, to date, to be found on Reddit. == Hardware support == ROCm is primarily targeted at discrete professional GPUs, but consumer GPUs and APUs of the same architecture as a supported professional GPU are known to work with ROCm. For example, all professional GPUs of the RDNA 2 architecture are officially supported by ROCm 5.x; users report that Consumer RDNA2 units such as the Radeon 6800M APU and the Radeon 6700XT GPU also work. === Professional-grade GPUs === === Consumer-grade GPUs === == Software ecosystem == === Machine learning === Various deep learning frameworks have a ROCm backend: PyTorch TensorFlow ONNX MXNet CuPy MIOpen Caffe Iree (which uses LLVM Multi-Level Intermediate Representation (MLIR)) llama.cpp === Supercomputing === ROCm is gaining significant traction in the top 500. ROCm is used with the Exascale supercomputers El Capitan and Frontier. Some related software is to be found at AMD Infinity hub. === Other acceleration & graphics interoperation === As of version 3.0, Blender can now use HIP compute kernels for its renderer cycles. === Other languages === ==== Julia ==== Julia has the AMDGPU.jl package, which integrates with LLVM and selects components of the ROCm stack. Instead of compiling code through HIP, AMDGPU.jl uses Julia's compiler to generate LLVM IR directly, which is later consumed by LLVM to generate native device code. AMDGPU.jl uses ROCr's HSA implementation to upload native code onto the device and execute it, similar to how HIP loads its own generated device code. AMDGPU.jl also supports integration with ROCm's rocBLAS (for BLAS), rocRAND (for random number generation), and rocFFT (for FFTs). Future integration with rocALUTION, rocSOLVER, MIOpen, and certain other ROCm libraries is planned. === Software distribution === ==== Official ==== Installation instructions are provided for Linux and Windows in the official AMD ROCm documentation. ROCm software is currently spread across several public GitHub repositories. Within the main public meta-repository, there is an XML manifest for each official release: using git-repo, a version control tool built on top of Git, is the recommended way to synchronize with the stack locally. AMD starts distributing containerized applications for ROCm, notably scientific research applications gathered under AMD Infinity Hub. AMD distributes itself packages tailored to various Linux distributions. ==== Third-party ==== There is a growing third-party ecosystem packaging ROCm. Linux distributions are officially packaging (natively) ROCm, with various degrees of advancement: Arch Linux, Gentoo, Debian, Fedora , GNU Guix, and NixOS. There are Spack packages. == Components == There is one kernel-space component, ROCk, and the rest - there is roughly a hundred components in the stack - is made of user-space modules. The unofficial typographic policy is to use: uppercase ROC lowercase following for low-level libraries, i.e. ROCt, and the contrary for user-facing libraries, i.e. rocBLAS. AMD is active developing with the LLVM community, but upstreaming is not instantaneous, and as of January 2022, is still lagging. AMD still officially packages various LLVM forks for parts that are not yet upstreamed – compiler optimizations destined to remain proprietary, debug support, OpenMP offloading, etc. === Low-level === ==== ROCk – Kernel driver ==== ==== ROCm – Device libraries ==== Support libraries implemented as LLVM bitcode. These provide various utilities and functions for math operations, atomics, queries for launch parameters, on-device kernel launch, etc. ==== ROCt – Thunk ==== The thunk is responsible for all the thinking and queuing that goes into the stack. ==== ROCr – Runtime ==== The ROC runtime is a set of APIs/libraries that allows the launch of compute kernels by host applications. It is AMD's implementation of the HSA runtime API. It is different from the ROC Common Language Runtime. ==== ROCm – CompilerSupport ==== ROCm code object manager is in charge of interacting with LLVM intermediate representation. === Mid-level === ==== ROCclr Common Language Runtime ==== The common language runtime is an indirection layer adapting calls to ROCr on Linux and PAL on windows. It used to be able to route between different compilers, like the HSAIL-compiler. It is now being absorbed by the upper indirection layers (HIP and OpenCL). ==== OpenCL ==== ROCm ships its installable client driver (ICD) loader and an OpenCL implementation bundled together. As of January 2022, ROCm 4.5.2 ships OpenCL 2.2, and is lagging behind competition. ==== HIP – Heterogeneous Interface for Portability ==== The AMD implementation for its GPUs is called HIPAMD. There is also a CPU implementation mostly for demonstration purposes. ==== HIPCC ==== HIP builds a `HIPCC` compiler that either wraps Clang and compiles with LLVM open AMDGPU backend, or redirects to the NVIDIA compiler. ==== HIPIFY ==== HIPIFY is a source-to-source compiling tool. It translates CUDA to HIP and reverse, either using a Clang-based tool, or a sed-like Perl script. ==== GPUFORT ==== Like HIPIFY, GPUFORT is a tool compiling source code into other third-generation-language sources, allowing users to migrate from CUDA Fortran to HIP Fortran. It is also in the repertoire of research projects, even more so. === High-level === ROCm high-level libraries are usually consumed directly by application software, such as machine learning frameworks. Most of the following libraries are in the General Matrix Multiply (GEMM) category, which GPU architecture excels at. The majority of these user-facing libraries comes in dual-form: hip for the indirection layer that can route to Nvidia hardware, and roc for the AMD implementation. ==== rocBLAS / hipBLAS ==== rocBLAS and hipBLAS are central in high-level libraries, it is the AMD implementation for Basic Linear Algebra Subprograms. It uses the library Tensile privately. ==== rocSOLVER / hipSOLVER ==== This pair of libraries constitutes the LAPACK implementation for ROCm and is strongly coupled to rocBLAS. === Utilities === ROCm developer tools: Debug, tracer, profiler, System Management Interface, Validation suite, Cluster management. GPUOpen tools: GPU analyzer, memory visualizer... External tools: radeontop (TUI overview) == Comparison with competitors == ROCm competes with other GPU computing stacks: Nvidia CUDA and Intel OneAPI. === Nvidia CUDA === Nvidia's CUDA is closed-source, whereas AMD ROCm is open source. There is open-source software built on top of the closed-source CUDA, for instance RAPIDS. CUDA is able to run on consumer GPUs, whereas ROCm support is mostly offered for professional hardware such as AMD Instinct and AMD Radeon Pro. Nvidia provides a C/C++-centered frontend and its Parallel Thread Execution (PTX) LLVM GPU backend as the Nvidia CUDA Compiler (NVCC). === Intel OneAPI === All the oneAPI corresponding libraries are published on its GitHub Page. ==== Unified Acceleration Foundation (UXL) ==== Unified Acceleration Foundation (UXL) is a new technology consortium that are working on the continuation of the OneAPI initiative, with the goal to create a new open standard accelerator software ecosystem, related open standards and specification projects through Working Groups and Specia

    Read more →
  • Win–stay, lose–switch

    Win–stay, lose–switch

    In psychology, game theory, statistics, and machine learning, win–stay, lose–switch (also win–stay, lose–shift or Pavlov, named after Ivan Pavlov) is a heuristic learning strategy used to model learning in decision situations. It was first invented as an improvement over randomization in bandit problems. It was later applied to the prisoner's dilemma in order to model the evolution of altruism. In most versions, it starts either with a cooperate, then proceeds as always, or starts with a "probe" of cooperate-defect-cooperate to determine the other player's strategy. A mutual cooperation is regarded as a win. The learning rule bases its decision only on the outcome of the previous play. Outcomes are divided into successes (wins) and failures (losses). If the play on the previous round resulted in a success, then the agent plays the same strategy on the next round. Alternatively, if the play resulted in a failure the agent switches to another action. A large-scale empirical study of players of the game rock, paper, scissors shows that a variation of this strategy is adopted by real-world players of the game, instead of the Nash equilibrium strategy of choosing entirely at random between the three options.

    Read more →
  • PVLV

    PVLV

    The primary value learned value (PVLV) model is a possible explanation for the reward-predictive firing properties of dopamine (DA) neurons. It simulates behavioral and neural data on Pavlovian conditioning and the midbrain dopaminergic neurons that fire in proportion to unexpected rewards. It is an alternative to the temporal-differences (TD) algorithm. It is used as part of Leabra.

    Read more →
  • Distributional Soft Actor Critic

    Distributional Soft Actor Critic

    Distributional Soft Actor Critic (DSAC) is a suite of model-free off-policy reinforcement learning algorithms, tailored for learning decision-making or control policies in complex systems with continuous action spaces. Distinct from traditional methods that focus solely on expected returns, DSAC algorithms are designed to learn a Gaussian distribution over stochastic returns, called value distribution. This focus on Gaussian value distribution learning notably diminishes value overestimations, which in turn boosts policy performance. Additionally, the value distribution learned by DSAC can also be used for risk-aware policy learning. From a technical standpoint, DSAC is essentially a distributional adaptation of the well-established soft actor-critic (SAC) method. To date, the DSAC family comprises two iterations: the original DSAC-v1 and its successor, DSAC-T (also known as DSAC-v2), with the latter demonstrating superior capabilities over the Soft Actor-Critic (SAC) in Mujoco benchmark tasks. The source code for DSAC-T can be found at the following URL: Jingliang-Duan/DSAC-T. Both iterations have been integrated into an advanced, Pytorch-powered reinforcement learning toolkit named GOPS: GOPS (General Optimal control Problem Solver).

    Read more →
  • Flexidraw

    Flexidraw

    Flexidraw is a 1985 graphics computer program published by Inkwell Systems. == Gameplay == Flexidraw is a graphics program that allows users to produce drawings using a light pen and print them. == Reception == Roy Wagner reviewed the product for Computer Gaming World, and stated that "Of the many graphics programs available Flexidraw is certainly the best supported by it's [sic] parent company."

    Read more →
  • Promoter based genetic algorithm

    Promoter based genetic algorithm

    The promoter based genetic algorithm (PBGA) is a genetic algorithm for neuroevolution developed by F. Bellas and R.J. Duro in the Integrated Group for Engineering Research (GII) at the University of Coruña, in Spain. It evolves variable size feedforward artificial neural networks (ANN) that are encoded into sequences of genes for constructing a basic ANN unit. Each of these blocks is preceded by a gene promoter acting as an on/off switch that determines if that particular unit will be expressed or not. == PBGA basics == The basic unit in the PBGA is a neuron with all of its inbound connections as represented in the following figure: The genotype of a basic unit is a set of real valued weights followed by the parameters of the neuron and proceeded by an integer valued field that determines the promoter gene value and, consequently, the expression of the unit. By concatenating units of this type we can construct the whole network. With this encoding it is imposed that the information that is not expressed is still carried by the genotype in evolution but it is shielded from direct selective pressure, maintaining this way the diversity in the population, which has been a design premise for this algorithm. Therefore, a clear difference is established between the search space and the solution space, permitting information learned and encoded into the genotypic representation to be preserved by disabling promoter genes. == Results == The PBGA was originally presented within the field of autonomous robotics, in particular in the real time learning of environment models of the robot. It has been used inside the Multilevel Darwinist Brain (MDB) cognitive mechanism developed in the GII for real robots on-line learning. In another paper it is shown how the application of the PBGA together with an external memory that stores the successful obtained world models, is an optimal strategy for adaptation in dynamic environments. Recently, the PBGA has provided results that outperform other neuroevolutionary algorithms in non-stationary problems, where the fitness function varies in time.

    Read more →
  • Consensus clustering

    Consensus clustering

    Consensus clustering is a method of aggregating (potentially conflicting) results from multiple clustering algorithms. Also called cluster ensembles or aggregation of clustering (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete, even when the number of input clusterings is three. Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning. == Issues with existing clustering techniques == Current clustering techniques do not address all the requirements adequately. Dealing with large number of dimensions and large number of data items can be problematic because of time complexity; Effectiveness of the method depends on the definition of "distance" (for distance-based clustering) If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces. The result of the clustering algorithm (that, in many cases, can be arbitrary itself) can be interpreted in different ways. == Justification for using consensus clustering == There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence about the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive. Iterative descent clustering methods, such as the SOM and k-means clustering circumvent some of the shortcomings of hierarchical clustering by providing for univocally defined clusters and cluster boundaries. Consensus clustering provides a method that represents the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. It can provide data for a visualization tool to inspect cluster number, membership, and boundaries. However, they lack the intuitive and visual appeal of hierarchical clustering dendrograms, and the number of clusters must be chosen a priori. == The Monti consensus clustering algorithm == The Monti consensus clustering algorithm is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, K {\displaystyle K} . Given a dataset of N {\displaystyle N} total number of points to cluster, this algorithm works by resampling and clustering the data, for each K {\displaystyle K} and a N × N {\displaystyle N\times N} consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or not together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal K {\displaystyle K} . More specifically, given a set of points to cluster, D = { e 1 , e 2 , . . . e N } {\displaystyle D=\{e_{1},e_{2},...e_{N}\}} , let D 1 , D 2 , . . . , D H {\displaystyle D^{1},D^{2},...,D^{H}} be the list of H {\displaystyle H} perturbed (resampled) datasets of the original dataset D {\displaystyle D} , and let M h {\displaystyle M^{h}} denote the N × N {\displaystyle N\times N} connectivity matrix resulting from applying a clustering algorithm to the dataset D h {\displaystyle D^{h}} . The entries of M h {\displaystyle M^{h}} are defined as follows: M h ( i , j ) = { 1 , if points i and j belong to the same cluster 0 , otherwise {\displaystyle M^{h}(i,j)={\begin{cases}1,&{\text{if}}{\text{ points i and j belong to the same cluster}}\\0,&{\text{otherwise}}\end{cases}}} Let I h {\displaystyle I^{h}} be the N × N {\displaystyle N\times N} identicator matrix where the ( i , j ) {\displaystyle (i,j)} -th entry is equal to 1 if points i {\displaystyle i} and j {\displaystyle j} are in the same perturbed dataset D h {\displaystyle D^{h}} , and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration for the normalisation step. The consensus matrix C {\displaystyle C} is defined as the normalised sum of all connectivity matrices of all the perturbed datasets and a different one is calculated for every K {\displaystyle K} . C ( i , j ) = ( ∑ h = 1 H M h ( i , j ) ∑ h = 1 H I h ( i , j ) ) {\displaystyle C(i,j)=\left({\frac {\textstyle \sum _{h=1}^{H}M^{h}(i,j)\displaystyle }{\sum _{h=1}^{H}I^{h}(i,j)}}\right)} That is the entry ( i , j ) {\displaystyle (i,j)} in the consensus matrix is the number of times points i {\displaystyle i} and j {\displaystyle j} were clustered together divided by the total number of times they were selected together. The matrix is symmetric and each element is defined within the range [ 0 , 1 ] {\displaystyle [0,1]} . A consensus matrix is calculated for each K {\displaystyle K} to be tested, and the stability of each matrix, that is how far the matrix is towards a matrix of perfect stability (just zeros and ones) is used to determine the optimal K {\displaystyle K} . One way of quantifying the stability of the K {\displaystyle K} th consensus matrix is examining its CDF curve (see below). == Over-interpretation potential of the Monti consensus clustering algorithm == Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution as shown by Şenbabaoğlu et al. It has been shown that the Monti consensus clustering algorithm is able to claim apparent stability of chance partitioning of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study. If clusters are not well separated, consensus clustering could lead one to conclude apparent structure when there is none, or declare cluster stability when it is subtle. Identifying false positive clusters is a common problem throughout cluster research, and has been addressed by methods such as SigClust and the GAP-statistic. However, these methods rely on certain assumptions for the null model that may not always be appropriate. Şenbabaoğlu et al demonstrated the original delta K metric to decide K {\displaystyle K} in the Monti algorithm performed poorly, and proposed a new superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represent those with ambiguous assignments in different clustering runs. The proportion of ambiguous clustering (PAC) score measure quantifies this middle segment; and is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ∈ [0, 1] where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1=0.1 and u2=0.9). A low value of PAC indicates a flat middle segment, and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters by the K {\displaystyle K} value having the lowest PAC. == Related work == Clustering ensemble (Strehl and Ghosh): They considered various formulations for the problem, most of which reduce the problem to a hyper-graph partitioning problem. In one of their formulations they considered the same graph as in the correlation clustering problem. The solution they proposed is to compute the best k-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart. Clustering aggregation (Fern and Brodley): They applied the clustering aggregation idea to a collection of soft clusterings they obtained by random projections. They used an agglomerative algorithm

    Read more →
  • Latent and observable variables

    Latent and observable variables

    In statistics, latent variables (from Latin: present participle of lateo 'lie hidden') are variables that can only be inferred indirectly through a mathematical model from other observable variables that can be directly observed or measured. Such latent variable models are used in many disciplines, including engineering, medicine, ecology, physics, machine learning/artificial intelligence, natural language processing, bioinformatics, chemometrics, demography, economics, management, political science, psychology and the social sciences. Latent variables may correspond to aspects of physical reality. These could in principle be measured, but may not be for practical reasons. Among the earliest expressions of this idea is Francis Bacon's polemic the Novum Organum, itself a challenge to the more traditional logic expressed in Aristotle's Organon: But the latent process of which we speak, is far from being obvious to men’s minds, beset as they now are. For we mean not the measures, symptoms, or degrees of any process which can be exhibited in the bodies themselves, but simply a continued process, which, for the most part, escapes the observation of the senses. In this situation, the term hidden variables is commonly used, reflecting the fact that the variables are meaningful, but not observable. Other latent variables correspond to abstract concepts, like categories, behavioral or mental states, or data structures. The terms hypothetical variables or hypothetical constructs may be used in these situations. The use of latent variables can serve to reduce the dimensionality of data. Many observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data. In this sense, they serve a function similar to that of scientific theories. At the same time, latent variables link observable "sub-symbolic" data in the real world to symbolic data in the modeled world. == Examples == === Psychology === Latent variables, as created by factor analytic methods, generally represent "shared" variance, or the degree to which variables "move" together. Variables that have no correlation cannot result in a latent construct based on the common factor model. The "Big Five personality traits" have been inferred using factor analysis. extraversion spatial ability wisdom: “Two of the more predominant means of assessing wisdom include wisdom-related performance and latent variable measures.” Spearman's g, or the general intelligence factor in psychometrics === Economics === Examples of latent variables from the field of economics include quality of life, business confidence, morale, happiness and conservatism: these are all variables which cannot be measured directly. However, by linking these latent variables to other, observable variables, the values of the latent variables can be inferred from measurements of the observable variables. Quality of life is a latent variable which cannot be measured directly, so observable variables are used to infer quality of life. Observable variables to measure quality of life include wealth, employment, environment, physical and mental health, education, recreation and leisure time, and social belonging. === Medicine === Latent-variable methodology is used in many branches of medicine. A class of problems that naturally lend themselves to latent variables approaches are longitudinal studies where the time scale (e.g. age of participant or time since study baseline) is not synchronized with the trait being studied. For such studies, an unobserved time scale that is synchronized with the trait being studied can be modeled as a transformation of the observed time scale using latent variables. Examples of this include disease progression modeling and modeling of growth (see box). == Inferring latent variables == There exists a range of different model classes and methodology that make use of latent variables and allow inference in the presence of latent variables. Models include: linear mixed-effects models and nonlinear mixed-effects models Hidden Markov models Factor analysis Item response theory Analysis and inference methods include: Principal component analysis Instrumented principal component analysis Partial least squares regression Latent semantic analysis and probabilistic latent semantic analysis EM algorithms Metropolis–Hastings algorithm === Bayesian algorithms and methods === Bayesian statistics is often used for inferring latent variables. Latent Dirichlet allocation The Chinese restaurant process is often used to provide a prior distribution over assignments of objects to latent categories. The Indian buffet process is often used to provide a prior distribution over assignments of latent binary features to objects.

    Read more →
  • Pattern theory

    Pattern theory

    Pattern theory, formulated by Ulf Grenander, is a mathematical formalism to describe knowledge of the world as patterns. It differs from other approaches to artificial intelligence in that it does not begin by prescribing algorithms and machinery to recognize and classify patterns; rather, it prescribes a vocabulary to articulate and recast the pattern concepts in precise language. Broad in its mathematical coverage, Pattern Theory spans algebra and statistics, as well as local topological and global entropic properties. In addition to the new algebraic vocabulary, its statistical approach is novel in its aim to: Identify the hidden variables of a data set using real world data rather than artificial stimuli, which was previously commonplace. Formulate prior distributions for hidden variables and models for the observed variables that form the vertices of a Gibbs-like graph. Study the randomness and variability of these graphs. Create the basic classes of stochastic models applied by listing the deformations of the patterns. Synthesize (sample) from the models, not just analyze signals with them. The Brown University Pattern Theory Group was formed in 1972 by Ulf Grenander. Many mathematicians are currently working in this group, noteworthy among them being the Fields Medalist David Mumford. Mumford regards Grenander as his "guru" in Pattern Theory.

    Read more →
  • Variable-order Bayesian network

    Variable-order Bayesian network

    Variable-order Bayesian network (VOBN) models provide an important extension of both the Bayesian network models and the variable-order Markov models. VOBN models are used in machine learning in general and have shown great potential in bioinformatics applications. These models extend the widely used position weight matrix (PWM) models, Markov models, and Bayesian network (BN) models. In contrast to the BN models, where each random variable depends on a fixed subset of random variables, in VOBN models these subsets may vary based on the specific realization of observed variables. The observed realizations are often called the context and, hence, VOBN models are also known as context-specific Bayesian networks. The flexibility in the definition of conditioning subsets of variables turns out to be a real advantage in classification and analysis applications, as the statistical dependencies between random variables in a sequence of variables (not necessarily adjacent) may be taken into account efficiently, and in a position-specific and context-specific manner.

    Read more →
  • Multiclass classification

    Multiclass classification

    In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification). For example, deciding on whether an image is showing a banana, peach, orange, or an apple is a multiclass classification problem, with four possible classes (banana, peach, orange, apple), while deciding on whether an image contains an apple or not is a binary classification problem (with the two possible classes being: apple, no apple). While many classification algorithms (e.g., decision trees, k-NN, neural networks and multinomial logistic regression) naturally permit the use of more than two classes, some are by nature binary algorithms (e.g., classical binary support vector machine) and require decomposition strategies such as one-vs-all, one-vs-one, or ECOC to solve multiclass problems. Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance (e.g., predicting that an image contains both an apple and an orange, in the previous example). == Better-than-random multiclass models == From the confusion matrix of a multiclass model, we can determine whether a model does better than chance. Let K ≥ 3 {\displaystyle K\geq 3} be the number of classes, O {\displaystyle {\mathcal {O}}} a set of observations, y ^ : O → { 1 , . . . , K } {\displaystyle {\hat {y}}:{\mathcal {O}}\to \{1,...,K\}} a model of the target variable y : O → { 1 , . . . , K } {\displaystyle y:{\mathcal {O}}\to \{1,...,K\}} and n i , j {\displaystyle n_{i,j}} be the number of observations in the set { y = i } ∩ { y ^ = j } {\displaystyle \{y=i\}\cap \{{\hat {y}}=j\}} . We note n i . = ∑ j n i , j {\displaystyle n_{i.}=\sum _{j}n_{i,j}} , n . j = ∑ i n i , j {\displaystyle n_{.j}=\sum _{i}n_{i,j}} , n = ∑ j n . j = ∑ i n i . {\displaystyle n=\sum _{j}n_{.j}=\sum _{i}n_{i.}} , λ i = n i . n {\displaystyle \lambda _{i}={\frac {n_{i.}}{n}}} and μ j = n . j n {\displaystyle \mu _{j}={\frac {n_{.j}}{n}}} . It is assumed that the confusion matrix ( n i , j ) i , j {\displaystyle (n_{i,j})_{i,j}} contains at least one non-zero entry in each row, that is λ i > 0 {\displaystyle \lambda _{i}>0} for any i {\displaystyle i} . Finally we call "normalized confusion matrix" the matrix of conditional probabilities ( P ( y ^ = j ∣ y = i ) ) i , j = ( n i , j n i . ) i , j {\displaystyle (\mathbb {P} ({\hat {y}}=j\mid y=i))_{i,j}=\left({\frac {n_{i,j}}{n_{i.}}}\right)_{i,j}} . === Intuitive explanation === The lift is a way of measuring the deviation from independence of two events A {\displaystyle A} and B {\displaystyle B} : L i f t ( A , B ) = P ( A ∩ B ) P ( A ) P ( B ) = P ( A ∣ B ) P ( A ) = P ( B ∣ A ) P ( B ) {\displaystyle \mathrm {Lift} (A,B)={\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (A)\mathbb {P} (B)}}={\frac {\mathbb {P} (A\mid B)}{\mathbb {P} (A)}}={\frac {\mathbb {P} (B\mid A)}{\mathbb {P} (B)}}} We have L i f t ( A , B ) > 1 {\displaystyle \mathrm {Lift} (A,B)>1} if and only if events A {\displaystyle A} and B {\displaystyle B} occur simultaneously with a greater probability than if they were independent. In other words, if one of the two events occurs, the probability of observing the other event increases. A first condition to satisfy is to have L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} . And the quality of a model (better or worse than chance) does not change if we over- or undersample the dataset, that is if we multiply each row R i {\displaystyle R_{i}} of the confusion matrix by a constant c i {\displaystyle c_{i}} . Thus the second condition is that the necessary and sufficient conditions for doing better than chance need only depend on the normalized confusion matrix. The condition on lifts can be reformulated with One versus Rest binary models : for any i {\displaystyle i} , we define the binary target variable y i {\displaystyle y_{i}} which is the indicator of event { y = i } {\displaystyle \{y=i\}} , and the binary model y ^ i {\displaystyle {\hat {y}}_{i}} of y i {\displaystyle y_{i}} which is the indicator of event { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} . Each of the y ^ i {\displaystyle {\hat {y}}_{i}} models is a "One versus Rest" model. L i f t ( y = i , y ^ = i ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)} only depends on the events { y = i } {\displaystyle \{y=i\}} and { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} , so merging or not merging the other classes doesn't change its value. We therefore have L i f t ( y = i , y ^ = i ) = L i f t ( y i = 1 , y ^ i = 1 ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)=\mathrm {Lift} (y_{i}=1,{\hat {y}}_{i}=1)} and the first condition is that all binary One versus Rest models are better than chance. ==== Example ==== If K = 2 {\displaystyle K=2} and 2 is the class of interest , the normalized confusion matrix is ( s p e c i f i c i t y 1 − s p e c i f i c i t y 1 − s e n s i t i v i t y s e n s i t i v i t y ) {\displaystyle {\begin{pmatrix}\mathrm {specificity} &1-\mathrm {specificity} \\1-\mathrm {sensitivity} &\mathrm {sensitivity} \end{pmatrix}}} and we have L i f t ( y = 1 , y ^ = 1 ) − 1 = P ( y = y ^ = 1 ) λ 1 μ 1 − 1 = n 1 , 1 n n 1. n .1 − 1 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)-1={\frac {\mathbb {P} (y={\hat {y}}=1)}{\lambda _{1}\mu _{1}}}-1={\frac {n_{1,1}n}{n_{1.}n_{.1}}}-1} = n 1 , 1 ( n 1 , 1 + n 1 , 2 + n 2 , 1 + n 2 , 2 ) − ( n 1 , 1 + n 1 , 2 ) ( n 1 , 1 + n 2 , 1 ) n 1. n .1 = n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 n 1. n .1 {\displaystyle ={\frac {n_{1,1}(n_{1,1}+n_{1,2}+n_{2,1}+n_{2,2})-(n_{1,1}+n_{1,2})(n_{1,1}+n_{2,1})}{n_{1.}n_{.1}}}={\frac {n_{1,1}n_{2,2}-n_{1,2}n_{2,1}}{n_{1.}n_{.1}}}} . Thus L i f t ( y = 1 , y ^ = 1 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Similarly, by swapping the roles of 1 and 2, we find that L i f t ( y = 2 , y ^ = 2 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=2,{\hat {y}}=2)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Dividing by n 1. n 2. {\displaystyle n_{1.}n_{2.}} we find that the necessary and sufficient condition on the normalized confusion matrix is s e n s i t i v i t y s p e c i f i c i t y − ( 1 − s e n s i t i v i t y ) ( 1 − s p e c i f i c i t y ) ≥ 0 ⟺ s e n s i t i v i t y + s p e c i f i c i t y − 1 ≥ 0 ⟺ J ≥ 0 {\displaystyle \mathrm {sensitivity} \ \mathrm {specificity} -(1-\mathrm {sensitivity} )(1-\mathrm {specificity} )\geq 0\iff \mathrm {sensitivity} +\mathrm {specificity} -1\geq 0\iff J\geq 0} . This brings us back to the classical binary condition: Youden's J must be positive (or zero for random models). === Random models === A random model is a model that is independent of the target variable. This property is easily reformulated with the confusion matrix. This proposition shows that the model y ^ {\displaystyle {\hat {y}}} of y {\displaystyle y} is uninformative if and only if there are two families of numbers ( α i ) i {\displaystyle (\alpha _{i})_{i}} and ( β j ) j {\displaystyle (\beta _{j})_{j}} such that P ( { y = i } ∩ { y ^ = j } ) = α i β j {\displaystyle \mathbb {P} (\{y=i\}\cap \{{\hat {y}}=j\})=\alpha _{i}\beta _{j}} for any i {\displaystyle i} and j {\displaystyle j} . === Multiclass likelihood ratios and diagnostic odds ratios === We define generalized likelihood ratios calculated from the normalized confusion matrix: for any i {\displaystyle i} and j ≠ i {\displaystyle j\not =i} , let L R i , j = P ( y ^ = j ∣ y = j ) P ( y ^ = j ∣ y = i ) {\displaystyle \mathrm {LR} _{i,j}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)}}} . When K = 2 {\displaystyle K=2} , if 2 is the class of interest,, we find the classical likelihood ratios L R 1 , 2 = L R + {\displaystyle \mathrm {LR} _{1,2}=\mathrm {LR} _{+}} and L R 2 , 1 = 1 L R − {\displaystyle \mathrm {LR} _{2,1}={\frac {1}{\mathrm {LR} _{-}}}} . Multiclass diagnostic odds ratios can also be defined using the formula D O R i , j = D O R j , i = L R i , j L R j , i = n i , i n j , j n i , j n j , i = P ( y ^ = j ∣ y = j ) / P ( y ^ = i ∣ y = j ) P ( y ^ = j ∣ y = i ) / P ( y ^ = i ∣ y = i ) {\displaystyle \mathrm {DOR} _{i,j}=\mathrm {DOR} _{j,i}=\mathrm {LR} _{i,j}\mathrm {LR} _{j,i}={\frac {n_{i,i}n_{j,j}}{n_{i,j}n_{j,i}}}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)/\mathbb {P} ({\hat {y}}=i\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)/\mathbb {P} ({\hat {y}}=i\mid y=i)}}} We saw above that a better-than-chance model (or a random model) must verify L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} and λ i {\displaystyle \lambda _{i}} . According to the previous corollary, likelihood ratios are thus greater

    Read more →