AI Chat Sites

AI Chat Sites — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Semantic decomposition (natural language processing)

    Semantic decomposition (natural language processing)

    A semantic decomposition is an algorithm that breaks down the meanings of phrases or concepts into less complex concepts. The result of a semantic decomposition is a representation of meaning. This representation can be used for tasks, such as those related to artificial intelligence or machine learning. Semantic decomposition is common in natural language processing applications. The basic idea of a semantic decomposition is taken from the learning skills of adult humans, where words are explained using other words. It is based on Meaning-text theory. Meaning-text theory is used as a theoretical linguistic framework to describe the meaning of concepts with other concepts. == Background == Given that an AI does not inherently have language, it is unable to think about the meanings behind the words of a language. An artificial notion of meaning needs to be created for a strong AI to emerge. Creating an artificial representation of meaning requires the analysis of what meaning is. Many terms are associated with meaning, including semantics, pragmatics, knowledge and understanding or word sense. Each term describes a particular aspect of meaning, and contributes to a multitude of theories explaining what meaning is. These theories need to be analyzed further to develop an artificial notion of meaning best fit for our current state of knowledge. == Graph representations == Representing meaning as a graph is one of the two ways that both an AI cognition and a linguistic researcher think about meaning (connectionist view). Logicians utilize a formal representation of meaning to build upon the idea of symbolic representation, whereas description logics describe languages and the meaning of symbols. This contention between 'neat' and 'scruffy' techniques has been discussed since the 1970s. Research has so far identified semantic measures and with that word-sense disambiguation (WSD) - the differentiation of meaning of words - as the main problem of language understanding. As an AI-complete environment, WSD is a core problem of natural language understanding. AI approaches that use knowledge-given reasoning creates a notion of meaning combining the state of the art knowledge of natural meaning with the symbolic and connectionist formalization of meaning for AI. The abstract approach is shown in Figure. First, a connectionist knowledge representation is created as a semantic network consisting of concepts and their relations to serve as the basis for the representation of meaning. This graph is built out of different knowledge sources like WordNet, Wiktionary, and BabelNET. The graph is created by lexical decomposition that recursively breaks each concept semantically down into a set of semantic primes. The primes are taken from the theory of Natural Semantic Metalanguage, which has been analyzed for usefulness in formal languages. Upon this graph marker passing is used to create the dynamic part of meaning representing thoughts. The marker passing algorithm, where symbolic information is passed along relations form one concept to another, uses node and edge interpretation to guide its markers. The node and edge interpretation model is the symbolic influence of certain concepts. Future work uses the created representation of meaning to build heuristics and evaluate them through capability matching and agent planning, chatbots or other applications of natural language understanding.

    Read more →
  • Multiclass classification

    Multiclass classification

    In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification). For example, deciding on whether an image is showing a banana, peach, orange, or an apple is a multiclass classification problem, with four possible classes (banana, peach, orange, apple), while deciding on whether an image contains an apple or not is a binary classification problem (with the two possible classes being: apple, no apple). While many classification algorithms (e.g., decision trees, k-NN, neural networks and multinomial logistic regression) naturally permit the use of more than two classes, some are by nature binary algorithms (e.g., classical binary support vector machine) and require decomposition strategies such as one-vs-all, one-vs-one, or ECOC to solve multiclass problems. Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance (e.g., predicting that an image contains both an apple and an orange, in the previous example). == Better-than-random multiclass models == From the confusion matrix of a multiclass model, we can determine whether a model does better than chance. Let K ≥ 3 {\displaystyle K\geq 3} be the number of classes, O {\displaystyle {\mathcal {O}}} a set of observations, y ^ : O → { 1 , . . . , K } {\displaystyle {\hat {y}}:{\mathcal {O}}\to \{1,...,K\}} a model of the target variable y : O → { 1 , . . . , K } {\displaystyle y:{\mathcal {O}}\to \{1,...,K\}} and n i , j {\displaystyle n_{i,j}} be the number of observations in the set { y = i } ∩ { y ^ = j } {\displaystyle \{y=i\}\cap \{{\hat {y}}=j\}} . We note n i . = ∑ j n i , j {\displaystyle n_{i.}=\sum _{j}n_{i,j}} , n . j = ∑ i n i , j {\displaystyle n_{.j}=\sum _{i}n_{i,j}} , n = ∑ j n . j = ∑ i n i . {\displaystyle n=\sum _{j}n_{.j}=\sum _{i}n_{i.}} , λ i = n i . n {\displaystyle \lambda _{i}={\frac {n_{i.}}{n}}} and μ j = n . j n {\displaystyle \mu _{j}={\frac {n_{.j}}{n}}} . It is assumed that the confusion matrix ( n i , j ) i , j {\displaystyle (n_{i,j})_{i,j}} contains at least one non-zero entry in each row, that is λ i > 0 {\displaystyle \lambda _{i}>0} for any i {\displaystyle i} . Finally we call "normalized confusion matrix" the matrix of conditional probabilities ( P ( y ^ = j ∣ y = i ) ) i , j = ( n i , j n i . ) i , j {\displaystyle (\mathbb {P} ({\hat {y}}=j\mid y=i))_{i,j}=\left({\frac {n_{i,j}}{n_{i.}}}\right)_{i,j}} . === Intuitive explanation === The lift is a way of measuring the deviation from independence of two events A {\displaystyle A} and B {\displaystyle B} : L i f t ( A , B ) = P ( A ∩ B ) P ( A ) P ( B ) = P ( A ∣ B ) P ( A ) = P ( B ∣ A ) P ( B ) {\displaystyle \mathrm {Lift} (A,B)={\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (A)\mathbb {P} (B)}}={\frac {\mathbb {P} (A\mid B)}{\mathbb {P} (A)}}={\frac {\mathbb {P} (B\mid A)}{\mathbb {P} (B)}}} We have L i f t ( A , B ) > 1 {\displaystyle \mathrm {Lift} (A,B)>1} if and only if events A {\displaystyle A} and B {\displaystyle B} occur simultaneously with a greater probability than if they were independent. In other words, if one of the two events occurs, the probability of observing the other event increases. A first condition to satisfy is to have L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} . And the quality of a model (better or worse than chance) does not change if we over- or undersample the dataset, that is if we multiply each row R i {\displaystyle R_{i}} of the confusion matrix by a constant c i {\displaystyle c_{i}} . Thus the second condition is that the necessary and sufficient conditions for doing better than chance need only depend on the normalized confusion matrix. The condition on lifts can be reformulated with One versus Rest binary models : for any i {\displaystyle i} , we define the binary target variable y i {\displaystyle y_{i}} which is the indicator of event { y = i } {\displaystyle \{y=i\}} , and the binary model y ^ i {\displaystyle {\hat {y}}_{i}} of y i {\displaystyle y_{i}} which is the indicator of event { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} . Each of the y ^ i {\displaystyle {\hat {y}}_{i}} models is a "One versus Rest" model. L i f t ( y = i , y ^ = i ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)} only depends on the events { y = i } {\displaystyle \{y=i\}} and { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} , so merging or not merging the other classes doesn't change its value. We therefore have L i f t ( y = i , y ^ = i ) = L i f t ( y i = 1 , y ^ i = 1 ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)=\mathrm {Lift} (y_{i}=1,{\hat {y}}_{i}=1)} and the first condition is that all binary One versus Rest models are better than chance. ==== Example ==== If K = 2 {\displaystyle K=2} and 2 is the class of interest , the normalized confusion matrix is ( s p e c i f i c i t y 1 − s p e c i f i c i t y 1 − s e n s i t i v i t y s e n s i t i v i t y ) {\displaystyle {\begin{pmatrix}\mathrm {specificity} &1-\mathrm {specificity} \\1-\mathrm {sensitivity} &\mathrm {sensitivity} \end{pmatrix}}} and we have L i f t ( y = 1 , y ^ = 1 ) − 1 = P ( y = y ^ = 1 ) λ 1 μ 1 − 1 = n 1 , 1 n n 1. n .1 − 1 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)-1={\frac {\mathbb {P} (y={\hat {y}}=1)}{\lambda _{1}\mu _{1}}}-1={\frac {n_{1,1}n}{n_{1.}n_{.1}}}-1} = n 1 , 1 ( n 1 , 1 + n 1 , 2 + n 2 , 1 + n 2 , 2 ) − ( n 1 , 1 + n 1 , 2 ) ( n 1 , 1 + n 2 , 1 ) n 1. n .1 = n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 n 1. n .1 {\displaystyle ={\frac {n_{1,1}(n_{1,1}+n_{1,2}+n_{2,1}+n_{2,2})-(n_{1,1}+n_{1,2})(n_{1,1}+n_{2,1})}{n_{1.}n_{.1}}}={\frac {n_{1,1}n_{2,2}-n_{1,2}n_{2,1}}{n_{1.}n_{.1}}}} . Thus L i f t ( y = 1 , y ^ = 1 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Similarly, by swapping the roles of 1 and 2, we find that L i f t ( y = 2 , y ^ = 2 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=2,{\hat {y}}=2)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Dividing by n 1. n 2. {\displaystyle n_{1.}n_{2.}} we find that the necessary and sufficient condition on the normalized confusion matrix is s e n s i t i v i t y s p e c i f i c i t y − ( 1 − s e n s i t i v i t y ) ( 1 − s p e c i f i c i t y ) ≥ 0 ⟺ s e n s i t i v i t y + s p e c i f i c i t y − 1 ≥ 0 ⟺ J ≥ 0 {\displaystyle \mathrm {sensitivity} \ \mathrm {specificity} -(1-\mathrm {sensitivity} )(1-\mathrm {specificity} )\geq 0\iff \mathrm {sensitivity} +\mathrm {specificity} -1\geq 0\iff J\geq 0} . This brings us back to the classical binary condition: Youden's J must be positive (or zero for random models). === Random models === A random model is a model that is independent of the target variable. This property is easily reformulated with the confusion matrix. This proposition shows that the model y ^ {\displaystyle {\hat {y}}} of y {\displaystyle y} is uninformative if and only if there are two families of numbers ( α i ) i {\displaystyle (\alpha _{i})_{i}} and ( β j ) j {\displaystyle (\beta _{j})_{j}} such that P ( { y = i } ∩ { y ^ = j } ) = α i β j {\displaystyle \mathbb {P} (\{y=i\}\cap \{{\hat {y}}=j\})=\alpha _{i}\beta _{j}} for any i {\displaystyle i} and j {\displaystyle j} . === Multiclass likelihood ratios and diagnostic odds ratios === We define generalized likelihood ratios calculated from the normalized confusion matrix: for any i {\displaystyle i} and j ≠ i {\displaystyle j\not =i} , let L R i , j = P ( y ^ = j ∣ y = j ) P ( y ^ = j ∣ y = i ) {\displaystyle \mathrm {LR} _{i,j}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)}}} . When K = 2 {\displaystyle K=2} , if 2 is the class of interest,, we find the classical likelihood ratios L R 1 , 2 = L R + {\displaystyle \mathrm {LR} _{1,2}=\mathrm {LR} _{+}} and L R 2 , 1 = 1 L R − {\displaystyle \mathrm {LR} _{2,1}={\frac {1}{\mathrm {LR} _{-}}}} . Multiclass diagnostic odds ratios can also be defined using the formula D O R i , j = D O R j , i = L R i , j L R j , i = n i , i n j , j n i , j n j , i = P ( y ^ = j ∣ y = j ) / P ( y ^ = i ∣ y = j ) P ( y ^ = j ∣ y = i ) / P ( y ^ = i ∣ y = i ) {\displaystyle \mathrm {DOR} _{i,j}=\mathrm {DOR} _{j,i}=\mathrm {LR} _{i,j}\mathrm {LR} _{j,i}={\frac {n_{i,i}n_{j,j}}{n_{i,j}n_{j,i}}}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)/\mathbb {P} ({\hat {y}}=i\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)/\mathbb {P} ({\hat {y}}=i\mid y=i)}}} We saw above that a better-than-chance model (or a random model) must verify L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} and λ i {\displaystyle \lambda _{i}} . According to the previous corollary, likelihood ratios are thus greater

    Read more →
  • Ordination (statistics)

    Ordination (statistics)

    Ordination or gradient analysis, in multivariate analysis, is a method complementary to data clustering, and used mainly in exploratory data analysis (rather than in hypothesis testing). In contrast to cluster analysis, ordination orders quantities in a (usually lower-dimensional) latent space. In the ordination space, quantities that are near each other share attributes (i.e., are similar to some degree), and dissimilar objects are farther from each other. Such relationships between the objects, on each of several axes or latent variables, are then characterized numerically and/or graphically in a biplot. The first ordination method, principal components analysis, was suggested by Karl Pearson in 1901. == Methods == Ordination methods can broadly be categorized in eigenvector-, algorithm-, or model-based methods. Many classical ordination techniques, including principal components analysis, correspondence analysis (CA) and its derivatives (detrended correspondence analysis, canonical correspondence analysis, and redundancy analysis, belong to the first group). The second group includes some distance-based methods such as non-metric multidimensional scaling, and machine learning methods such as T-distributed stochastic neighbor embedding and nonlinear dimensionality reduction. The third group includes model-based ordination methods, which can be considered as multivariate extensions of Generalized Linear Models. Model-based ordination methods are more flexible in their application than classical ordination methods, so that it is for example possible to include random-effects. Unlike in the aforementioned two groups, there is no (implicit or explicit) distance measure in the ordination. Instead, a distribution needs to be specified for the responses as is typical for statistical models. These and other assumptions, such as the assumed mean-variance relationship, can be validated with the use of residual diagnostics, unlike in other ordination methods. == Applications == Ordination can be used on the analysis of any set of multivariate objects. It is frequently used in several environmental or ecological sciences, particularly plant community ecology. It is also used in genetics and systems biology for microarray data analysis and in psychometrics.

    Read more →
  • Quantum neural network

    Quantum neural network

    Quantum neural networks are computational neural network models which are based on the principles of quantum mechanics. The first ideas on quantum neural computation were published independently in 1995 by Subhash Kak and Ron Chrisley, engaging with the theory of quantum mind, which posits that quantum effects play a role in cognitive function. However, typical research in quantum neural networks involves combining classical artificial neural network models (which are widely used in machine learning for the important task of pattern recognition) with the advantages of quantum information in order to develop more efficient algorithms. One important motivation for these investigations is the difficulty to train classical neural networks, especially in big data applications. The hope is that features of quantum computing such as quantum parallelism or the effects of interference and entanglement can be used as resources. Since the technological implementation of a quantum computer is still in a premature stage, such quantum neural network models are mostly theoretical proposals that await their full implementation in physical experiments. Most Quantum neural networks are developed as feed-forward networks. Similar to their classical counterparts, this structure intakes input from one layer of qubits, and passes that input onto another layer of qubits. This layer of qubits evaluates this information and passes on the output to the next layer. Eventually the path leads to the final layer of qubits. The layers do not have to be of the same width, meaning they don't have to have the same number of qubits as the layer before or after it. This structure is trained on which path to take similar to classical artificial neural networks. This is discussed in a lower section. Quantum neural networks refer to three different categories: Quantum computer with classical data, classical computer with quantum data, and quantum computer with quantum data. == Examples == Quantum neural network research is still in its infancy, and a conglomeration of proposals and ideas of varying scope and mathematical rigor have been put forward. Most of them are based on the idea of replacing classical binary or McCulloch-Pitts neurons with a qubit (which can be called a "quron"), resulting in neural units that can be in a superposition of the state 'firing' and 'resting'. === Quantum perceptrons === A lot of proposals attempt to find a quantum equivalent for the perceptron unit from which neural nets are constructed. A problem is that nonlinear activation functions do not immediately correspond to the mathematical structure of quantum theory, since a quantum evolution is described by linear operations and leads to probabilistic observation. Ideas to imitate the perceptron activation function with a quantum mechanical formalism reach from special measurements to postulating non-linear quantum operators (a mathematical framework that is disputed). A direct implementation of the activation function using the circuit-based model of quantum computation has recently been proposed by Schuld, Sinayskiy and Petruccione based on the quantum phase estimation algorithm. === Quantum networks === At a larger scale, researchers have attempted to generalize neural networks to the quantum setting. One way of constructing a quantum neuron is to first generalise classical neurons and then generalising them further to make unitary gates. Interactions between neurons can be controlled quantumly, with unitary gates, or classically, via measurement of the network states. This high-level theoretical technique can be applied broadly, by taking different types of networks and different implementations of quantum neurons, such as photonically implemented neurons and quantum reservoir processor (quantum version of reservoir computing). Most learning algorithms follow the classical model of training an artificial neural network to learn the input-output function of a given training set and use classical feedback loops to update parameters of the quantum system until they converge to an optimal configuration. Learning as a parameter optimisation problem has also been approached by adiabatic models of quantum computing. Quantum neural networks can be applied to algorithmic design: given qubits with tunable mutual interactions, one can attempt to learn interactions following the classical backpropagation rule from a training set of desired input-output relations, taken to be the desired output algorithm's behavior. The quantum network thus 'learns' an algorithm. === Quantum associative memory === The first quantum associative memory algorithm was introduced by Dan Ventura and Tony Martinez in 1999. The authors do not attempt to translate the structure of artificial neural network models into quantum theory, but propose an algorithm for a circuit-based quantum computer that simulates associative memory. The memory states (in Hopfield neural networks saved in the weights of the neural connections) are written into a superposition, and a Grover-like quantum search algorithm retrieves the memory state closest to a given input. As such, this is not a fully content-addressable memory, since only incomplete patterns can be retrieved. The first truly content-addressable quantum memory, which can retrieve patterns also from corrupted inputs, was proposed by Carlo A. Trugenberger. Both memories can store an exponential (in terms of n qubits) number of patterns but can be used only once due to the no-cloning theorem and their destruction upon measurement. Trugenberger, however, has shown that his probabilistic model of quantum associative memory can be efficiently implemented and re-used multiples times for any polynomial number of stored patterns, a large advantage with respect to classical associative memories. === Classical neural networks inspired by quantum theory === A substantial amount of interest has been given to a "quantum-inspired" model that uses ideas from quantum theory to implement a neural network based on fuzzy logic. == Training == Quantum Neural Networks can be theoretically trained similarly to training classical/artificial neural networks. A key difference lies in communication between the layers of a neural networks. For classical neural networks, at the end of a given operation, the current perceptron copies its output to the next layer of perceptron(s) in the network. However, in a quantum neural network, where each perceptron is a qubit, this would violate the no-cloning theorem. A proposed generalized solution to this is to replace the classical fan-out method with an arbitrary unitary that spreads out, but does not copy, the output of one qubit to the next layer of qubits. Using this fan-out Unitary ( U f {\displaystyle U_{f}} ) with a dummy state qubit in a known state (Ex. | 0 ⟩ {\displaystyle |0\rangle } in the computational basis), also known as an Ancilla bit, the information from the qubit can be transferred to the next layer of qubits. This process adheres to the quantum operation requirement of reversibility. Using this quantum feed-forward network, deep neural networks can be executed and trained efficiently. A deep neural network is essentially a network with many hidden-layers, as seen in the sample model neural network above. Since the Quantum neural network being discussed uses fan-out Unitary operators, and each operator only acts on its respective input, only two layers are used at any given time. In other words, no Unitary operator is acting on the entire network at any given time, meaning the number of qubits required for a given step depends on the number of inputs in a given layer. Since Quantum Computers are notorious for their ability to run multiple iterations in a short period of time, the efficiency of a quantum neural network is solely dependent on the number of qubits in any given layer, and not on the depth of the network. === Cost functions === To determine the effectiveness of a neural network, a cost function is used, which essentially measures the proximity of the network's output to the expected or desired output. In a Classical Neural Network, the weights ( w {\displaystyle w} ) and biases ( b {\displaystyle b} ) at each step determine the outcome of the cost function C ( w , b ) {\displaystyle C(w,b)} . When training a Classical Neural network, the weights and biases are adjusted after each iteration, and given equation 1 below, where y ( x ) {\displaystyle y(x)} is the desired output and a out ( x ) {\displaystyle a^{\text{out}}(x)} is the actual output, the cost function is optimized when C ( w , b ) {\displaystyle C(w,b)} = 0. For a quantum neural network, the cost function is determined by measuring the fidelity of the outcome state ( ρ out {\displaystyle \rho ^{\text{out}}} ) with the desired outcome state ( ϕ out {\displaystyle \phi ^{\text{out}}} ), seen in Equation 2 below. In this case, the Unitary operators are adjusted after each it

    Read more →
  • Sarpa (snakebite app)

    Sarpa (snakebite app)

    Sarpa or SARPA (Snake Awareness, Rescue and Protection app) is a snakebite app, an application for mobile devices developed in India to provide rapid, life-saving help for victims of snakebite, which kill an estimated 58,000 people a year in India. The app provides information about snakes, gets fast aid for people bitten, and helps in the development of antivenoms. Similar systems developed in India include SnakeHub, Snake Lens, Snakepedia, Serpent and the Big Four Mapping Project. The apps provide rapid response to snakebite incidents, often in remote areas, using a network of volunteers managed by local wildlife departments; their use can save human lives by providing rapid medical care, and also snakes, by helping to avoid interaction between the species. In 2026, it was announced that the app had plans to offer real-time contact from doctors directly from the app to provide users with decision-making advice.

    Read more →
  • Conference on Computer Vision and Pattern Recognition

    Conference on Computer Vision and Pattern Recognition

    The Conference on Computer Vision and Pattern Recognition is an annual conference on computer vision and pattern recognition. == Affiliations == The conference was first held in 1983 in Washington, DC, organized by Takeo Kanade and Dana H. Ballard. From 1985 to 2010 it was sponsored by the IEEE Computer Society. In 2011 it was also co-sponsored by University of Colorado Colorado Springs. Since 2012 it has been co-sponsored by the IEEE Computer Society and the Computer Vision Foundation, which provides open access to the conference papers. == Scope == The conference considers a wide range of topics related to computer vision and pattern recognition—basically any topic that is extracting structures or answers from images or video or applying mathematical methods to data to extract or recognize patterns. Common topics include object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning. The conference generally has less than 30% acceptance rates for all papers and less than 5% for oral presentations. It is managed by a rotating group of volunteers who are chosen in a public election at the Pattern Analysis and Machine Intelligence-Technical Community (PAMI-TC) meeting four years before the meeting. The conference uses a multi-tier double-blind peer review process. The program chairs, who cannot submit papers, select area chairs who manage the reviewers for their subset of submissions. == Location and time == The conference is usually held in June in North America. == Awards == === Best Paper Award === These awards are picked by committees delegated by the program chairs of the conference. === Longuet-Higgins Prize === The Longuet-Higgins Prize recognizes papers from ten years ago that have made a significant impact on computer vision research. === PAMI Young Researcher Award === The Pattern Analysis and Machine Intelligence Young Researcher Award is an award given by the Technical Committee on Pattern Analysis and Machine Intelligence of the IEEE Computer Society to a researcher within 7 years of completing their Ph.D. for outstanding early career research contributions. Candidates are nominated by the computer vision community, with winners selected by a committee of senior researchers in the field. This award was originally instituted in 2012 by the journal Image and Vision Computing, also presented at the conference, and the journal continues to sponsor the award. === PAMI Thomas S. Huang Memorial Prize === The Thomas Huang Memorial Prize was established at the 2020 conference and is awarded annually starting from 2021 to honor researchers who are recognized as examples in research, teaching/mentoring, and service to the computer vision community.

    Read more →
  • Prescription monitoring program

    Prescription monitoring program

    In the United States, prescription monitoring programs (PMPs) or prescription drug monitoring programs (PDMPs) are state-run programs which collect and distribute data about the prescription and dispensation of federally controlled substances and, depending on state requirements, other potentially abusable prescription drugs. PMPs are meant to help prevent adverse drug-related events such as opioid overdoses, drug diversion, and substance abuse by decreasing the amount and/or frequency of opioid prescribing, and by identifying those patients who are obtaining prescriptions from multiple providers (i.e., "doctor shopping") or those physicians overprescribing opioids. Most US health care workers support the idea of PMPs, which intend to assist physicians, physician assistants, nurse practitioners, dentists and other prescribers, the pharmacists, chemists and support staff of dispensing establishments. The database, whose use is required by State law, typically requires prescribers and pharmacies dispensing controlled substances to register with their respective state PMPs and (for pharmacies and providers who dispense from their offices) to report the dispensation of such prescriptions to an electronic online database. The majority of PMPs are authorized to notify law enforcement agencies or licensing boards or physicians when a prescriber, or patients receiving prescriptions, exceed thresholds established by the state or prescription recipient exceeds thresholds established by the State. All states have implemented PDMPs, although evidence for the effectiveness of these programs is mixed. While prescription of opioids has decreased with PMP use, overdose deaths in many states have actually increased, with those states sharing data with neighboring jurisdictions or requiring reporting of more drugs experiencing highest increases in deaths. This may be because those declined opioid prescriptions turn to street drugs, whose potency and contaminants carry greater overdose risk. == History == Prescription drug monitoring programs, or PDMPs, are an example of one initiative proposed to alleviate effects of the opioid crisis. The programs are designed to restrict prescription drug abuse by limiting a patient's ability to obtain similar prescriptions from multiple providers (i.e. “doctor shopping”) and reducing diversion of controlled substances. This is meant to reduce risk of fatal overdose caused by high doses of opioids or interactions between opioids and benzodiazepenes, and to enable better decision making on the part of healthcare providers who may be unaware of a patient's prescription drug use, history or other prescriptions. PDMPs have been implemented in state legislations since 1939 in California, a time before electronic medical records, though implementation rose alongside increased awareness of overprescribing of opioids and overdose. A later New York state program was struck down by the U.S. Supreme Court in Whalen v. Roe. But, by 2019, 49 states, the District of Columbia, and Guam had enacted PDMP legislation. In 2021 Missouri, the last State to not use a PMP, adopted legislation to create one. PMPs are constantly being updated to increase speed of data collection, sharing of data across States, and ease of interpretation. This is being done by integrating PDMP reports with other health information technologies such as health information exchanges (HIE), electronic health record (EHR) systems, and/ or pharmacy dispensing software systems. One program that has been implemented in nine states is called the PDMP Electronic Health Records Integration and Interoperability Expansion, also known as PEHRIIE. Another software, marketed by Bamboo Health and integrated with PMPs in 43 states, uses an algorithm to track factors thought to increase risk of diversion, abuse or overdose, and assigns patients a three digit score based on presumed indicators of risk. While some studies have suggested that PDMP-HIT integration and sharing of interstate data brings benefits such as reduced opioid-related inpatient morbidity, others have found no or negative impact on mortality compared to states without PMP data sharing. Patient and media reports suggest need for testing and evaluation of algorithmic software used to score risk, with some patients reporting denial of prescriptions without c explanation or clarity of data. == Goals == Most health care workers support PMPs which intend to assist physicians, physician assistants, nurse practitioners, dentists and other prescribers, the pharmacists, chemists and support staff of dispensing establishments, as well as law-enforcement agencies. The collaboration supports the legitimate medical use of controlled substances while limiting their abuse and diversion. Pharmacies dispensing controlled substances and prescribers typically must register with their respective state PMPs and (for pharmacies and providers who dispense controlled substances from their offices) report the dispensation to an electronic online database. Some pharmacy software can submit these reports automatically to multiple states. == Usage == === List of programs by state === === Software systems === NarxCare is a prescription drug monitoring program (PDMP) run by Bamboo Health. Bamboo Health was formerly known as Appriss. It is widely used across the United States by pharmacies including Rite Aid as well as those at Walmart and Sam’s Club. The NarxCare software allows doctors to view data about a patient, combining data from the prescription registries of various U.S. states to make the registries interoperable nationally. It also uses machine learning to generate an "Overdose Risk Score" that potentially includes EMS and criminal justice data; these scores have been criticized by researchers and patient advocates for the lack of transparency in the process as well as the potential for disparate treatment of women and minority groups. Advertised as an "analytics tool and care management platform", the NarxCare software allows doctors to view data about a patient including how many pharmacies they have visited and the combinations of medication they are prescribed. It combines data from the prescription registries of various U.S. states, making the registries interoperable nationally. It additionally uses machine learning to generate various three-digit "risk scores" and an overall "Overdose Risk Score", collectively referred to as Narx Scores, in a process that potentially includes EMS and criminal justice data as well as court records. == Controversy == Many doctors and researchers support the idea of PDMPs as a tool in combatting the opioid epidemic. Opioid prescribing, opioid diversion and supply, opioid misuse, and opioid-related morbidity and mortality are common elements in data entered into PDMPs. Prescription Monitoring Programs are purported to offer economic benefits for the states who implement them by decreasing overall health care costs, lost productivity, and investigation times. However, there are many studies that conclude the impact of PDMPs is unclear. While use of PMPs has been accompanied by decrease in opioid prescribing, few analyses consider corresponding use of street opioids, extramedical use, or diversion, which might provide a more holistic method for evaluation of PMP intent and efficacy. Evidence for PDMP impact on fatal overdoses is decidedly mixed, with multiple studies finding increased overdose rates in some states, decreases in others, or no clear impact. Interestingly, an increase in heroin overdoses after PDMP implementation has been commonly reported, presumably as denial of prescription opioids sends patients in search of street drugs. Narx Scores have been criticized by researchers and patient advocates for the lack of transparency in the generation process as well as the potential for disparate treatment of women and minority groups. Writing in Duke Law Journal, Jennifer Oliva stated that "black-box algorithms" are used to generate the scores.

    Read more →
  • Independent component analysis

    Independent component analysis

    In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that at most one subcomponent is Gaussian and that the subcomponents are statistically independent from each other. ICA was invented by Jeanny Hérault and Christian Jutten in 1985. ICA is a special case of blind source separation. A common example application of ICA is the "cocktail party problem" of listening in on one person's speech in a noisy room. == Introduction == Independent component analysis attempts to decompose a multivariate signal into independent non-Gaussian signals. As an example, sound is usually a signal that is composed of the numerical addition, at each time t, of signals from several sources. The question then is whether it is possible to separate these contributing sources from the observed total signal. When the statistical independence assumption is correct, blind ICA separation of a mixed signal gives very good results. It is also used for signals that are not supposed to be generated by mixing for analysis purposes. A simple application of ICA is the "cocktail party problem", where the underlying speech signals are separated from a sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. Note that a filtered and delayed signal is a copy of a dependent component, and thus the statistical independence assumption is not violated. Mixing weights for constructing the M {\textstyle M} observed signals from the N {\textstyle N} components can be placed in an M × N {\textstyle M\times N} matrix. An important thing to consider is that if N {\textstyle N} sources are present, at least N {\textstyle N} observations (e.g. microphones if the observed signal is audio) are needed to recover the original signals. When there are an equal number of observations and source signals, the mixing matrix is square ( M = N {\textstyle M=N} ). Other cases of underdetermined ( M < N {\textstyle M N {\textstyle M>N} ) have been investigated. The success of ICA separation of mixed signals relies on two assumptions and three effects of mixing source signals. Two assumptions: The source signals are independent of each other. The values in each source signal have non-Gaussian distributions. Three effects of mixing source signals: Independence: As per assumption 1, the source signals are independent; however, their signal mixtures are not. This is because the signal mixtures share the same source signals. Normality: According to the Central Limit Theorem, the distribution of a sum of independent random variables with finite variance tends towards a Gaussian distribution.Loosely speaking, a sum of two independent random variables usually has a distribution that is closer to Gaussian than any of the two original variables. Here we consider the value of each signal as the random variable. Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal. Those principles contribute to the basic establishment of ICA. If the signals extracted from a set of mixtures are independent and have non-Gaussian distributions or have low complexity, then they must be source signals. Another common example is image steganography, where ICA is used to embed one image within another. For instance, two grayscale images can be linearly combined to create mixed images in which the hidden content is visually imperceptible. ICA can then be used to recover the original source images from the mixtures. This technique underlies digital watermarking, which allows the embedding of ownership information into images, as well as more covert applications such as undetected information transmission. The method has even been linked to real-world cyberespionage cases. In such applications, ICA serves to unmix the data based on statistical independence, making it possible to extract hidden components that are not apparent in the observed data. Steganographic techniques, including those potentially involving ICA-based analysis, have been used in real-world cyberespionage cases. In 2010, the FBI uncovered a Russian spy network known as the "Illegals Program" (Operation Ghost Stories), where agents used custom-built steganography tools to conceal encrypted text messages within image files shared online. In another case, a former General Electric engineer, Xiaoqing Zheng, was convicted in 2022 for economic espionage. Zheng used steganography to exfiltrate sensitive turbine technology by embedding proprietary data within image files for transfer to entities in China. == Defining component independence == ICA finds the independent components (also called factors, latent variables or sources) by maximizing the statistical independence of the estimated components. We may choose one of many ways to define a proxy for independence, and this choice governs the form of the ICA algorithm. The two broadest definitions of independence for ICA are Minimization of mutual information Maximization of non-Gaussianity The Minimization-of-Mutual information (MMI) family of ICA algorithms uses measures like Kullback-Leibler Divergence and maximum entropy. The non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis and negentropy. Typical algorithms for ICA use centering (subtract the mean to create a zero mean signal), whitening (usually with the eigenvalue decomposition), and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. == Mathematical definitions == Linear independent component analysis can be divided into noiseless and noisy cases, where noiseless ICA is a special case of noisy ICA. Nonlinear ICA should be considered as a separate case. === General Derivation === In the classical ICA model, it is assumed that the observed data x i ∈ R m {\displaystyle \mathbf {x} _{i}\in \mathbb {R} ^{m}} at time t i {\displaystyle t_{i}} is generated from source signals s i ∈ R m {\displaystyle \mathbf {s} _{i}\in \mathbb {R} ^{m}} via a linear transformation x i = A s i {\displaystyle \mathbf {x} _{i}=A\mathbf {s} _{i}} , where A {\displaystyle A} is an unknown, invertible mixing matrix. To recover the source signals, the data is first centered (zero mean), and then whitened so that the transformed data has unit covariance. This whitening reduces the problem from estimating a general matrix A {\displaystyle A} to estimating an orthogonal matrix V {\displaystyle V} , significantly simplifying the search for independent components. If the covariance matrix of the centered data is Σ x = A A ⊤ {\displaystyle \Sigma _{x}=AA^{\top }} , then using the eigen-decomposition Σ x = Q D Q ⊤ {\displaystyle \Sigma _{x}=QDQ^{\top }} , the whitening transformation can be taken as D − 1 / 2 Q ⊤ {\displaystyle D^{-1/2}Q^{\top }} . This step ensures that the recovered sources are uncorrelated and of unit variance, leaving only the task of rotating the whitened data to maximize statistical independence. This general derivation underlies many ICA algorithms and is foundational in understanding the ICA model. ==== Reduced Mixing Problem ==== Independent component analysis (ICA) addresses the problem of recovering a set of unobserved source signals s i = ( s i 1 , s i 2 , … , s i m ) T {\displaystyle s_{i}=(s_{i1},s_{i2},\dots ,s_{im})^{T}} from observed mixed signals x i = ( x i 1 , x i 2 , … , x i m ) T {\displaystyle x_{i}=(x_{i1},x_{i2},\dots ,x_{im})^{T}} , based on the linear mixing model: x i = A s i , {\displaystyle x_{i}=A\,s_{i},} where the A {\displaystyle A} is an m × m {\displaystyle m\times m} invertible matrix called the mixing matrix, s i {\displaystyle s_{i}} represents the m‑dimensional vector containing the values of the sources at time t i {\displaystyle t_{i}} , and x i {\displaystyle x_{i}} is the corresponding vector of observed values at time t i {\displaystyle t_{i}} . The goal is to estimate both A {\displaystyle A} and the source signals { s i } {\displaystyle \{s_{i}\}} solely from the observed data { x i } {\displaystyle \{x_{i}\}} . After centering, the Gram matrix is computed as: ( X ∗ ) T X ∗ = Q D Q T , {\displaystyle (X^{})^{T}X^{}=Q\,D\,Q^{T},} where D is a diagonal matrix with positive entries (assuming X ∗ {\displaystyle X^{}} has maximum rank), and Q is an orthogonal matrix. Writing the SVD of the mixing matrix A = U Σ V T {\displaystyle A=U\Sigma V^{T}} and comparing with A A T = U Σ 2 U T {\displaystyle AA^{T}=U\Sigma ^{2}U^{T}} the mixing A has the form A = Q D 1 / 2 V T . {\displaystyle A=Q\,D^{1/2}\,V^{T}.} So, the normalized source values satisfy s i ∗ = V y i ∗ {\displaystyle s_{i}^{}=V\,y_{i}^{}} , where y i ∗ = D − 1 2 Q T x i ∗ . {\displaystyle y_{i}^{}=D^{-{\tfrac {1}{2}}}Q^{T}x_{i}^{}.} Thus, ICA reduces

    Read more →
  • Language identification

    Language identification

    In natural language processing, language identification or language guessing is the problem of determining which natural language a given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods. == Overview == === Logical approach === A common non-statistical intuitive approach (though highly uncertain) is to look for common letter combinations, or distinctive diacritics or punctuation. === Statistical approach === There are several statistical approaches to language identification. An older statistical method by Grefenstette was based on the frequency of short n-grams, which are often function morphemes. For example, "ing" is more common in English than in French, while the sequence "que" is more common in French. Given a new page found on the Web, one counts the number of occurrences of each such short sequence and picks the language whose frequency table it matches the most. One technique is to compare the compressibility of the text to the compressibility of texts in a set of known languages. This approach is known as mutual information based distance measure. The same technique can also be used to empirically construct family trees of languages which closely correspond to the trees constructed using historical methods. Mutual information based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered to be either novel or better than simpler techniques. Another technique, as described by Cavnar and Trenkle (1994) and Dunning (1994) is to create a language n-gram model from a "training text" for each of the languages. These models can be based on characters (Cavnar and Trenkle) or encoded bytes (Dunning); in the latter, language identification and character encoding detection are integrated. Then, for any piece of text needing to be identified, a similar model is made, and that model is compared to each stored language model. The most likely language is the one with the model that is most similar to the model from the text needing to be identified. This approach can be problematic when the input text is in a language for which there is no model. In that case, the method may return another, "most similar" language as its result. Also problematic for any approach are pieces of input text that are composed of several languages, as is common on the Web. As of 2025, a commonly used baseline method is via the fastText library, which has comparable classification accuracy as deep learning techniques, but much faster. == Identifying similar languages == One of the great bottlenecks of language identification systems is to distinguish between closely related languages. Similar languages like Bulgarian and Macedonian or Indonesian and Malay present significant lexical and structural overlap, making it challenging for systems to discriminate between them. In 2014 the DSL shared task has been organized providing a dataset (Tan et al., 2014) containing 13 different languages (and language varieties) in six language groups: Group A (Bosnian, Croatian, Serbian), Group B (Indonesian, Malaysian), Group C (Czech, Slovak), Group D (Brazilian Portuguese, European Portuguese), Group E (Peninsular Spanish, Argentine Spanish), Group F (American English, British English). The best system reached performance of over 95% results (Goutte et al., 2014). Results of the DSL shared task are described in Zampieri et al. 2014. == Software == Apache OpenNLP includes char n-gram based statistical detector and comes with a model that can distinguish 103 languages Apache Tika contains a language detector for 18 languages

    Read more →
  • Targeted maximum likelihood estimation

    Targeted maximum likelihood estimation

    Targeted Maximum Likelihood Estimation (TMLE) (also more accurately referred to as Targeted Minimum Loss-Based Estimation) is a general statistical estimation framework for causal inference and semiparametric models. TMLE combines ideas from maximum likelihood estimation, semiparametric efficiency theory, and machine learning. It was introduced by Mark J. van der Laan and colleagues in the mid-2000s as a method that yields asymptotically efficient plug-in estimators while allowing the use of flexible, data-adaptive algorithms such as ensemble machine learning for nuisance parameter estimation. TMLE is used in epidemiology, biostatistics, and the social sciences to estimate causal effects in observational and experimental studies. Applications of TMLE include Longitudinal TMLE (LTMLE) for time-varying treatments and confounders. Variations in how the targeting step in TMLE is carried out have resulted in various versions of TMLE such as Collaborative TMLE (CTMLE) and Adaptive TMLE for improved finite-sample performance and automated variable selection. == History == The TMLE framework was first described by van der Laan and Rubin (2006) as a general approach for the construction of efficient plug-in estimators of smooth features of the data density. It was demonstrated in the context of causal inference and missing data problems. It was developed to address limitations of traditional doubly robust methods, such as Augmented Inverse Probability Weighting (AIPW), by respecting the plug-in principle in the sense that it respects that the target parameter is a function of the data density that is an element of the statistical model. TMLE estimates the data density or relevant parts of it with machine learning and targets these machine learning fits before it is plugged in the target parameter mapping. In this manner, a TMLE always respects global knowledge and satisfies known bounds such as that the target parameter is a probability . Since its introduction, TMLE has been developed in a series of theoretical and applied papers, culminating in book-length treatments of the method and its applications to survival analysis, adaptive designs, and longitudinal data. == Methodology == At its core, TMLE is a two-step estimation procedure: Initial estimation: Machine learning methods (such as the Super Learner ensemble) are used to obtain flexible estimates of nuisance parameters, such as outcome regressions and propensity scores. Targeting step: The initial estimate is updated by solving a score equation (the efficient influence function) so that the final estimator is consistent, asymptotically normal, and efficient under mild regularity conditions. The targeted machine learning fit is then mapped into the corresponding estimator of the target parameter by simply plugging it in the target parameter mapping. This approach balances the bias–variance trade-off by combining data-adaptive estimation with semiparametric efficiency theory. TMLE is doubly robust, meaning it remains consistent if either the outcome model or the treatment model is consistently estimated. === Formula === Here we explain the TMLE of the average treatment effect of a binary treatment on an outcome adjusting for baseline covariates. Consider i.i.d. observations O i = ( W i , A i , Y i ) {\displaystyle O_{i}=(W_{i},A_{i},Y_{i})} from a distribution P 0 {\displaystyle P_{0}} , where W {\displaystyle W} are baseline covariates, A {\displaystyle A} is a binary treatment, and Y {\displaystyle Y} is an outcome. Let Q ¯ ( a , w ) = E [ Y ∣ A = a , W = w ] {\displaystyle {\bar {Q}}(a,w)=\mathbb {E} [Y\mid A=a,W=w]} represent the outcome model and g ( a ∣ w ) = P ( A = a ∣ W = w ) {\displaystyle g(a\mid w)=P(A=a\mid W=w)} represent the propensity score. The average treatment effect (ATE) is given by ψ 0 = E { Q ¯ ( 1 , W ) − Q ¯ ( 0 , W ) } . {\displaystyle \psi _{0}=\mathbb {E} \{{\bar {Q}}(1,W)-{\bar {Q}}(0,W)\}.} A basic TMLE for the ATE proceeds as follows: Step 1: Estimate initial models. Obtain estimates Q ¯ ^ ( a , w ) {\displaystyle {\hat {\bar {Q}}}(a,w)} and g ^ ( a ∣ w ) {\displaystyle {\hat {g}}(a\mid w)} , often using flexible methods such as Super Learner. Step 2: Compute the clever covariate. Define: H ( A , W ) = A g ^ ( 1 ∣ W ) − 1 − A g ^ ( 0 ∣ W ) . {\displaystyle H(A,W)={\frac {A}{{\hat {g}}(1\mid W)}}-{\frac {1-A}{{\hat {g}}(0\mid W)}}.} Step 3: Estimate the fluctuation parameter. Fit a logistic regression of Y {\displaystyle Y} on H ( A , W ) {\displaystyle H(A,W)} with logit ⁡ ( Q ¯ ^ ( A , W ) ) {\displaystyle \operatorname {logit} ({\hat {\bar {Q}}}(A,W))} as offset. This yields ε ^ {\displaystyle {\hat {\varepsilon }}} , the MLE that solves the score equation: 1 n ∑ i = 1 n H ( A i , W i ) { Y i − Q ¯ ^ ε ( A i , W i ) } = 0. {\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}H(A_{i},W_{i}){\big \{}Y_{i}-{\hat {\bar {Q}}}^{\varepsilon }(A_{i},W_{i}){\big \}}=0.} Step 4: Update the initial estimate. Apply the "blip" to obtain the targeted estimate: Q ¯ ^ ∗ ( A , W ) = expit ⁡ ( logit ⁡ ( Q ¯ ^ ( A , W ) ) + ε ^ H ( A , W ) ) . {\displaystyle {\hat {\bar {Q}}}^{}(A,W)=\operatorname {expit} {\Big (}\operatorname {logit} {\big (}{\hat {\bar {Q}}}(A,W){\big )}+{\hat {\varepsilon }}\,H(A,W){\Big )}.} Step 5: Compute the TMLE. The ATE estimate is: ψ ^ TMLE = 1 n ∑ i = 1 n [ Q ¯ ^ ∗ ( 1 , W i ) − Q ¯ ^ ∗ ( 0 , W i ) ] . {\displaystyle {\hat {\psi }}_{\text{TMLE}}={\frac {1}{n}}\sum _{i=1}^{n}{\big [}{\hat {\bar {Q}}}^{}(1,W_{i})-{\hat {\bar {Q}}}^{}(0,W_{i}){\big ]}.} Inference. The efficient influence function (EIF) for the ATE is: D ∗ ( O ) = H ( A , W ) { Y − Q ¯ ∗ ( A , W ) } + Q ¯ ∗ ( 1 , W ) − Q ¯ ∗ ( 0 , W ) − ψ . {\displaystyle D^{}(O)=H(A,W)\{Y-{\bar {Q}}^{}(A,W)\}+{\bar {Q}}^{}(1,W)-{\bar {Q}}^{}(0,W)-\psi .} The variance is estimated by σ ^ 2 = n − 1 ∑ i = 1 n ( D ∗ ( O i ) ) 2 {\displaystyle {\hat {\sigma }}^{2}=n^{-1}\sum _{i=1}^{n}{\big (}D^{}(O_{i}){\big )}^{2}} , yielding Wald-type confidence intervals ψ ^ TMLE ± z 1 − α / 2 σ ^ / n {\displaystyle {\hat {\psi }}_{\text{TMLE}}\pm z_{1-\alpha /2}\,{\hat {\sigma }}/{\sqrt {n}}} . Remark. For continuous outcomes, a linear fluctuation Q ¯ ^ ∗ = Q ¯ ^ + ε ^ H {\displaystyle {\hat {\bar {Q}}}^{}={\hat {\bar {Q}}}+{\hat {\varepsilon }}\,H} may be used instead. For bounded continuous outcomes, the logistic fluctuation (after rescaling Y {\displaystyle Y} to [ 0 , 1 ] {\displaystyle [0,1]} ) is often preferred for improved finite-sample performance. == Applications == TMLE has been applied in: Epidemiology: Estimating causal effects of exposures and interventions in observational cohort studies. Clinical trials and real-world evidence: The Targeted Learning roadmap provides a structured framework for generating and validating real-world evidence (RWE), bridging randomized trials and observational data using TMLE and related estimation techniques. This approach enables transparency, sensitivity analysis, and stronger causal inference for regulatory and clinical trial contexts. High-dimensional settings: Integration with ensemble methods for causal effect estimation. TMLE has been successfully applied in pharmacoepidemiology where a large number of covariates are automatically selected to adjust for confounding. In a study of post–myocardial infarction statin use and 1-year mortality, TMLE demonstrated robust performance relative to inverse probability weighting in scenarios with hundreds of potential confounders. == Derivatives and extensions == Longitudinal TMLE (LTMLE): A methodological extension of TMLE for longitudinal data with time-varying treatments, confounders, and censoring. It allows the estimation of dynamic treatment regimes and intervention-specific causal effects over time. This framework was originally introduced by van der Laan & Gruber (2012). Collaborative TMLE (CTMLE): Enhances finite-sample performance and variable selection by collaboratively fitting the treatment mechanism in conjunction with the target parameter. == Software == Several R packages implement TMLE and related methods: tmle: Functions for binary, categorical, and continuous outcomes. ltmle: Implementation for longitudinal data with time-varying treatments and outcomes. ctmle: Algorithms for collaborative TMLE and adaptive variable selection. SuperLearner: A theoretically grounded, cross-validated ensemble learning method that combines predictions from multiple algorithms to minimize predictive risk. Widely used in TMLE for estimating nuisance parameters. The original implementation is available as the R package SuperLearner. Recent machine learning platforms like H2O AutoML implement similar ensemble strategies, combining diverse learners in parallel and leveraging stacking and blending techniques, effectively functioning as a large-scale Super Learner.

    Read more →
  • Multi expression programming

    Multi expression programming

    Multi Expression Programming (MEP) is an evolutionary algorithm for generating mathematical functions describing a given set of data. MEP is a Genetic Programming variant encoding multiple solutions in the same chromosome. MEP representation is not specific (multiple representations have been tested). In the simplest variant, MEP chromosomes are linear strings of instructions. This representation was inspired by Three-address code. MEP strength consists in the ability to encode multiple solutions, of a problem, in the same chromosome. In this way, one can explore larger zones of the search space. For most of the problems this advantage comes with no running-time penalty compared with genetic programming variants encoding a single solution in a chromosome. == Representation == MEP chromosomes are arrays of instructions represented in Three-address code format. Each instruction contains a variable, a constant, or a function. If the instruction is a function, then the arguments (given as instruction's addresses) are also present. === Example of MEP program === Here is a simple MEP chromosome (labels on the left side are not a part of the chromosome): 1: a 2: b 3: + 1, 2 4: c 5: d 6: + 4, 5 7: 3, 5 == Fitness computation == When the chromosome is evaluated it is unclear which instruction will provide the output of the program. In many cases, a set of programs is obtained, some of them being completely unrelated (they do not have common instructions). For the above chromosome, here is the list of possible programs obtained during decoding: E1 = a, E2 = b, E4 = c, E5 = d, E3 = a + b. E6 = c + d. E7 = (a + b) d. Each instruction is evaluated as a possible output of the program. The fitness (or error) is computed in a standard manner. For instance, in the case of symbolic regression, the fitness is the sum of differences (in absolute value) between the expected output (called target) and the actual output. == Fitness assignment process == Which expression will represent the chromosome? Which one will give the fitness of the chromosome? In MEP, the best of them (which has the lowest error) will represent the chromosome. This is different from other GP techniques: In Linear genetic programming the last instruction will give the output. In Cartesian Genetic Programming the gene providing the output is evolved like all other genes. Note that, for many problems, this evaluation has the same complexity as in the case of encoding a single solution in each chromosome. Thus, there is no penalty in running time compared to other techniques. == Software == === MEPX === MEPX is a cross-platform (Windows, macOS, and Linux Ubuntu) free software for the automatic generation of computer programs. It can be used for data analysis, particularly for solving symbolic regression, statistical classification and time-series problems. === libmep === Libmep is a free and open source library implementing Multi Expression Programming technique. It is written in C++. === hmep === hmep is a new open source library implementing Multi Expression Programming technique in Haskell programming language.

    Read more →
  • Rectified linear unit

    Rectified linear unit

    In the context of artificial neural networks, the rectifier or ReLU (rectified linear unit) activation function is an activation function defined as the non-negative part of its argument, i.e., the ramp function: ReLU ⁡ ( x ) = x + = max ( 0 , x ) = x + | x | 2 = { x if x > 0 , 0 x ≤ 0 {\displaystyle \operatorname {ReLU} (x)=x^{+}=\max(0,x)={\frac {x+|x|}{2}}={\begin{cases}x&{\text{if }}x>0,\\0&x\leq 0\end{cases}}} where x {\displaystyle x} is the input to a neuron. This is analogous to half-wave rectification in electrical engineering. ReLU is one of the most popular activation functions for artificial neural networks, and finds application in computer vision and speech recognition using deep neural nets and computational neuroscience. == History == The ReLU was first used by Alston Householder in 1941 as a mathematical abstraction of biological neural networks. Kunihiko Fukushima in 1969 used ReLU in the context of visual feature extraction in hierarchical neural networks. In 1998, Gregory Woodbury demonstrated that the rectified linear function could account for a broad range of emergent properties in the visual cortex. His work showed that a single unified model could drive the joint development of refined retinotopic maps, ocular dominance columns, and orientation selectivity. By utilizing the rectifier's "cutoff" property, Woodbury achieved a close quantitative fit to biological data, matching the spatial periodicities and topographic refinement patterns observed in macaque and cat cortical maps. Furthermore, he extended this framework to adult plasticity, accurately replicating the spatial and temporal dynamics of lesion-induced cortical reorganization. This research established that the rectified linear response was a necessary mechanism for the stable self-organisation and maintenance of complex, multi-feature neural maps. In 2000, Hahnloser et al. argued that ReLU approximates the biological relationship between neural firing rates and input current, in addition to enabling recurrent neural network dynamics to stabilise under weaker criteria. Prior to 2010, most activation functions used were the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more numerically efficient counterpart, the hyperbolic tangent. Around 2010, the use of ReLU became common again. Jarrett et al. (2009) noted that rectification by either absolute or ReLU (which they called "positive part") was critical for object recognition in convolutional neural networks (CNNs), specifically because it allows average pooling without neighboring filter outputs cancelling each other out. They hypothesized that the use of sigmoid or tanh was responsible for poor performance in previous CNNs. Nair and Hinton (2010) made a theoretical argument that the softplus activation function should be used, in that the softplus function numerically approximates the sum of an exponential number of linear models that share parameters. They then proposed ReLU as a good approximation to it. Specifically, they began by considering a single binary neuron in a Boltzmann machine that takes x {\displaystyle x} as input, and produces 1 as output with probability σ ( x ) = 1 1 + e − x {\displaystyle \sigma (x)={\frac {1}{1+e^{-x}}}} . They then considered extending its range of output by making infinitely many copies of it X 1 , X 2 , X 3 , … {\displaystyle X_{1},X_{2},X_{3},\dots } , that all take the same input, offset by an amount 0.5 , 1.5 , 2.5 , … {\displaystyle 0.5,1.5,2.5,\dots } , then their outputs are added together as ∑ i = 1 ∞ X i {\displaystyle \sum _{i=1}^{\infty }X_{i}} . They then demonstrated that ∑ i = 1 ∞ X i {\displaystyle \sum _{i=1}^{\infty }X_{i}} is approximately equal to N ( log ⁡ ( 1 + e x ) , σ ( x ) ) {\displaystyle {\mathcal {N}}(\log(1+e^{x}),\sigma (x))} , which is also approximately equal to ReLU ⁡ ( N ( x , σ ( x ) ) ) {\displaystyle \operatorname {ReLU} ({\mathcal {N}}(x,\sigma (x)))} , where N {\displaystyle {\mathcal {N}}} stands for the gaussian distribution. They also argued for another reason for using ReLU: that it allows "intensity equivariance" in image recognition. That is, multiplying input image by a constant k {\displaystyle k} multiplies the output also. In contrast, this is false for other activation functions like sigmoid or tanh. They found that ReLU activation allowed good empirical performance in restricted Boltzmann machines. Glorot et al (2011) argued that ReLU has the following advantages over sigmoid or tanh: ReLU is more similar to biological neurons' responses in their main operating regime. ReLU avoids vanishing gradients. ReLU is cheaper to compute. ReLU creates sparse representation naturally, because many hidden units output exactly zero for a given input. They also found empirically that deep networks trained with ReLU can achieve strong performance without unsupervised pre-training, especially on large, purely supervised tasks. In 2017, the rectified linear function became a central component of the transformer architecture introduced in the Vaswani et al paper "Attention Is All You Need". Within every transformer layer, ReLU is utilized in the position-wise feed-forward networks (FFN), defined by Equation 2 of their paper: FFN ⁡ ( x ) = max ( 0 , x W 1 + b 1 ) W 2 + b 2 {\displaystyle \operatorname {FFN} (x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2}} This equation is foundational to the model's capacity; while the attention mechanism determines the relationships between tokens, the ReLU-based FFN performs the majority of the numerical computation and houses the bulk of the model's parameters. The efficiency and scalability of this rectified framework triggered a global technological revolution, enabling the development of Large Language Models that have had a profound economic impact. The industrial response to this architecture—including the massive expansion of AI-specific hardware and the birth of the generative AI sector—has positioned the Transformer as a cornerstone of 21st-century infrastructure. During the post 2017 period of rapid AI advancement, the rectified linear unit function has been key to achieving increased model performance and scaling due to the fact that it zeros out responses that are immaterial for a given stimuli, preventing them from accumulating in massive scale models. It is the complete silencing of the parts of the model found to be stimuli-irrelevant during learning that allows for scaling. As the stimuli-irrelevant proportion of the model becomes more massive, these highly numerous connections within the model would inevitably accumulate during scaling no matter how small each individual response is. Therefore, the rectified linear unit function, with its absolute zeroing property, enabled the scaling to hundred billion parameter models and beyond. Early Transformer scaling giants like GPT-3 (2020) and Falcon-180B (2023) relied on the rectified linear unit function explicitly, while successors such as GPT-4 (2023) and Llama 3 (2024) utilized smoother variants like GELU or SwiGLU. These variants were used to improve training stability while fundamentally preserving the rectified principle of zeroing low responses. At the centre of modern artificial intelligence ReLU and its variants maintain absolute zero response across the bulk of the model at any one time, while maintaining approximately linear reponses for stimuli-relevant connections enabling high performance on each specific cognitive task. This feature of activation sparsity has been critical for massive scaling and performance gains of AI models right up to the present day. == Advantages == Advantages of ReLU include: Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (i.e. have a non-zero output). Better gradient propagation: fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions. Efficiency: only requires comparison and addition. Scale-invariant (homogeneous, or "intensity equivariance"): max ( 0 , a x ) = a max ( 0 , x ) for a ≥ 0 {\displaystyle \max(0,ax)=a\max(0,x){\text{ for }}a\geq 0} . == Potential problems == Possible downsides can include: Non-differentiability at zero (however, it is differentiable anywhere else, and the value of the derivative at zero can be chosen to be 0 or 1 arbitrarily). Not zero-centered: ReLU outputs are always non-negative. This can make it harder for the network to learn during backpropagation, because gradient updates tend to push weights in one direction (positive or negative). Batch normalization can help address this. ReLU is unbounded. Redundancy of the parametrization: Because ReLU is scale-invariant, the network computes the exact same function by scaling the weights and biases in front of a ReLU activation by k {\displaystyle k} , and the weights after by 1 / k {\displaystyle 1/k} . Dying ReLU: ReLU neurons can sometimes be pushed into states

    Read more →
  • DUAL table

    DUAL table

    The DUAL table is a special one-row, one-column table present by default in Oracle and other database installations. In Oracle, the table has a single VARCHAR2(1) column called DUMMY that has a value of 'X'. It is suitable for use in selecting a pseudo column such as SYSDATE or USER. == Example use == Oracle's SQL syntax requires the FROM clause but some queries don't require any tables - DUAL can be used in these cases. == History == Charles Weiss explains why he created DUAL: I created the DUAL table as an underlying object in the Oracle Data Dictionary. It was never meant to be seen itself, but instead used inside a view that was expected to be queried. The idea was that you could do a JOIN to the DUAL table and create two rows in the result for every one row in your table. Then, by using GROUP BY, the resulting join could be summarized to show the amount of storage for the DATA extent and for the INDEX extent(s). The name, DUAL, seemed apt for the process of creating a pair of rows from just one. == Optimization == Beginning with 10g Release 1, Oracle no longer performs physical or logical I/O on the DUAL table, though the table still exists. DUAL is readily available for all authorized users in a SQL database. == In other database systems == Several other databases (including Microsoft SQL Server, MySQL, PostgreSQL, SQLite, and Teradata) enable one to omit the FROM clause entirely if no table is needed. This avoids the need for any dummy table. ClickHouse has a one-row system table system.one with a single column named "dummy" of type UInt8 and value 0. This table is implicitly used when no table is specified in the SELECT query. Firebird has a one-row system table RDB$DATABASE that is used in the same way as Oracle's DUAL, although it also has a meaning of its own. IBM Db2 has a view that resolves DUAL when using Oracle Compatibility. It also has a table called sysibm.sysdummy1 that has similar properties to the Oracle DUAL one. Informix: Informix version 11.50 and later has a table named sysmaster:"informix".sysdual with the same functionality but a more verbose name. You can use CREATE PUBLIC SYNONYM dual FOR sysmaster:"informix".sysdual to create a name dual in the current database with the same functionality. Microsoft Access: A table named DUAL may be created and the single-row constraint enforced via ADO (Table-less UNION query in MS Access) Microsoft SQL Server: SQL Server does not require a dummy table. Queries like 'select 1 + 1' can be run without a "from" clause/table name. MySQL allows DUAL to be specified as a table in queries that do not need data from any tables. It is suitable for use in selecting a result function such as SYSDATE() or USER(), although it is not essential. PostgreSQL: A DUAL-view can be added to ease porting from Oracle. Snowflake: DUAL is supported, but not explicitly documented. It appears in sample SQL for other operations in the documentation. SQLite: A VIEW named "dual" that works the same as the Oracle "dual" table can be created as follows: CREATE VIEW dual AS SELECT 'x' AS dummy; SAP HANA has a table called DUMMY that works the same as the Oracle "dual" table. Teradata database does not require a dummy table. Queries like 'select 1 + 1' can be run without a "from" clause/table name. Vertica has support for a DUAL table in their official documentation.

    Read more →
  • Stochastic gradient descent

    Stochastic gradient descent

    Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning. == Background == Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum: Q ( w ) = 1 n ∑ i = 1 n Q i ( w ) , {\displaystyle Q(w)={\frac {1}{n}}\sum _{i=1}^{n}Q_{i}(w),} where the parameter w {\displaystyle w} that minimizes Q ( w ) {\displaystyle Q(w)} is to be estimated. Each summand function Q i {\displaystyle Q_{i}} is typically associated with the i {\displaystyle i} -th observation in the data set (used for training). In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations). The sum-minimization problem also arises for empirical risk minimization. There, Q i ( w ) {\displaystyle Q_{i}(w)} is the value of the loss function at i {\displaystyle i} -th example, and Q ( w ) {\displaystyle Q(w)} is the empirical risk. When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations: w := w − η ∇ Q ( w ) = w − η n ∑ i = 1 n ∇ Q i ( w ) . {\displaystyle w:=w-\eta \,\nabla Q(w)=w-{\frac {\eta }{n}}\sum _{i=1}^{n}\nabla Q_{i}(w).} The step size is denoted by η {\displaystyle \eta } (sometimes called the learning rate in machine learning) and here " := {\displaystyle :=} " denotes the update of a variable in the algorithm. In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations. However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems. == Iterative method == In stochastic (or "on-line") gradient descent, the true gradient of Q ( w ) {\displaystyle Q(w)} is approximated by a gradient at a single sample: w := w − η ∇ Q i ( w ) . {\displaystyle w:=w-\eta \,\nabla Q_{i}(w).} As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges. In pseudocode, stochastic gradient descent can be presented as : A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than "true" stochastic gradient descent described, because the code can make use of vectorization libraries rather than computing each step separately as was first shown in where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples. The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates η {\displaystyle \eta } decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem. == Linear regression == Suppose we want to fit a straight line y ^ = w 1 + w 2 x {\displaystyle {\hat {y}}=w_{1}+w_{2}x} to a training set with observations ( ( x 1 , y 1 ) , ( x 2 , y 2 ) … , ( x n , y n ) ) {\displaystyle ((x_{1},y_{1}),(x_{2},y_{2})\ldots ,(x_{n},y_{n}))} and corresponding estimated responses ( y ^ 1 , y ^ 2 , … , y ^ n ) {\displaystyle ({\hat {y}}_{1},{\hat {y}}_{2},\ldots ,{\hat {y}}_{n})} using least squares. The objective function to be minimized is Q ( w ) = ∑ i = 1 n Q i ( w ) = ∑ i = 1 n ( y ^ i − y i ) 2 = ∑ i = 1 n ( w 1 + w 2 x i − y i ) 2 . {\displaystyle Q(w)=\sum _{i=1}^{n}Q_{i}(w)=\sum _{i=1}^{n}\left({\hat {y}}_{i}-y_{i}\right)^{2}=\sum _{i=1}^{n}\left(w_{1}+w_{2}x_{i}-y_{i}\right)^{2}.} The last line in the above pseudocode for this specific problem will become: [ w 1 w 2 ] ← [ w 1 w 2 ] − η [ ∂ ∂ w 1 ( w 1 + w 2 x i − y i ) 2 ∂ ∂ w 2 ( w 1 + w 2 x i − y i ) 2 ] = [ w 1 w 2 ] − η [ 2 ( w 1 + w 2 x i − y i ) 2 x i ( w 1 + w 2 x i − y i ) ] . {\displaystyle {\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}\leftarrow {\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}{\frac {\partial }{\partial w_{1}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\\{\frac {\partial }{\partial w_{2}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\end{bmatrix}}={\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}2(w_{1}+w_{2}x_{i}-y_{i})\\2x_{i}(w_{1}+w_{2}x_{i}-y_{i})\end{bmatrix}}.} Note that in each iteration or update step, the gradient is only evaluated at a single x i {\displaystyle x_{i}} . This is the key difference between stochastic gradient descent and batched gradient descent. In general, given a linear regression y ^ = ∑ k ∈ 1 : m w k x k {\displaystyle {\hat {y}}=\sum _{k\in 1:m}w_{k}x_{k}} problem, stochastic gradient descent behaves differently when m < n {\displaystyle m

  • Markov blanket

    Markov blanket

    In statistics and machine learning, a Markov blanket of a random variable is a set of variables that renders the variable conditionally independent of all other variables in the system. This concept is central in probabilistic graphical models and feature selection. If a Markov blanket is minimal—meaning that no variable in it can be removed without losing this conditional independence—it is called a Markov boundary. Identifying a Markov blanket or boundary allows for efficient inference and helps isolate relevant variables for prediction or causal reasoning. The terms Markov blanket and Markov boundary were coined by Judea Pearl in 1988. A Markov blanket may be derived from the structure of a probabilistic graphical model such as a Bayesian network or Markov random field. == Definition == A Markov blanket of a random variable Y {\displaystyle Y} in a random variable set S = { X 1 , … , X n } {\displaystyle {\mathcal {S}}=\{X_{1},\ldots ,X_{n}\}} is any subset S 1 {\displaystyle {\mathcal {S}}_{1}} of S {\displaystyle {\mathcal {S}}} , conditioned on which other variables are independent with Y {\displaystyle Y} : Y ⊥ ⊥ S ∖ S 1 ∣ S 1 {\displaystyle Y\perp \!\!\!\perp {\mathcal {S}}\smallsetminus {\mathcal {S}}_{1}\mid {\mathcal {S}}_{1}} It means that S 1 {\displaystyle {\mathcal {S}}_{1}} contains at least all the information one needs to infer Y {\displaystyle Y} , where the variables in S ∖ S 1 {\displaystyle {\mathcal {S}}\smallsetminus {\mathcal {S}}_{1}} are redundant. In general, a given Markov blanket is not unique. Any set in S {\displaystyle {\mathcal {S}}} that contains a Markov blanket is also a Markov blanket itself. Specifically, S {\displaystyle {\mathcal {S}}} is a Markov blanket of Y {\displaystyle Y} in S {\displaystyle {\mathcal {S}}} . === Example === In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and its children's other parents (i.e., co-parents). Knowing the values of these nodes makes the target node conditionally independent of the rest of the network. In a Markov random field, the Markov blanket of a node is simply its immediate neighbors. == Markov condition == The concept of a Markov blanket is rooted in the Markov condition, which states that in a probabilistic graphical model, each variable is conditionally independent of its non-descendants given its parents. This condition implies the existence of a minimal separating set — the Markov blanket — that shields a variable from the rest of the network. For instance, when a person holds an object stationary against gravity, the object’s acceleration is fully determined by its direct causes—namely, the upward force from the hand and the downward gravitational pull. Other variables such as air pressure or temperature are causally irrelevant. == Markov boundary == A Markov boundary of Y {\displaystyle Y} in S {\displaystyle {\mathcal {S}}} is a subset S 2 {\displaystyle {\mathcal {S}}_{2}} of S {\displaystyle {\mathcal {S}}} , such that S 2 {\displaystyle {\mathcal {S}}_{2}} itself is a Markov blanket of Y {\displaystyle Y} , but any proper subset of S 2 {\displaystyle {\mathcal {S}}_{2}} is not a Markov blanket of Y {\displaystyle Y} . In other words, a Markov boundary is a minimal Markov blanket. The Markov boundary of a node A {\displaystyle A} in a Bayesian network is the set of nodes composed of A {\displaystyle A} 's parents, A {\displaystyle A} 's children, and A {\displaystyle A} 's children's other parents. In a Markov random field, the Markov boundary for a node is the set of its neighboring nodes. In a dependency network, the Markov boundary for a node is the set of its parents. === Uniqueness of Markov boundary === The Markov boundary always exists. Under some mild conditions, the Markov boundary is unique. However, for most practical and theoretical scenarios multiple Markov boundaries may provide alternative solutions. When there are multiple Markov boundaries, quantities measuring causal effect could fail. == In cognitive science == In the study of consciousness, brain function, and complex adaptive systems, Markov blankets are proposed as a mathematical mechanism which delimits the extent of cognitive entities, whether it be physical or causal.

    Read more →