AI Content Quiz

AI Content Quiz — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Couch to 5K

    Couch to 5K

    Couch to 5K, abbreviated C25K, is an exercise plan that gradually progresses from beginner running toward a 5 kilometre (3.1 mile) run over nine weeks. == Operations == The Couch to 5K running plan, also known as C25K, created by Josh Clark in 1996, was developed with the expectation of creating a plan for new runners to start running. The plan is aimed to have users work out for 20 to 30 minutes, three days a week. Within the program, users can be expected to perform different tasks such as intervals of running with period of short walks in between to help build endurance in the weeks up to the final goal of a 5K run. During the nine weeks leading up to the race, the runner will learn to set their own pace and where their strengths and weaknesses are within running. Often, the daily workouts start with a five-minute warm-up walk and works up to running five kilometres without a walking break within nine weeks. Users are not expected to have any experience in running and can be some of the first running that they ever do. The main goal is to turn that unexperienced runner into someone who can run a 5K. Clark started the website Kick and featured C25K on the site. In 2001, Kick merged with Cool Running, a New England–based running site. Clark later sold his stake in Cool Running and the Couch to 5K program. Cool Running was absorbed into Active.com, operated by Active Network, LLC. Active Network provides mobile apps for Couch to 5K, as well as 5K to 10K, a follow-up program. The NHS in the UK provides downloadable podcasts and a smartphone app (Android and iOS) for the plan. A mobile app, created by Zen Labs, has training plans that are based on the Couch to 5K running plan from CoolRunning.com. It is one of the highest-rated health and fitness apps available on Android and iOS. As of 2016, the C25K app has been used by over 5 million people.

    Read more →
  • Genetic representation

    Genetic representation

    In computer programming, genetic representation is a way of presenting solutions/individuals in evolutionary computation methods. The term encompasses both the concrete data structures and data types used to realize the genetic material of the candidate solutions in the form of a genome, and the relationships between search space and problem space. In the simplest case, the search space corresponds to the problem space (direct representation). The choice of problem representation is tied to the choice of genetic operators, both of which have a decisive effect on the efficiency of the optimization. Genetic representation can encode appearance, behavior, physical qualities of individuals. Difference in genetic representations is one of the major criteria drawing a line between known classes of evolutionary computation. Terminology is often analogous with natural genetics. The block of computer memory that represents one candidate solution is called an individual. The data in that block is called a chromosome. Each chromosome consists of genes. The possible values of a particular gene are called alleles. A programmer may represent all the individuals of a population using binary encoding, permutational encoding, encoding by tree, or any one of several other representations. == Representations in some popular evolutionary algorithms == Genetic algorithms (GAs) are typically linear representations; these are often, but not always, binary. Holland's original description of GA used arrays of bits. Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size. This facilitates simple crossover operation. Depending on the application, variable-length representations have also been successfully used and tested in evolutionary algorithms (EA) in general and genetic algorithms in particular, although the implementation of crossover is more complex in this case. Evolution strategy uses linear real-valued representations, e.g., an array of real values. It uses mostly gaussian mutation and blending/averaging crossover. Genetic programming (GP) pioneered tree-like representations and developed genetic operators suitable for such representations. Tree-like representations are used in GP to represent and evolve functional programs with desired properties. Human-based genetic algorithm (HBGA) offers a way to avoid solving hard representation problems by outsourcing all genetic operators to outside agents, in this case, humans. The algorithm has no need for knowledge of a particular fixed genetic representation as long as there are enough external agents capable of handling those representations, allowing for free-form and evolving genetic representations. === Common genetic representations === binary array integer or real-valued array binary tree natural language parse tree directed graph == Distinction between search space and problem space == Analogous to biology, EAs distinguish between problem space (corresponds to phenotype) and search space (corresponds to genotype). The problem space contains concrete solutions to the problem being addressed, while the search space contains the encoded solutions. The mapping from search space to problem space is called genotype-phenotype mapping. The genetic operators are applied to elements of the search space, and for evaluation, elements of the search space are mapped to elements of the problem space via genotype-phenotype mapping. == Relationships between search space and problem space == The importance of an appropriate choice of search space for the success of an EA application was recognized early on. The following requirements can be placed on a suitable search space and thus on a suitable genotype-phenotype mapping: === Completeness === All possible admissible solutions must be contained in the search space. === Redundancy === When more possible genotypes exist than phenotypes, the genetic representation of the EA is called redundant. In nature, this is termed a degenerate genetic code. In the case of a redundant representation, neutral mutations are possible. These are mutations that change the genotype but do not affect the phenotype. Thus, depending on the use of the genetic operators, there may be phenotypically unchanged offspring, which can lead to unnecessary fitness determinations, among other things. Since the evaluation in real-world applications usually accounts for the lion's share of the computation time, it can slow down the optimization process. In addition, this can cause the population to have higher genotypic diversity than phenotypic diversity, which can also hinder evolutionary progress. In biology, the Neutral Theory of Molecular Evolution states that this effect plays a dominant role in natural evolution. This has motivated researchers in the EA community to examine whether neutral mutations can improve EA functioning by giving populations that have converged to a local optimum a way to escape that local optimum through genetic drift. This is discussed controversially and there are no conclusive results on neutrality in EAs. On the other hand, there are other proven measures to handle premature convergence. === Locality === The locality of a genetic representation corresponds to the degree to which distances in the search space are preserved in the problem space after genotype-phenotype mapping. That is, a representation has a high locality exactly when neighbors in the search space are also neighbors in the problem space. In order for successful schemata not to be destroyed by genotype-phenotype mapping after a minor mutation, the locality of a representation must be high. === Scaling === In genotype-phenotype mapping, the elements of the genotype can be scaled (weighted) differently. The simplest case is uniform scaling: all elements of the genotype are equally weighted in the phenotype. A common scaling is exponential. If integers are binary coded, the individual digits of the resulting binary number have exponentially different weights in representing the phenotype. Example: The number 90 is written in binary (i.e., in base two) as 1011010. If now one of the front digits is changed in the binary notation, this has a significantly greater effect on the coded number than any changes at the rear digits (the selection pressure has an exponentially greater effect on the front digits). For this reason, exponential scaling has the effect of randomly fixing the "posterior" locations in the genotype before the population gets close enough to the optimum to adjust for these subtleties. == Hybridization and repair in genotype-phenotype mapping == When mapping the genotype to the phenotype being evaluated, domain-specific knowledge can be used to improve the phenotype and/or ensure that constraints are met. This is a commonly used method to improve EA performance in terms of runtime and solution quality. It is illustrated below by two of the three examples. == Examples == === Example of a direct representation === An obvious and commonly used encoding for the traveling salesman problem and related tasks is to number the cities to be visited consecutively and store them as integers in the chromosome. The genetic operators must be suitably adapted so that they only change the order of the cities (genes) and do not cause deletions or duplications. Thus, the gene order corresponds to the city order and there is a simple one-to-one mapping. === Example of a complex genotype-phenotype mapping === In a scheduling task with heterogeneous and partially alternative resources to be assigned to a set of subtasks, the genome must contain all necessary information for the individual scheduling operations or it must be possible to derive them from it. In addition to the order of the subtasks to be executed, this includes information about the resource selection. A phenotype then consists of a list of subtasks with their start times and assigned resources. In order to be able to create this, as many allocation matrices must be created as resources can be allocated to one subtask at most. In the simplest case this is one resource, e.g., one machine, which can perform the subtask. An allocation matrix is a two-dimensional matrix, with one dimension being the available time units and the other being the resources to be allocated. Empty matrix cells indicate availability, while an entry indicates the number of the assigned subtask. The creation of allocation matrices ensures firstly that there are no inadmissible multiple allocations. Secondly, the start times of the subtasks can be read from it as well as the assigned resources. A common constraint when scheduling resources to subtasks is that a resource can only be allocated once per time unit and that the reservation must be for a contiguous period of time. To achieve this in a timely manner, which is a c

    Read more →
  • Least-squares support vector machine

    Least-squares support vector machine

    Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods. == From support-vector machine to least-squares support-vector machine == Given a training set { x i , y i } i = 1 N {\displaystyle \{x_{i},y_{i}\}_{i=1}^{N}} with input data x i ∈ R n {\displaystyle x_{i}\in \mathbb {R} ^{n}} and corresponding binary class labels y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} , the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions: { w T ϕ ( x i ) + b ≥ 1 , if y i = + 1 , w T ϕ ( x i ) + b ≤ − 1 , if y i = − 1 , {\displaystyle {\begin{cases}w^{T}\phi (x_{i})+b\geq 1,&{\text{if }}\quad y_{i}=+1,\\w^{T}\phi (x_{i})+b\leq -1,&{\text{if }}\quad y_{i}=-1,\end{cases}}} which is equivalent to y i [ w T ϕ ( x i ) + b ] ≥ 1 , i = 1 , … , N , {\displaystyle y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1,\quad i=1,\ldots ,N,} where ϕ ( x ) {\displaystyle \phi (x)} is the nonlinear map from original space to the high- or infinite-dimensional space. === Inseparable data === In case such a separating hyperplane does not exist, we introduce so-called slack variables ξ i {\displaystyle \xi _{i}} such that { y i [ w T ϕ ( x i ) + b ] ≥ 1 − ξ i , i = 1 , … , N , ξ i ≥ 0 , i = 1 , … , N . {\displaystyle {\begin{cases}y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1-\xi _{i},&i=1,\ldots ,N,\\\xi _{i}\geq 0,&i=1,\ldots ,N.\end{cases}}} According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem: min J 1 ( w , ξ ) = 1 2 w T w + c ∑ i = 1 N ξ i , {\displaystyle \min J_{1}(w,\xi )={\frac {1}{2}}w^{T}w+c\sum \limits _{i=1}^{N}\xi _{i},} Subject to { y i [ w T ϕ ( x i ) + b ] ≥ 1 − ξ i , i = 1 , … , N , ξ i ≥ 0 , i = 1 , … , N , {\displaystyle {\text{Subject to }}{\begin{cases}y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1-\xi _{i},&i=1,\ldots ,N,\\\xi _{i}\geq 0,&i=1,\ldots ,N,\end{cases}}} To solve this problem, we could construct the Lagrangian function: L 1 ( w , b , ξ , α , β ) = 1 2 w T w + c ∑ i = 1 N ξ i − ∑ i = 1 N α i { y i [ w T ϕ ( x i ) + b ] − 1 + ξ i } − ∑ i = 1 N β i ξ i , {\displaystyle L_{1}(w,b,\xi ,\alpha ,\beta )={\frac {1}{2}}w^{T}w+c\sum \limits _{i=1}^{N}{\xi _{i}}-\sum \limits _{i=1}^{N}\alpha _{i}\left\{y_{i}\left[{w^{T}\phi (x_{i})+b}\right]-1+\xi _{i}\right\}-\sum \limits _{i=1}^{N}\beta _{i}\xi _{i},} where α i ≥ 0 , β i ≥ 0 ( i = 1 , … , N ) {\displaystyle \alpha _{i}\geq 0,\ \beta _{i}\geq 0\ (i=1,\ldots ,N)} are the Lagrangian multipliers. The optimal point will be in the saddle point of the Lagrangian function, and then we obtain By substituting w {\displaystyle w} by its expression in the Lagrangian formed from the appropriate objective and constraints, we will get the following quadratic programming problem: max Q 1 ( α ) = − 1 2 ∑ i , j = 1 N α i α j y i y j K ( x i , x j ) + ∑ i = 1 N α i , {\displaystyle \max Q_{1}(\alpha )=-{\frac {1}{2}}\sum \limits _{i,j=1}^{N}{\alpha _{i}\alpha _{j}y_{i}y_{j}K(x_{i},x_{j})}+\sum \limits _{i=1}^{N}\alpha _{i},} where K ( x i , x j ) = ⟨ ϕ ( x i ) , ϕ ( x j ) ⟩ {\displaystyle K(x_{i},x_{j})=\left\langle \phi (x_{i}),\phi (x_{j})\right\rangle } is called the kernel function. Solving this QP problem subject to constraints in (1), we will get the hyperplane in the high-dimensional space and hence the classifier in the original space. === Least-squares SVM formulation === The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as min J 2 ( w , b , e ) = μ 2 w T w + ζ 2 ∑ i = 1 N e i 2 , {\displaystyle \min J_{2}(w,b,e)={\frac {\mu }{2}}w^{T}w+{\frac {\zeta }{2}}\sum \limits _{i=1}^{N}e_{i}^{2},} subject to the equality constraints y i [ w T ϕ ( x i ) + b ] = 1 − e i , i = 1 , … , N . {\displaystyle y_{i}\left[{w^{T}\phi (x_{i})+b}\right]=1-e_{i},\quad i=1,\ldots ,N.} The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets y i = ± 1 {\displaystyle y_{i}=\pm 1} . Using y i 2 = 1 {\displaystyle y_{i}^{2}=1} , we have ∑ i = 1 N e i 2 = ∑ i = 1 N ( y i e i ) 2 = ∑ i = 1 N e i 2 = ∑ i = 1 N ( y i − ( w T ϕ ( x i ) + b ) ) 2 , {\displaystyle \sum \limits _{i=1}^{N}e_{i}^{2}=\sum \limits _{i=1}^{N}(y_{i}e_{i})^{2}=\sum \limits _{i=1}^{N}e_{i}^{2}=\sum \limits _{i=1}^{N}\left(y_{i}-(w^{T}\phi (x_{i})+b)\right)^{2},} with e i = y i − ( w T ϕ ( x i ) + b ) . {\displaystyle e_{i}=y_{i}-(w^{T}\phi (x_{i})+b).} Notice, that this error would also make sense for least-squares data fitting, so that the same end results holds for the regression case. Hence the LS-SVM classifier formulation is equivalent to J 2 ( w , b , e ) = μ E W + ζ E D {\displaystyle J_{2}(w,b,e)=\mu E_{W}+\zeta E_{D}} with E W = 1 2 w T w {\displaystyle E_{W}={\frac {1}{2}}w^{T}w} and E D = 1 2 ∑ i = 1 N e i 2 = 1 2 ∑ i = 1 N ( y i − ( w T ϕ ( x i ) + b ) ) 2 . {\displaystyle E_{D}={\frac {1}{2}}\sum \limits _{i=1}^{N}e_{i}^{2}={\frac {1}{2}}\sum \limits _{i=1}^{N}\left(y_{i}-(w^{T}\phi (x_{i})+b)\right)^{2}.} Both μ {\displaystyle \mu } and ζ {\displaystyle \zeta } should be considered as hyperparameters to tune the amount of regularization versus the sum squared error. The solution does only depend on the ratio γ = ζ / μ {\displaystyle \gamma =\zeta /\mu } , therefore the original formulation uses only γ {\displaystyle \gamma } as tuning parameter. We use both μ {\displaystyle \mu } and ζ {\displaystyle \zeta } as parameters in order to provide a Bayesian interpretation to LS-SVM. The solution of LS-SVM regressor will be obtained after we construct the Lagrangian function: { L 2 ( w , b , e , α ) = J 2 ( w , e ) − ∑ i = 1 N α i { [ w T ϕ ( x i ) + b ] + e i − y i } , = 1 2 w T w + γ 2 ∑ i = 1 N e i 2 − ∑ i = 1 N α i { [ w T ϕ ( x i ) + b ] + e i − y i } , {\displaystyle {\begin{cases}L_{2}(w,b,e,\alpha )\;=J_{2}(w,e)-\sum \limits _{i=1}^{N}\alpha _{i}\left\{{\left[{w^{T}\phi (x_{i})+b}\right]+e_{i}-y_{i}}\right\},\\\quad \quad \quad \quad \quad \;={\frac {1}{2}}w^{T}w+{\frac {\gamma }{2}}\sum \limits _{i=1}^{N}e_{i}^{2}-\sum \limits _{i=1}^{N}\alpha _{i}\left\{\left[w^{T}\phi (x_{i})+b\right]+e_{i}-y_{i}\right\},\end{cases}}} where α i ∈ R {\displaystyle \alpha _{i}\in \mathbb {R} } are the Lagrange multipliers. The conditions for optimality are { ∂ L 2 ∂ w = 0 → w = ∑ i = 1 N α i ϕ ( x i ) , ∂ L 2 ∂ b = 0 → ∑ i = 1 N α i = 0 , ∂ L 2 ∂ e i = 0 → α i = γ e i , i = 1 , … , N , ∂ L 2 ∂ α i = 0 → y i = w T ϕ ( x i ) + b + e i , i = 1 , … , N . {\displaystyle {\begin{cases}{\frac {\partial L_{2}}{\partial w}}=0\quad \to \quad w=\sum \limits _{i=1}^{N}\alpha _{i}\phi (x_{i}),\\{\frac {\partial L_{2}}{\partial b}}=0\quad \to \quad \sum \limits _{i=1}^{N}\alpha _{i}=0,\\{\frac {\partial L_{2}}{\partial e_{i}}}=0\quad \to \quad \alpha _{i}=\gamma e_{i},\;i=1,\ldots ,N,\\{\frac {\partial L_{2}}{\partial \alpha _{i}}}=0\quad \to \quad y_{i}=w^{T}\phi (x_{i})+b+e_{i},\,i=1,\ldots ,N.\end{cases}}} Elimination of w {\displaystyle w} and e {\displaystyle e} will yield a linear system instead of a quadratic programming problem: [ 0 1 N T 1 N Ω + γ − 1 I N ] [ b α ] = [ 0 Y ] , {\displaystyle \left[{\begin{matrix}0&1_{N}^{T}\\1_{N}&\Omega +\gamma ^{-1}I_{N}\end{matrix}}\right]\left[{\begin{matrix}b\\\alpha \end{matrix}}\right]=\left[{\begin{matrix}0\\Y\end{matrix}}\right],} with Y = [ y 1 , … , y N ] T {\displaystyle Y=[y_{1},\ldots ,y_{N}]^{T}} , 1 N = [ 1 , … , 1 ] T {\displaystyle 1_{N}=[1,\ldots ,1]^{T}} and α = [ α 1 , … , α N ] T {\displaystyle \alpha =[\alpha _{1},\ldots ,\alpha _{N}]^{T}} . Here, I N {\displaystyle I_{N}} is an N × N {\displaystyle N\times N} identity matrix, and Ω ∈ R N × N {\displaystyle \Omega \in \mathbb {R} ^{N\times N}} is the kernel matrix defined by Ω i j = ϕ ( x i ) T ϕ ( x j ) = K ( x i , x j ) {\displaystyle \Omega _{ij}=\phi (x_{i})^{T}\phi (x_{j})=K(x_{i},x_{j})} . === Kernel function K === For the kernel function K(•, •) one typically has the following choices: Linear kernel : K ( x , x i ) = x i T x , {\displaystyle K(x,x_{i})=x_{i}^{T}x,} Polynomial kernel of degree d {\displaystyle d} : K ( x , x i ) = ( 1 + x i T x / c ) d , {\displaystyle K(x,x_{i})=\left({1+x_{i}^{T}x/c}\right)^{d},} Radial basis function RBF kernel : K ( x , x i ) = exp ⁡ ( − ‖ x − x i ‖ 2 / σ 2 ) , {\displaystyle K(x,x_{i})=\exp \left({-\left\|{x-x_{i}}\right\|^{2}/\sigma ^{2}}\right),} MLP kernel : K ( x , x i ) = tanh ⁡ ( k x i T x + θ ) , {\displaystyle K(x,x_{i})=\tanh \left({k

    Read more →
  • Targeted maximum likelihood estimation

    Targeted maximum likelihood estimation

    Targeted Maximum Likelihood Estimation (TMLE) (also more accurately referred to as Targeted Minimum Loss-Based Estimation) is a general statistical estimation framework for causal inference and semiparametric models. TMLE combines ideas from maximum likelihood estimation, semiparametric efficiency theory, and machine learning. It was introduced by Mark J. van der Laan and colleagues in the mid-2000s as a method that yields asymptotically efficient plug-in estimators while allowing the use of flexible, data-adaptive algorithms such as ensemble machine learning for nuisance parameter estimation. TMLE is used in epidemiology, biostatistics, and the social sciences to estimate causal effects in observational and experimental studies. Applications of TMLE include Longitudinal TMLE (LTMLE) for time-varying treatments and confounders. Variations in how the targeting step in TMLE is carried out have resulted in various versions of TMLE such as Collaborative TMLE (CTMLE) and Adaptive TMLE for improved finite-sample performance and automated variable selection. == History == The TMLE framework was first described by van der Laan and Rubin (2006) as a general approach for the construction of efficient plug-in estimators of smooth features of the data density. It was demonstrated in the context of causal inference and missing data problems. It was developed to address limitations of traditional doubly robust methods, such as Augmented Inverse Probability Weighting (AIPW), by respecting the plug-in principle in the sense that it respects that the target parameter is a function of the data density that is an element of the statistical model. TMLE estimates the data density or relevant parts of it with machine learning and targets these machine learning fits before it is plugged in the target parameter mapping. In this manner, a TMLE always respects global knowledge and satisfies known bounds such as that the target parameter is a probability . Since its introduction, TMLE has been developed in a series of theoretical and applied papers, culminating in book-length treatments of the method and its applications to survival analysis, adaptive designs, and longitudinal data. == Methodology == At its core, TMLE is a two-step estimation procedure: Initial estimation: Machine learning methods (such as the Super Learner ensemble) are used to obtain flexible estimates of nuisance parameters, such as outcome regressions and propensity scores. Targeting step: The initial estimate is updated by solving a score equation (the efficient influence function) so that the final estimator is consistent, asymptotically normal, and efficient under mild regularity conditions. The targeted machine learning fit is then mapped into the corresponding estimator of the target parameter by simply plugging it in the target parameter mapping. This approach balances the bias–variance trade-off by combining data-adaptive estimation with semiparametric efficiency theory. TMLE is doubly robust, meaning it remains consistent if either the outcome model or the treatment model is consistently estimated. === Formula === Here we explain the TMLE of the average treatment effect of a binary treatment on an outcome adjusting for baseline covariates. Consider i.i.d. observations O i = ( W i , A i , Y i ) {\displaystyle O_{i}=(W_{i},A_{i},Y_{i})} from a distribution P 0 {\displaystyle P_{0}} , where W {\displaystyle W} are baseline covariates, A {\displaystyle A} is a binary treatment, and Y {\displaystyle Y} is an outcome. Let Q ¯ ( a , w ) = E [ Y ∣ A = a , W = w ] {\displaystyle {\bar {Q}}(a,w)=\mathbb {E} [Y\mid A=a,W=w]} represent the outcome model and g ( a ∣ w ) = P ( A = a ∣ W = w ) {\displaystyle g(a\mid w)=P(A=a\mid W=w)} represent the propensity score. The average treatment effect (ATE) is given by ψ 0 = E { Q ¯ ( 1 , W ) − Q ¯ ( 0 , W ) } . {\displaystyle \psi _{0}=\mathbb {E} \{{\bar {Q}}(1,W)-{\bar {Q}}(0,W)\}.} A basic TMLE for the ATE proceeds as follows: Step 1: Estimate initial models. Obtain estimates Q ¯ ^ ( a , w ) {\displaystyle {\hat {\bar {Q}}}(a,w)} and g ^ ( a ∣ w ) {\displaystyle {\hat {g}}(a\mid w)} , often using flexible methods such as Super Learner. Step 2: Compute the clever covariate. Define: H ( A , W ) = A g ^ ( 1 ∣ W ) − 1 − A g ^ ( 0 ∣ W ) . {\displaystyle H(A,W)={\frac {A}{{\hat {g}}(1\mid W)}}-{\frac {1-A}{{\hat {g}}(0\mid W)}}.} Step 3: Estimate the fluctuation parameter. Fit a logistic regression of Y {\displaystyle Y} on H ( A , W ) {\displaystyle H(A,W)} with logit ⁡ ( Q ¯ ^ ( A , W ) ) {\displaystyle \operatorname {logit} ({\hat {\bar {Q}}}(A,W))} as offset. This yields ε ^ {\displaystyle {\hat {\varepsilon }}} , the MLE that solves the score equation: 1 n ∑ i = 1 n H ( A i , W i ) { Y i − Q ¯ ^ ε ( A i , W i ) } = 0. {\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}H(A_{i},W_{i}){\big \{}Y_{i}-{\hat {\bar {Q}}}^{\varepsilon }(A_{i},W_{i}){\big \}}=0.} Step 4: Update the initial estimate. Apply the "blip" to obtain the targeted estimate: Q ¯ ^ ∗ ( A , W ) = expit ⁡ ( logit ⁡ ( Q ¯ ^ ( A , W ) ) + ε ^ H ( A , W ) ) . {\displaystyle {\hat {\bar {Q}}}^{}(A,W)=\operatorname {expit} {\Big (}\operatorname {logit} {\big (}{\hat {\bar {Q}}}(A,W){\big )}+{\hat {\varepsilon }}\,H(A,W){\Big )}.} Step 5: Compute the TMLE. The ATE estimate is: ψ ^ TMLE = 1 n ∑ i = 1 n [ Q ¯ ^ ∗ ( 1 , W i ) − Q ¯ ^ ∗ ( 0 , W i ) ] . {\displaystyle {\hat {\psi }}_{\text{TMLE}}={\frac {1}{n}}\sum _{i=1}^{n}{\big [}{\hat {\bar {Q}}}^{}(1,W_{i})-{\hat {\bar {Q}}}^{}(0,W_{i}){\big ]}.} Inference. The efficient influence function (EIF) for the ATE is: D ∗ ( O ) = H ( A , W ) { Y − Q ¯ ∗ ( A , W ) } + Q ¯ ∗ ( 1 , W ) − Q ¯ ∗ ( 0 , W ) − ψ . {\displaystyle D^{}(O)=H(A,W)\{Y-{\bar {Q}}^{}(A,W)\}+{\bar {Q}}^{}(1,W)-{\bar {Q}}^{}(0,W)-\psi .} The variance is estimated by σ ^ 2 = n − 1 ∑ i = 1 n ( D ∗ ( O i ) ) 2 {\displaystyle {\hat {\sigma }}^{2}=n^{-1}\sum _{i=1}^{n}{\big (}D^{}(O_{i}){\big )}^{2}} , yielding Wald-type confidence intervals ψ ^ TMLE ± z 1 − α / 2 σ ^ / n {\displaystyle {\hat {\psi }}_{\text{TMLE}}\pm z_{1-\alpha /2}\,{\hat {\sigma }}/{\sqrt {n}}} . Remark. For continuous outcomes, a linear fluctuation Q ¯ ^ ∗ = Q ¯ ^ + ε ^ H {\displaystyle {\hat {\bar {Q}}}^{}={\hat {\bar {Q}}}+{\hat {\varepsilon }}\,H} may be used instead. For bounded continuous outcomes, the logistic fluctuation (after rescaling Y {\displaystyle Y} to [ 0 , 1 ] {\displaystyle [0,1]} ) is often preferred for improved finite-sample performance. == Applications == TMLE has been applied in: Epidemiology: Estimating causal effects of exposures and interventions in observational cohort studies. Clinical trials and real-world evidence: The Targeted Learning roadmap provides a structured framework for generating and validating real-world evidence (RWE), bridging randomized trials and observational data using TMLE and related estimation techniques. This approach enables transparency, sensitivity analysis, and stronger causal inference for regulatory and clinical trial contexts. High-dimensional settings: Integration with ensemble methods for causal effect estimation. TMLE has been successfully applied in pharmacoepidemiology where a large number of covariates are automatically selected to adjust for confounding. In a study of post–myocardial infarction statin use and 1-year mortality, TMLE demonstrated robust performance relative to inverse probability weighting in scenarios with hundreds of potential confounders. == Derivatives and extensions == Longitudinal TMLE (LTMLE): A methodological extension of TMLE for longitudinal data with time-varying treatments, confounders, and censoring. It allows the estimation of dynamic treatment regimes and intervention-specific causal effects over time. This framework was originally introduced by van der Laan & Gruber (2012). Collaborative TMLE (CTMLE): Enhances finite-sample performance and variable selection by collaboratively fitting the treatment mechanism in conjunction with the target parameter. == Software == Several R packages implement TMLE and related methods: tmle: Functions for binary, categorical, and continuous outcomes. ltmle: Implementation for longitudinal data with time-varying treatments and outcomes. ctmle: Algorithms for collaborative TMLE and adaptive variable selection. SuperLearner: A theoretically grounded, cross-validated ensemble learning method that combines predictions from multiple algorithms to minimize predictive risk. Widely used in TMLE for estimating nuisance parameters. The original implementation is available as the R package SuperLearner. Recent machine learning platforms like H2O AutoML implement similar ensemble strategies, combining diverse learners in parallel and leveraging stacking and blending techniques, effectively functioning as a large-scale Super Learner.

    Read more →
  • Percept (artificial intelligence)

    Percept (artificial intelligence)

    A percept is the input that an intelligent agent is perceiving at any given moment. It is essentially the same concept as a percept in psychology, except that it is being perceived not by the brain but by the agent. A percept is detected by a sensor, often a camera, processed accordingly, and acted upon by an actuator. Each percept is added to a "percept sequence", which is a complete history of each percept ever detected. The agent's action at any instant point may depend on the entire percept sequence up to that particular instant point. An intelligent agent chooses how to act not only based on the current percept, but the percept sequence. The next action is chosen by the agent function, which maps every percept to an action. For example, if a camera were to record a gesture, the agent would process the percepts, calculate the corresponding spatial vectors, examine its percept history, and use the agent program (the application of the agent function) to act accordingly. == Examples == Examples of percepts include inputs from touch sensors, cameras, infrared sensors, sonar, microphones, mice, and keyboards. A percept can also be a higher-level feature of the data, such as lines, depth, objects, faces, or gestures.

    Read more →
  • Time-aware long short-term memory

    Time-aware long short-term memory

    Time-aware LSTM (T-LSTM) is a long short-term memory (LSTM) unit capable of handling irregular time intervals in longitudinal patient records. T-LSTM was developed by researchers from Michigan State University, IBM Research, and Cornell University and was first presented in the Knowledge Discovery and Data Mining (KDD) conference. Experiments using real and synthetic data proved that T-LSTM auto-encoder outperformed widely used frameworks including LSTM and MF1-LSTM auto-encoders.

    Read more →
  • Oscillatory neural network

    Oscillatory neural network

    An oscillatory neural network (ONN) is an artificial neural network that uses coupled oscillators as neurons. Oscillatory neural networks are closely linked to the Kuramoto model, and are inspired by the phenomenon of neural oscillations in the brain. Oscillatory neural networks have been trained to recognize images. Complex-Valued Oscillatory network has also been shown to store and retrieve multidimensional aperiodic signals. An oscillatory autoencoder has also been demonstrated, which uses a combination of oscillators and rate-coded neurons. A neuron made of two coupled oscillators, one having a fixed and the other having a tunable natural frequency, has been shown able to run logic gates such as XOR that conventional sigmoid neurons cannot.

    Read more →
  • Shattered set

    Shattered set

    A class of sets is said to shatter another set if it is possible to "pick out" any element of that set using intersection. The concept of shattered sets plays an important role in Vapnik–Chervonenkis theory, also known as VC-theory. Shattering and VC-theory are used in the study of empirical processes as well as in statistical computational learning theory. == Definition == Suppose A is a set and C is a class of sets. The class C shatters the set A if for each subset a of A, there is some element c of C such that a = c ∩ A . {\displaystyle a=c\cap A.} Equivalently, C shatters A when their intersection is equal to A's power set: P(A) = { c ∩ A | c ∈ C }. We employ the letter C to refer to a "class" or "collection" of sets, as in a Vapnik–Chervonenkis class (VC-class). The set A is often assumed to be finite because, in empirical processes, we are interested in the shattering of finite sets of data points. == Example == We will show that the class of all discs in the plane (two-dimensional space) does not shatter every set of four points on the unit circle, yet the class of all convex sets in the plane does shatter every finite set of points on the unit circle. Let A be a set of four points on the unit circle and let C be the class of all discs. To test where C shatters A, we attempt to draw a disc around every subset of points in A. First, we draw a disc around the subsets of each isolated point. Next, we try to draw a disc around every subset of point pairs. This turns out to be doable for adjacent points, but impossible for points on opposite sides of the circle. Any attempt to include those points on the opposite side will necessarily include other points not in that pair. Hence, any pair of opposite points cannot be isolated out of A using intersections with class C and so C does not shatter A. As visualized below: Because there is some subset which can not be isolated by any disc in C, we conclude then that A is not shattered by C. And, with a bit of thought, we can prove that no set of four points is shattered by this C. However, if we redefine C to be the class of all elliptical discs, we find that we can still isolate all the subsets from above, as well as the points that were formerly problematic. Thus, this specific set of 4 points is shattered by the class of elliptical discs. Visualized below: With a bit of thought, we could generalize that any set of finite points on a unit circle could be shattered by the class of all convex sets (visualize connecting the dots). == Shatter coefficient == To quantify the richness of a collection C of sets, we use the concept of shattering coefficients (also known as the growth function). For a collection C of sets s ⊂ Ω {\displaystyle s\subset \Omega } , Ω {\displaystyle \Omega } being any space, often a sample space, we define the nth shattering coefficient of C as S C ( n ) = max ∀ x 1 , x 2 , … , x n ∈ Ω card ⁡ { { x 1 , x 2 , … , x n } ∩ s , s ∈ C } {\displaystyle S_{C}(n)=\max _{\forall x_{1},x_{2},\dots ,x_{n}\in \Omega }\operatorname {card} \{\,\{\,x_{1},x_{2},\dots ,x_{n}\}\cap s,s\in C\}} where card {\displaystyle \operatorname {card} } denotes the cardinality of the set and x 1 , x 2 , … , x n ∈ Ω {\displaystyle x_{1},x_{2},\dots ,x_{n}\in \Omega } is any set of n points,. S C ( n ) {\displaystyle S_{C}(n)} is the largest number of subsets of any set A of n points that can be formed by intersecting A with the sets in collection C. For example, if set A contains 3 points, its power set, P ( A ) {\displaystyle P(A)} , contains 2 3 = 8 {\displaystyle 2^{3}=8} elements. If C shatters A, its shattering coefficient(3) would be 8 and S C ( 2 ) {\displaystyle S_{C}(2)} would be 2 2 = 4 {\displaystyle 2^{2}=4} . However, if one of those sets in P ( A ) {\displaystyle P(A)} cannot be obtained through intersections in c, then S C ( 3 ) {\displaystyle S_{C}(3)} would only be 7. If none of those sets can be obtained, S C ( 3 ) {\displaystyle S_{C}(3)} would be 0. Additionally, if S C ( 2 ) = 3 {\displaystyle S_{C}(2)=3} , for example, then there is an element in the set of all 2-point sets from A that cannot be obtained from intersections with C. It follows from this that S C ( 3 ) {\displaystyle S_{C}(3)} would also be less than 8 (i.e. C would not shatter A) because we have already located a "missing" set in the smaller power set of 2-point sets. This example illustrates some properties of S C ( n ) {\displaystyle S_{C}(n)} : S C ( n ) ≤ 2 n {\displaystyle S_{C}(n)\leq 2^{n}} for all n because { s ∩ A | s ∈ C } ⊆ P ( A ) {\displaystyle \{s\cap A|s\in C\}\subseteq P(A)} for any A ⊆ Ω {\displaystyle A\subseteq \Omega } . If S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} , that means there is a set of cardinality n, which can be shattered by C. If S C ( N ) < 2 N {\displaystyle S_{C}(N)<2^{N}} for some N > 1 {\displaystyle N>1} then S C ( n ) < 2 n {\displaystyle S_{C}(n)<2^{n}} for all n ≥ N {\displaystyle n\geq N} . The third property means that if C cannot shatter any set of cardinality N then it can not shatter sets of larger cardinalities. == Vapnik–Chervonenkis class == If A cannot be shattered by C, there will be a smallest value of n that makes the shatter coefficient(n) less than 2 n {\displaystyle 2^{n}} because as n gets larger, there are more sets that could be missed. Alternatively, there is also a largest value of n for which the S C ( n ) {\displaystyle S_{C}(n)} is still 2 n {\displaystyle 2^{n}} , because as n gets smaller, there are fewer sets that could be omitted. The extreme of this is S C ( 0 ) {\displaystyle S_{C}(0)} (the shattering coefficient of the empty set), which must always be 2 0 = 1 {\displaystyle 2^{0}=1} . These statements lends themselves to defining the VC dimension of a class C as: V C ( C ) = min n { n : S C ( n ) < 2 n } {\displaystyle VC(C)={\underset {n}{\min }}\{n:S_{C}(n)<2^{n}\}\,} or, alternatively, as V C 0 ( C ) = max n { n : S C ( n ) = 2 n } . {\displaystyle VC_{0}(C)={\underset {n}{\max }}\{n:S_{C}(n)=2^{n}\}.\,} Note that V C ( C ) = V C 0 ( C ) + 1. {\displaystyle VC(C)=VC_{0}(C)+1.} . The VC dimension is usually defined as V C 0 {\displaystyle VC_{0}} , the largest cardinality of points chosen that will still shatter A (i.e. n such that S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} ). Altneratively, if for any n there is a set of cardinality n which can be shattered by C, then S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} for all n and the VC dimension of this class C is infinite. A class with finite VC dimension is called a Vapnik–Chervonenkis class or VC class. A class C is uniformly Glivenko–Cantelli if and only if it is a VC class.

    Read more →
  • Non-human

    Non-human

    Non-human (also spelled nonhuman) is any entity displaying some, but not enough, human characteristics to be considered a human. The term has been used in a variety of contexts and may refer to objects that have been developed with human intelligence, such as robots or vehicles. == Organisms == === Animal rights and personhood === In the animal rights movement, it is common to distinguish between "human animals" and "non-human animals". Participants in the animal rights movement generally recognize that non-human animals have some similar characteristics to those of human persons. For example, various non-human animals have been shown to register pain, compassion, memory, and some cognitive function. Some animal rights activists argue that the similarities between human and non-human animals justify giving non-human animals rights that human society has afforded to humans, such as the right to self-preservation, and some even wish for all non-human animals or at least those that bear a fully thinking and conscious mind, such as vertebrates and some invertebrates such as cephalopods, to be given a full right of personhood. === The non-human in philosophy === Contemporary philosophers have drawn on the work of Henri Bergson, Gilles Deleuze, Félix Guattari, and Claude Lévi-Strauss (among others) to suggest that the non-human poses epistemological and ontological problems for humanist and post-humanist ethics, and have linked the study of non-humans to materialist and ethological approaches to the study of society and culture. == Software and robots == The term non-human has been used to describe computer programs and robot-like devices that display some human-like characteristics. In both science fiction and in the real world, computer programs and robots have been built to perform tasks that require human-computer interactions in a manner that suggests sentience and compassion. There is increasing interest in the use of robots in nursing homes and to provide elder care. Computer programs have been used for years in schools to provide one-on-one education with children. The Tamagotchi toy required children to provide care, attention, and nourishment to keep it "alive".

    Read more →
  • Memtransistor

    Memtransistor

    The memtransistor (a blend word from Memory Transfer Resistor) is an experimental multi-terminal passive electronic component that might be used in the construction of artificial neural networks. It is a combination of the memristor and transistor technology. This technology is different from the 1T-1R approach since the devices are merged into one single entity. Multiple memristors can be embedded with a single transistor, enabling it to more accurately model a neuron with its multiple synaptic connections. A neural network produced from these would provide hardware-based artificial intelligence with a good foundation. == Applications == These types of devices would allow for a synapse model that could realise a learning rule, by which the synaptic efficacy is altered by voltages applied to the terminals of the device. An example of such a learning rule is spike-timing-dependant-plasticty by which the weight of the synapse, in this case the conductivity, could be modulated based on the timing of pre and post synaptic spikes arriving at each terminal. The advantage of this approach over two terminal memristive devices is that read and write protocols have the possibility to occur simultaneously and distinctly.

    Read more →
  • Large margin nearest neighbor

    Large margin nearest neighbor

    Large margin nearest neighbor (LMNN) classification is a statistical machine learning algorithm for metric learning. It learns a pseudometric designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming, a sub-class of convex optimization. The goal of supervised learning (more specifically classification) is to learn a decision rule that can categorize data instances into pre-defined classes. The k-nearest neighbor rule assumes a training data set of labeled instances (i.e. the classes are known). It classifies a new data instance with the class obtained from the majority vote of the k closest (labeled) training instances. Closeness is measured with a pre-defined metric. Large margin nearest neighbors is an algorithm that learns this global (pseudo-)metric in a supervised fashion to improve the classification accuracy of the k-nearest neighbor rule. == Setup == The main intuition behind LMNN is to learn a pseudometric under which all data instances in the training set are surrounded by at least k instances that share the same class label. If this is achieved, the leave-one-out error (a special case of cross validation) is minimized. Let the training data consist of a data set D = { ( x → 1 , y 1 ) , … , ( x → n , y n ) } ⊂ R d × C {\displaystyle D=\{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}\subset R^{d}\times C} , where the set of possible class categories is C = { 1 , … , c } {\displaystyle C=\{1,\dots ,c\}} . The algorithm learns a pseudometric of the type d ( x → i , x → j ) = ( x → i − x → j ) ⊤ M ( x → i − x → j ) {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})=({\vec {x}}_{i}-{\vec {x}}_{j})^{\top }\mathbf {M} ({\vec {x}}_{i}-{\vec {x}}_{j})} . For d ( ⋅ , ⋅ ) {\displaystyle d(\cdot ,\cdot )} to be well defined, the matrix M {\displaystyle \mathbf {M} } needs to be positive semi-definite. The Euclidean metric is a special case, where M {\displaystyle \mathbf {M} } is the identity matrix. This generalization is often (falsely) referred to as Mahalanobis metric. Figure 1 illustrates the effect of the metric under varying M {\displaystyle \mathbf {M} } . The two circles show the set of points with equal distance to the center x → i {\displaystyle {\vec {x}}_{i}} . In the Euclidean case this set is a circle, whereas under the modified (Mahalanobis) metric it becomes an ellipsoid. The algorithm distinguishes between two types of special data points: target neighbors and impostors. === Target neighbors === Target neighbors are selected before learning. Each instance x → i {\displaystyle {\vec {x}}_{i}} has exactly k {\displaystyle k} different target neighbors within D {\displaystyle D} , which all share the same class label y i {\displaystyle y_{i}} . The target neighbors are the data points that should become nearest neighbors under the learned metric. Let us denote the set of target neighbors for a data point x → i {\displaystyle {\vec {x}}_{i}} as N i {\displaystyle N_{i}} . === Impostors === An impostor of a data point x → i {\displaystyle {\vec {x}}_{i}} is another data point x → j {\displaystyle {\vec {x}}_{j}} with a different class label (i.e. y i ≠ y j {\displaystyle y_{i}\neq y_{j}} ) which is one of the nearest neighbors of x → i {\displaystyle {\vec {x}}_{i}} . During learning the algorithm tries to minimize the number of impostors for all data instances in the training set. == Algorithm == Large margin nearest neighbors optimizes the matrix M {\displaystyle \mathbf {M} } with the help of semidefinite programming. The objective is twofold: For every data point x → i {\displaystyle {\vec {x}}_{i}} , the target neighbors should be close and the impostors should be far away. Figure 1 shows the effect of such an optimization on an illustrative example. The learned metric causes the input vector x → i {\displaystyle {\vec {x}}_{i}} to be surrounded by training instances of the same class. If it was a test point, it would be classified correctly under the k = 3 {\displaystyle k=3} nearest neighbor rule. The first optimization goal is achieved by minimizing the average distance between instances and their target neighbors ∑ i , j ∈ N i d ( x → i , x → j ) {\displaystyle \sum _{i,j\in N_{i}}d({\vec {x}}_{i},{\vec {x}}_{j})} . The second goal is achieved by penalizing distances to impostors x → l {\displaystyle {\vec {x}}_{l}} that are less than one unit further away than target neighbors x → j {\displaystyle {\vec {x}}_{j}} (and therefore pushing them out of the local neighborhood of x → i {\displaystyle {\vec {x}}_{i}} ). The resulting value to be minimized can be stated as: ∑ i , j ∈ N i , l , y l ≠ y i [ d ( x → i , x → j ) + 1 − d ( x → i , x → l ) ] + {\displaystyle \sum _{i,j\in N_{i},l,y_{l}\neq y_{i}}[d({\vec {x}}_{i},{\vec {x}}_{j})+1-d({\vec {x}}_{i},{\vec {x}}_{l})]_{+}} With a hinge loss function [ ⋅ ] + = max ( ⋅ , 0 ) {\textstyle [\cdot ]_{+}=\max(\cdot ,0)} , which ensures that impostor proximity is not penalized when outside the margin. The margin of exactly one unit fixes the scale of the matrix M {\displaystyle M} . Any alternative choice c > 0 {\displaystyle c>0} would result in a rescaling of M {\displaystyle M} by a factor of 1 / c {\displaystyle 1/c} . The final optimization problem becomes: min M ∑ i , j ∈ N i d ( x → i , x → j ) + λ ∑ i , j , l ξ i j l {\displaystyle \min _{\mathbf {M} }\sum _{i,j\in N_{i}}d({\vec {x}}_{i},{\vec {x}}_{j})+\lambda \sum _{i,j,l}\xi _{ijl}} ∀ i , j ∈ N i , l , y l ≠ y i {\displaystyle \forall _{i,j\in N_{i},l,y_{l}\neq y_{i}}} d ( x → i , x → j ) + 1 − d ( x → i , x → l ) ≤ ξ i j l {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})+1-d({\vec {x}}_{i},{\vec {x}}_{l})\leq \xi _{ijl}} ξ i j l ≥ 0 {\displaystyle \xi _{ijl}\geq 0} M ⪰ 0 {\displaystyle \mathbf {M} \succeq 0} The hyperparameter λ > 0 {\textstyle \lambda >0} is some positive constant (typically set through cross-validation). Here the variables ξ i j l {\displaystyle \xi _{ijl}} (together with two types of constraints) replace the term in the cost function. They play a role similar to slack variables to absorb the extent of violations of the impostor constraints. The last constraint ensures that M {\displaystyle \mathbf {M} } is positive semi-definite. The optimization problem is an instance of semidefinite programming (SDP). Although SDPs tend to suffer from high computational complexity, this particular SDP instance can be solved very efficiently due to the underlying geometric properties of the problem. In particular, most impostor constraints are naturally satisfied and do not need to be enforced during runtime (i.e. the set of variables ξ i j l {\displaystyle \xi _{ijl}} is sparse). A particularly well suited solver technique is the working set method, which keeps a small set of constraints that are actively enforced and monitors the remaining (likely satisfied) constraints only occasionally to ensure correctness. == Extensions and efficient solvers == LMNN was extended to multiple local metrics in the 2008 paper. This extension significantly improves the classification error, but involves a more expensive optimization problem. In their 2009 publication in the Journal of Machine Learning Research, Weinberger and Saul derive an efficient solver for the semi-definite program. It can learn a metric for the MNIST handwritten digit data set in several hours, involving billions of pairwise constraints. An open source Matlab implementation is freely available at the authors web page. Kumal et al. extended the algorithm to incorporate local invariances to multivariate polynomial transformations and improved regularization.

    Read more →
  • TabPFN

    TabPFN

    TabPFN (Tabular Prior-data Fitted Network) is a machine learning model for tabular datasets proposed in 2022. It uses a transformer architecture. It is intended for supervised classification and regression analysis on tabular datasets, particularly focusing on small- to medium-sized datasets. The latest version, TabPFN-3, was released in May 2026 and supports datasets with up to one million rows and 200 features. == History == TabPFN was first introduced in a 2022 pre-print and presented at ICLR 2023. TabPFN v2 was published in 2025 in Nature by Hollmann and co-authors. The source code is published on GitHub under a modified Apache License and on PyPi. Writing for ICLR blogs, McCarter states that the model has attracted attention due to its performance on small dataset benchmarks. TabPFN v2.5 was released on November 6, 2025. TabPFN-3 was released on May 12, 2026. Prior Labs, founded in 2024, aims to commercialize TabPFN. As of April 2026, the open-source TabPFN repository had more than 6,000 stars on GitHub. == Overview and pre-training == TabPFN supports classification, regression and generative tasks. It leverages "Prior-Data Fitted Networks" models to model tabular data. By using a transformer pre-trained on synthetic tabular datasets, TabPFN avoids benchmark contamination and costs of curating real-world data. TabPFN v2 was pre-trained on approximately 130 million such datasets. Synthetic datasets are generated using causal models or Bayesian neural networks; this can include simulating missing values, imbalanced data, and noise. Random inputs are passed through these models to generate outputs, with a bias towards simpler causal structures. During pre-training, TabPFN predicts the masked target values of new data points given training data points and their known targets, effectively learning a generic learning algorithm that is executed by running a neural network forward pass. The new dataset is then processed in a single forward pass without retraining. The model's transformer encoder processes features and labels by alternating attention across rows and columns. TabPFN v2 handles numerical and categorical features, missing values, and supports tasks like regression and synthetic data generation, while TabPFN-2.5 scales this approach to datasets with up to 50,000 rows and 2,000 features. TabPFN-3 introduced a redesigned architecture with row-compression, an attention-based many-class decoder, native missing-value handling, and inference optimizations such as row chunking and a reduced key-value cache, with benchmark-validated regimes of up to 1 million rows with 200 features, 100,000 rows with 2,000 features, or 1,000 rows with 20,000 features. Since TabPFN is pre-trained, in contrast to other deep learning methods, it does not require costly hyperparameter optimization. == Research == TabPFN is the subject of on-going research. Applications for TabPFN have been investigated for domains such as chemoproteomics, insurance risk classification, and metagenomics. In clinical research, TabPFN was used in a study on the early detection of pancreatic cancer from blood samples, where it was combined with metabolomic data and reported high diagnostic performance. == Applications == TabPFN has been used in industrial and biomedical contexts. Hitachi Ltd. has been reported to use the model for predictive maintenance in rail networks, with its use described as helping to identify track issues earlier and reduce manual inspections. In the biomedical domain, Oxford Cancer Analytics has used TabPFN in the analysis of proteomic data in lung disease research. A 2025 ML Contests report noted that the winners of DrivenData's PREPARE challenge used TabPFN to generate features for gradient-boosted decision tree models. == Limitations == TabPFN has been criticized for its "one large neural network is all you need" approach to modeling problems. Further, its performance is limited in high-dimensional and large-scale datasets. == Scaling Mode == In late November 2025, Prior Labs introduced ‘‘Scaling Mode’’, an operating mode for TabPFN designed to remove the fixed upper bound on training set size. Earlier versions of TabPFN had been optimized and validated primarily for datasets of up to 100,000 rows, whereas Scaling Mode was reported to extend support to substantially larger datasets, with benchmarked experiments on datasets containing up to 10 million rows. According to Prior Labs, Scaling Mode preserves the existing TabPFN architecture, including its alternating row-attention and column-attention design, as well as the same feature-count limits as prior releases. Inference remains based on a single forward pass rather than dataset-specific gradient-descent training, while scalability is described as being constrained primarily by available compute and memory resources. Prior Labs reported benchmark results on an internal collection of datasets ranging from 1 million to 10 million rows across industry and scientific applications. In these benchmarks, Scaling Mode was compared with CatBoost, XGBoost, LightGBM, and TabPFN 2.5 using 50,000-row subsampling. The company stated that predictive performance improved monotonically with increasing training set size and that no diminishing returns from scaling were observed within the tested range. Prior Labs also announced the release through company and executive social media channels. TabPFN-3 later incorporated related scaling improvements, including row chunking and a reduced key-value cache, into the model architecture and inference pipeline.

    Read more →
  • Kuwahara filter

    Kuwahara filter

    The Kuwahara filter is a non-linear smoothing filter used in image processing for adaptive noise reduction. Most filters that are used for image smoothing are linear low-pass filters that effectively reduce noise but also blur out the edges. However the Kuwahara filter is able to apply smoothing on the image while preserving the edges. It is named after Michiyoshi Kuwahara, Ph.D., who worked at Kyoto and Osaka Sangyo Universities in Japan, developing early medical imaging of dynamic heart muscle in the 1970s and 80s. == The Kuwahara operator == Suppose that I ( x , y ) {\displaystyle I(x,y)} is a grey scale image and that we take a square window of size 2 a + 1 {\displaystyle 2a+1} centered around a point ( x , y ) {\displaystyle (x,y)} in the image. This square can be divided into four smaller square regions Q i = 1 ⋯ 4 {\displaystyle Q_{i=1\cdots 4}} each of which will be Q i ( x , y ) = { [ x , x + a ] × [ y , y + a ] if i = 1 [ x − a , x ] × [ y , y + a ] if i = 2 [ x − a , x ] × [ y − a , y ] if i = 3 [ x , x + a ] × [ y − a , y ] if i = 4 {\displaystyle Q_{i}(x,y)={\begin{cases}\left[x,x+a\right]\times \left[y,y+a\right]&{\mbox{ if }}i=1\\\left[x-a,x\right]\times \left[y,y+a\right]&{\mbox{ if }}i=2\\\left[x-a,x\right]\times \left[y-a,y\right]&{\mbox{ if }}i=3\\\left[x,x+a\right]\times \left[y-a,y\right]&{\mbox{ if }}i=4\\\end{cases}}} where × {\displaystyle \times } is the cartesian product. Pixels located on the borders between two regions belong to both regions so there is a slight overlap between subregions. The arithmetic mean m i ( x , y ) {\displaystyle m_{i}(x,y)} and standard deviation σ i ( x , y ) {\displaystyle \sigma _{i}(x,y)} of the four regions centered around a pixel (x,y) are calculated and used to determine the value of the central pixel. The output of the Kuwahara filter Φ ( x , y ) {\displaystyle \Phi (x,y)} for any point ( x , y ) {\displaystyle (x,y)} is then given by Φ ( x , y ) = m i ( x , y ) {\textstyle \Phi (x,y)=m_{i}(x,y)} where i = a r g min j ⁡ σ j ( x , y ) {\displaystyle i=\operatorname {arg\min } _{j}\sigma _{j}(x,y)} . This means that the central pixel will take the mean value of the area that is most homogenous. The location of the pixel in relation to an edge plays a great role in determining which region will have the greater standard deviation. If for example the pixel is located on a dark side of an edge it will most probably take the mean value of the dark region. On the other hand, should the pixel be on the lighter side of an edge it will most probably take a light value. On the event that the pixel is located on the edge it will take the value of the more smooth, least textured region. The fact that the filter takes into account the homogeneity of the regions ensures that it will preserve the edges while using the mean creates the blurring effect. Similarly to the median filter, the Kuwahara filter uses a sliding window approach to access every pixel in the image. The size of the window is chosen in advance and may vary depending on the desired level of blur in the final image. Bigger windows typically result in the creation of more abstract images whereas small windows produce images that retain their detail. Typically windows are chosen to be square with sides that have an odd number of pixels for symmetry. However, there are variations of the Kuwahara filter that use rectangular windows. Additionally, the subregions do not need to overlap or have the same size as long as they cover all of the window. == Color images == For color images, the filter should not be performed by applying the filter to each RGB channel separately, and then recombining the three filtered color channels to form the filtered RGB image. The main problem with that is that the quadrants will have different standard deviations for each of the channels. For example, the upper left quadrant may have the lowest standard deviation in the red channel, but the lower right quadrant may have the lowest standard deviation in the green channel. This situation would result in the color of the central pixel to be determined by different regions, which might result in color artifacts or blurrier edges. To overcome this problem, for color images a slightly modified Kuwahara filter must be used. The image is first converted into another color space, the HSV color space. The modified filter then operates on only the "brightness" channel, the Value coordinate in the HSV model. The variance of the "brightness" of each quadrant is calculated to determine the quadrant from which the final filtered color should be taken from. The filter will produce an output for each channel which will correspond to the mean of that channel from the quadrant that had the lowest standard deviation in "brightness". This ensures that only one region will determine the RGB values of the central pixel. ImageMagick uses a similar approach, but using the Rec. 709 Luma as the brightness metric. === Julia Implementation === == Applications == Originally the Kuwahara filter was proposed for use in processing RI-angiocardiographic images of the cardiovascular system. The fact that any edges are preserved when smoothing makes it especially useful for feature extraction and segmentation and explains why it is used in medical imaging. The Kuwahara filter however also finds many applications in artistic imaging and fine-art photography due to its ability to remove textures and sharpen the edges of photographs. The level of abstraction helps create a desirable painting-like effect in artistic photographs especially in the case of the colored image version of the filter. These applications have known great success and have encouraged similar research in the field of image processing for the arts. Although the vast majority of applications have been in the field of image processing there have been cases that use modifications of the Kuwahara filter for machine learning tasks such as clustering. The Kuwahara filter has been implemented in CVIPtools. The Kuwahara filter is present as a shader node in Blender. == Drawbacks and restrictions == The Kuwahara filter despite its capabilities in edge preservation has certain drawbacks. At a first glance it is noticeable that the Kuwahara filter does not take into account the case where two regions have equal standard deviations. This is not often the case in real images since it is rather hard to find two regions with exactly the same standard deviation due to the noise that is always present. In cases where two regions have similar standard deviations the value of the center pixel could be decided at random by the noise in these regions. Again this would not be a problem if the regions had the same mean. However, it is not unusual for regions of very different means to have the same standard deviation. This makes the Kuwahara filter susceptible to noise. Different ways have been proposed for dealing with this issue, one of which is to set the value of the center pixel to ( m 1 + m 2 ) / 2 {\textstyle (m_{1}+m_{2})/2} in cases where the standard deviation of two regions do not differ more than a certain value D {\displaystyle D} . The Kuwahara filter is also known to create block artifacts in the images especially in regions of the image that are highly textured. These blocks disrupt the smoothness of the image and are considered to have a negative effect in the aesthetics of the image. This phenomenon occurs due to the division of the window into square regions. A way to overcome this effect is to take windows that are not rectangular(i.e. circular windows) and separate them into more non-rectangular regions. There have also been approaches where the filter adapts its window depending on the input image. == Extensions of the Kuwahara filter == The success of the Kuwahara filter has spurred an increase the development of edge-enhancing smoothing filters. Several variations have been proposed for similar use most of which attempt to deal with the drawbacks of the original Kuwahara filter. The "Generalized Kuwahara filter" proposed by P. Bakker considers several windows that contain a fixed pixel. Each window is then assigned an estimate and a confidence value. The value of the fixed pixel then takes the value of the estimate of the window with the highest confidence. This filter is not characterized by the same ambiguity in the presence of noise and manages to eliminate the block artifacts. The "Mean of Least Variance"(MLV) filter, proposed by M.A. Schulze also produces edge-enhancing smoothing results in images. Similarly to the Kuwahara filter it assumes a window of size 2 d − 1 × 2 d − 1 {\displaystyle 2d-1\times 2d-1} but instead of searching amongst four subregions of size d × d {\displaystyle d\times d} for the one with minimum variance it searches amongst all possible d × d {\displaystyle d\times d} subregions. This means the central pixel of the window will be assigned the mean of the one subregion out of a poss

    Read more →
  • Prototype methods

    Prototype methods

    Prototype methods are machine learning methods that use data prototypes. A data prototype is a data value that reflects other values in its class, e.g., the centroid in a K-means clustering problem. == Methods == The following are some prototype methods K-means clustering Learning vector quantization (LVQ) Gaussian mixtures == Related Methods == While K-nearest neighbor's does not use prototypes, it is similar to prototype methods like K-means clustering.

    Read more →
  • Kernel principal component analysis

    Kernel principal component analysis

    In the field of multivariate statistics, kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space. == Background: Linear PCA == Recall that conventional PCA operates on zero-centered data; that is, 1 N ∑ i = 1 N x i = 0 {\displaystyle {\frac {1}{N}}\sum _{i=1}^{N}\mathbf {x} _{i}=\mathbf {0} } , where x i {\displaystyle \mathbf {x} _{i}} is one of the N {\displaystyle N} multivariate observations. It operates by diagonalizing the covariance matrix, C = 1 N ∑ i = 1 N x i x i ⊤ {\displaystyle C={\frac {1}{N}}\sum _{i=1}^{N}\mathbf {x} _{i}\mathbf {x} _{i}^{\top }} in other words, it gives an eigendecomposition of the covariance matrix: λ v = C v {\displaystyle \lambda \mathbf {v} =C\mathbf {v} } which can be rewritten as λ x i ⊤ v = x i ⊤ C v for i = 1 , … , N {\displaystyle \lambda \mathbf {x} _{i}^{\top }\mathbf {v} =\mathbf {x} _{i}^{\top }C\mathbf {v} \quad {\textrm {for}}~i=1,\ldots ,N} . (See also: Covariance matrix as a linear operator) == Introduction of the Kernel to PCA == To understand the utility of kernel PCA, particularly for clustering, observe that, while N points cannot, in general, be linearly separated in d < N {\displaystyle d Read more →