AI Chatbot Interface

AI Chatbot Interface — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Vicuna LLM

    Vicuna LLM

    Vicuna LLM is an omnibus large language model used in AI research. Its methodology is to enable the public at large to contrast and compare the accuracy of LLMs "in the wild" (an example of citizen science) and to vote on their output; a question-and-answer chat format is used. At the beginning of each round two LLM chatbots from a diverse pool of nine are presented randomly and anonymously, their identities only being revealed upon voting on their answers. The user has the option of either replaying ("regenerating") a round, or beginning an entirely fresh one with new LLMs. (The user also has the option of choosing which LLMs to do battle.) Based on Llama 2, it is an open source project, and it itself has become the subject of academic research in the burgeoning field. A non-commercial, public demo of the Vicuna-13b model is available to access using LMSYS.

    Read more →
  • Least-squares support vector machine

    Least-squares support vector machine

    Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods. == From support-vector machine to least-squares support-vector machine == Given a training set { x i , y i } i = 1 N {\displaystyle \{x_{i},y_{i}\}_{i=1}^{N}} with input data x i ∈ R n {\displaystyle x_{i}\in \mathbb {R} ^{n}} and corresponding binary class labels y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} , the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions: { w T ϕ ( x i ) + b ≥ 1 , if y i = + 1 , w T ϕ ( x i ) + b ≤ − 1 , if y i = − 1 , {\displaystyle {\begin{cases}w^{T}\phi (x_{i})+b\geq 1,&{\text{if }}\quad y_{i}=+1,\\w^{T}\phi (x_{i})+b\leq -1,&{\text{if }}\quad y_{i}=-1,\end{cases}}} which is equivalent to y i [ w T ϕ ( x i ) + b ] ≥ 1 , i = 1 , … , N , {\displaystyle y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1,\quad i=1,\ldots ,N,} where ϕ ( x ) {\displaystyle \phi (x)} is the nonlinear map from original space to the high- or infinite-dimensional space. === Inseparable data === In case such a separating hyperplane does not exist, we introduce so-called slack variables ξ i {\displaystyle \xi _{i}} such that { y i [ w T ϕ ( x i ) + b ] ≥ 1 − ξ i , i = 1 , … , N , ξ i ≥ 0 , i = 1 , … , N . {\displaystyle {\begin{cases}y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1-\xi _{i},&i=1,\ldots ,N,\\\xi _{i}\geq 0,&i=1,\ldots ,N.\end{cases}}} According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem: min J 1 ( w , ξ ) = 1 2 w T w + c ∑ i = 1 N ξ i , {\displaystyle \min J_{1}(w,\xi )={\frac {1}{2}}w^{T}w+c\sum \limits _{i=1}^{N}\xi _{i},} Subject to { y i [ w T ϕ ( x i ) + b ] ≥ 1 − ξ i , i = 1 , … , N , ξ i ≥ 0 , i = 1 , … , N , {\displaystyle {\text{Subject to }}{\begin{cases}y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1-\xi _{i},&i=1,\ldots ,N,\\\xi _{i}\geq 0,&i=1,\ldots ,N,\end{cases}}} To solve this problem, we could construct the Lagrangian function: L 1 ( w , b , ξ , α , β ) = 1 2 w T w + c ∑ i = 1 N ξ i − ∑ i = 1 N α i { y i [ w T ϕ ( x i ) + b ] − 1 + ξ i } − ∑ i = 1 N β i ξ i , {\displaystyle L_{1}(w,b,\xi ,\alpha ,\beta )={\frac {1}{2}}w^{T}w+c\sum \limits _{i=1}^{N}{\xi _{i}}-\sum \limits _{i=1}^{N}\alpha _{i}\left\{y_{i}\left[{w^{T}\phi (x_{i})+b}\right]-1+\xi _{i}\right\}-\sum \limits _{i=1}^{N}\beta _{i}\xi _{i},} where α i ≥ 0 , β i ≥ 0 ( i = 1 , … , N ) {\displaystyle \alpha _{i}\geq 0,\ \beta _{i}\geq 0\ (i=1,\ldots ,N)} are the Lagrangian multipliers. The optimal point will be in the saddle point of the Lagrangian function, and then we obtain By substituting w {\displaystyle w} by its expression in the Lagrangian formed from the appropriate objective and constraints, we will get the following quadratic programming problem: max Q 1 ( α ) = − 1 2 ∑ i , j = 1 N α i α j y i y j K ( x i , x j ) + ∑ i = 1 N α i , {\displaystyle \max Q_{1}(\alpha )=-{\frac {1}{2}}\sum \limits _{i,j=1}^{N}{\alpha _{i}\alpha _{j}y_{i}y_{j}K(x_{i},x_{j})}+\sum \limits _{i=1}^{N}\alpha _{i},} where K ( x i , x j ) = ⟨ ϕ ( x i ) , ϕ ( x j ) ⟩ {\displaystyle K(x_{i},x_{j})=\left\langle \phi (x_{i}),\phi (x_{j})\right\rangle } is called the kernel function. Solving this QP problem subject to constraints in (1), we will get the hyperplane in the high-dimensional space and hence the classifier in the original space. === Least-squares SVM formulation === The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as min J 2 ( w , b , e ) = μ 2 w T w + ζ 2 ∑ i = 1 N e i 2 , {\displaystyle \min J_{2}(w,b,e)={\frac {\mu }{2}}w^{T}w+{\frac {\zeta }{2}}\sum \limits _{i=1}^{N}e_{i}^{2},} subject to the equality constraints y i [ w T ϕ ( x i ) + b ] = 1 − e i , i = 1 , … , N . {\displaystyle y_{i}\left[{w^{T}\phi (x_{i})+b}\right]=1-e_{i},\quad i=1,\ldots ,N.} The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets y i = ± 1 {\displaystyle y_{i}=\pm 1} . Using y i 2 = 1 {\displaystyle y_{i}^{2}=1} , we have ∑ i = 1 N e i 2 = ∑ i = 1 N ( y i e i ) 2 = ∑ i = 1 N e i 2 = ∑ i = 1 N ( y i − ( w T ϕ ( x i ) + b ) ) 2 , {\displaystyle \sum \limits _{i=1}^{N}e_{i}^{2}=\sum \limits _{i=1}^{N}(y_{i}e_{i})^{2}=\sum \limits _{i=1}^{N}e_{i}^{2}=\sum \limits _{i=1}^{N}\left(y_{i}-(w^{T}\phi (x_{i})+b)\right)^{2},} with e i = y i − ( w T ϕ ( x i ) + b ) . {\displaystyle e_{i}=y_{i}-(w^{T}\phi (x_{i})+b).} Notice, that this error would also make sense for least-squares data fitting, so that the same end results holds for the regression case. Hence the LS-SVM classifier formulation is equivalent to J 2 ( w , b , e ) = μ E W + ζ E D {\displaystyle J_{2}(w,b,e)=\mu E_{W}+\zeta E_{D}} with E W = 1 2 w T w {\displaystyle E_{W}={\frac {1}{2}}w^{T}w} and E D = 1 2 ∑ i = 1 N e i 2 = 1 2 ∑ i = 1 N ( y i − ( w T ϕ ( x i ) + b ) ) 2 . {\displaystyle E_{D}={\frac {1}{2}}\sum \limits _{i=1}^{N}e_{i}^{2}={\frac {1}{2}}\sum \limits _{i=1}^{N}\left(y_{i}-(w^{T}\phi (x_{i})+b)\right)^{2}.} Both μ {\displaystyle \mu } and ζ {\displaystyle \zeta } should be considered as hyperparameters to tune the amount of regularization versus the sum squared error. The solution does only depend on the ratio γ = ζ / μ {\displaystyle \gamma =\zeta /\mu } , therefore the original formulation uses only γ {\displaystyle \gamma } as tuning parameter. We use both μ {\displaystyle \mu } and ζ {\displaystyle \zeta } as parameters in order to provide a Bayesian interpretation to LS-SVM. The solution of LS-SVM regressor will be obtained after we construct the Lagrangian function: { L 2 ( w , b , e , α ) = J 2 ( w , e ) − ∑ i = 1 N α i { [ w T ϕ ( x i ) + b ] + e i − y i } , = 1 2 w T w + γ 2 ∑ i = 1 N e i 2 − ∑ i = 1 N α i { [ w T ϕ ( x i ) + b ] + e i − y i } , {\displaystyle {\begin{cases}L_{2}(w,b,e,\alpha )\;=J_{2}(w,e)-\sum \limits _{i=1}^{N}\alpha _{i}\left\{{\left[{w^{T}\phi (x_{i})+b}\right]+e_{i}-y_{i}}\right\},\\\quad \quad \quad \quad \quad \;={\frac {1}{2}}w^{T}w+{\frac {\gamma }{2}}\sum \limits _{i=1}^{N}e_{i}^{2}-\sum \limits _{i=1}^{N}\alpha _{i}\left\{\left[w^{T}\phi (x_{i})+b\right]+e_{i}-y_{i}\right\},\end{cases}}} where α i ∈ R {\displaystyle \alpha _{i}\in \mathbb {R} } are the Lagrange multipliers. The conditions for optimality are { ∂ L 2 ∂ w = 0 → w = ∑ i = 1 N α i ϕ ( x i ) , ∂ L 2 ∂ b = 0 → ∑ i = 1 N α i = 0 , ∂ L 2 ∂ e i = 0 → α i = γ e i , i = 1 , … , N , ∂ L 2 ∂ α i = 0 → y i = w T ϕ ( x i ) + b + e i , i = 1 , … , N . {\displaystyle {\begin{cases}{\frac {\partial L_{2}}{\partial w}}=0\quad \to \quad w=\sum \limits _{i=1}^{N}\alpha _{i}\phi (x_{i}),\\{\frac {\partial L_{2}}{\partial b}}=0\quad \to \quad \sum \limits _{i=1}^{N}\alpha _{i}=0,\\{\frac {\partial L_{2}}{\partial e_{i}}}=0\quad \to \quad \alpha _{i}=\gamma e_{i},\;i=1,\ldots ,N,\\{\frac {\partial L_{2}}{\partial \alpha _{i}}}=0\quad \to \quad y_{i}=w^{T}\phi (x_{i})+b+e_{i},\,i=1,\ldots ,N.\end{cases}}} Elimination of w {\displaystyle w} and e {\displaystyle e} will yield a linear system instead of a quadratic programming problem: [ 0 1 N T 1 N Ω + γ − 1 I N ] [ b α ] = [ 0 Y ] , {\displaystyle \left[{\begin{matrix}0&1_{N}^{T}\\1_{N}&\Omega +\gamma ^{-1}I_{N}\end{matrix}}\right]\left[{\begin{matrix}b\\\alpha \end{matrix}}\right]=\left[{\begin{matrix}0\\Y\end{matrix}}\right],} with Y = [ y 1 , … , y N ] T {\displaystyle Y=[y_{1},\ldots ,y_{N}]^{T}} , 1 N = [ 1 , … , 1 ] T {\displaystyle 1_{N}=[1,\ldots ,1]^{T}} and α = [ α 1 , … , α N ] T {\displaystyle \alpha =[\alpha _{1},\ldots ,\alpha _{N}]^{T}} . Here, I N {\displaystyle I_{N}} is an N × N {\displaystyle N\times N} identity matrix, and Ω ∈ R N × N {\displaystyle \Omega \in \mathbb {R} ^{N\times N}} is the kernel matrix defined by Ω i j = ϕ ( x i ) T ϕ ( x j ) = K ( x i , x j ) {\displaystyle \Omega _{ij}=\phi (x_{i})^{T}\phi (x_{j})=K(x_{i},x_{j})} . === Kernel function K === For the kernel function K(•, •) one typically has the following choices: Linear kernel : K ( x , x i ) = x i T x , {\displaystyle K(x,x_{i})=x_{i}^{T}x,} Polynomial kernel of degree d {\displaystyle d} : K ( x , x i ) = ( 1 + x i T x / c ) d , {\displaystyle K(x,x_{i})=\left({1+x_{i}^{T}x/c}\right)^{d},} Radial basis function RBF kernel : K ( x , x i ) = exp ⁡ ( − ‖ x − x i ‖ 2 / σ 2 ) , {\displaystyle K(x,x_{i})=\exp \left({-\left\|{x-x_{i}}\right\|^{2}/\sigma ^{2}}\right),} MLP kernel : K ( x , x i ) = tanh ⁡ ( k x i T x + θ ) , {\displaystyle K(x,x_{i})=\tanh \left({k

    Read more →
  • Distribution learning theory

    Distribution learning theory

    The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and it was inspired from the PAC-framework introduced by Leslie Valiant. In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like machine learning, approximation algorithms, applied probability and statistics. This article explains the basic definitions, tools and results in this framework from the theory of computation point of view. == Definitions == Let X {\displaystyle \textstyle X} be the support of the distributions of interest. As in the original work of Kearns et al. if X {\displaystyle \textstyle X} is finite it can be assumed without loss of generality that X = { 0 , 1 } n {\displaystyle \textstyle X=\{0,1\}^{n}} where n {\displaystyle \textstyle n} is the number of bits that have to be used in order to represent any y ∈ X {\displaystyle \textstyle y\in X} . We focus in probability distributions over X {\displaystyle \textstyle X} . There are two possible representations of a probability distribution D {\displaystyle \textstyle D} over X {\displaystyle \textstyle X} . probability distribution function (or evaluator) an evaluator E D {\displaystyle \textstyle E_{D}} for D {\displaystyle \textstyle D} takes as input any y ∈ X {\displaystyle \textstyle y\in X} and outputs a real number E D [ y ] {\displaystyle \textstyle E_{D}[y]} which denotes the probability that of y {\displaystyle \textstyle y} according to D {\displaystyle \textstyle D} , i.e. E D [ y ] = Pr [ Y = y ] {\displaystyle \textstyle E_{D}[y]=\Pr[Y=y]} if Y ∼ D {\displaystyle \textstyle Y\sim D} . generator a generator G D {\displaystyle \textstyle G_{D}} for D {\displaystyle \textstyle D} takes as input a string of truly random bits y {\displaystyle \textstyle y} and outputs G D [ y ] ∈ X {\displaystyle \textstyle G_{D}[y]\in X} according to the distribution D {\displaystyle \textstyle D} . Generator can be interpreted as a routine that simulates sampling from the distribution D {\displaystyle \textstyle D} given a sequence of fair coin tosses. A distribution D {\displaystyle \textstyle D} is called to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time. Let C X {\displaystyle \textstyle C_{X}} a class of distribution over X, that is C X {\displaystyle \textstyle C_{X}} is a set such that every D ∈ C X {\displaystyle \textstyle D\in C_{X}} is a probability distribution with support X {\displaystyle \textstyle X} . The C X {\displaystyle \textstyle C_{X}} can also be written as C {\displaystyle \textstyle C} for simplicity. In order to evaluate learnability, it is necessary to have a way to measure how well an approximated distribution D ′ {\displaystyle \textstyle D'} fits the sampled distribution D {\displaystyle \textstyle D} . There are several ways to measure the divergence between two distributions. Three common possibilities are Kullback–Leibler divergence Total variation distance of probability measures Kolmogorov distance Total variation and Kolmogorov distance are true metrics, while KL divergence is not (it lacks symmetry). These measures are ordered by convergence strength: closeness in KL divergence implies closeness in total variation (via Pinsker's inequality), which in turn implies closeness in Kolmogorov distance. Therefore, a learnability result proven under KL divergence automatically holds under the weaker measures, but not vice versa. Since certain measures may be more appropriate in specific applications, we will use d ( D , D ′ ) {\displaystyle \textstyle d(D,D')} to denote a selected divergence between the distribution D {\displaystyle \textstyle D} and the distribution D ′ {\displaystyle \textstyle D'} . The basic input that we use in order to learn a distribution is a number of samples drawn by this distribution. For the computational point of view the assumption is that such a sample is given in a constant amount of time. So it's like having access to an oracle G E N ( D ) {\displaystyle \textstyle GEN(D)} that returns a sample from the distribution D {\displaystyle \textstyle D} . Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution D {\displaystyle \textstyle D} in class of distributions C {\displaystyle \textstyle C} . This quantity is called sample complexity of the learning algorithm. In order for the problem of distribution learning to be more clear consider the problem of supervised learning as defined in. In this framework of statistical learning theory a training set S = { ( x 1 , y 1 ) , … , ( x n , y n ) } {\displaystyle \textstyle S=\{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}} and the goal is to find a target function f : X → Y {\displaystyle \textstyle f:X\rightarrow Y} that minimizes some loss function, e.g. the square loss function. More formally f = arg ⁡ min g ∫ V ( y , g ( x ) ) d ρ ( x , y ) {\displaystyle f=\arg \min _{g}\int V(y,g(x))d\rho (x,y)} , where V ( ⋅ , ⋅ ) {\displaystyle V(\cdot ,\cdot )} is the loss function, e.g. V ( y , z ) = ( y − z ) 2 {\displaystyle V(y,z)=(y-z)^{2}} and ρ ( x , y ) {\displaystyle \rho (x,y)} the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution ρ x ( y ) {\displaystyle \rho _{x}(y)} is known then the target function has the closed form f ( x ) = ∫ y y d ρ x ( y ) {\displaystyle f(x)=\int _{y}yd\rho _{x}(y)} . So the set S {\displaystyle S} is a set of samples from the probability distribution ρ ( x , y ) {\displaystyle \rho (x,y)} . Now the goal of distributional learning theory if to find ρ {\displaystyle \rho } given S {\displaystyle S} which can be used to find the target function f {\displaystyle f} . Definition of learnability A class of distributions C {\displaystyle \textstyle C} is called efficiently learnable if for every ϵ > 0 {\displaystyle \textstyle \epsilon >0} and 0 < δ ≤ 1 {\displaystyle \textstyle 0<\delta \leq 1} given access to G E N ( D ) {\displaystyle \textstyle GEN(D)} for an unknown distribution D ∈ C {\displaystyle \textstyle D\in C} , there exists a polynomial time algorithm A {\displaystyle \textstyle A} , called learning algorithm of C {\displaystyle \textstyle C} , that outputs a generator or an evaluator of a distribution D ′ {\displaystyle \textstyle D'} such that Pr [ d ( D , D ′ ) ≤ ϵ ] ≥ 1 − δ {\displaystyle \Pr[d(D,D')\leq \epsilon ]\geq 1-\delta } If we know that D ′ ∈ C {\displaystyle \textstyle D'\in C} then A {\displaystyle \textstyle A} is called proper learning algorithm, otherwise is called improper learning algorithm. In some settings the class of distributions C {\displaystyle \textstyle C} is a class with well known distributions which can be described by a set of parameters. For instance C {\displaystyle \textstyle C} could be the class of all the Gaussian distributions N ( μ , σ 2 ) {\displaystyle \textstyle N(\mu ,\sigma ^{2})} . In this case the algorithm A {\displaystyle \textstyle A} should be able to estimate the parameters μ , σ {\displaystyle \textstyle \mu ,\sigma } . In this case A {\displaystyle \textstyle A} is called parameter learning algorithm. Obviously the parameter learning for simple distributions is a very well studied field that is called statistical estimation and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distributions learning theory deals with learning class of distributions that have more complicated description. == First results == In their seminal work, Kearns et al. deal with the case where A {\displaystyle \textstyle A} is described in term of a finite polynomial sized circuit and they proved the following for some specific classes of distribution. O R {\displaystyle \textstyle OR} gate distributions for this kind of distributions there is no polynomial-sized evaluator, unless # P ⊆ P / poly {\displaystyle \textstyle \#P\subseteq P/{\text{poly}}} . On the other hand, this class is efficiently learnable with generator. Parity gate distributions this class is efficiently learnable with both generator and evaluator. Mixtures of Hamming Balls this class is efficiently learnable with both generator and evaluator. Probabilistic Finite Automata this class is not efficiently learnable with evaluator under the Noisy Parity Assumption which is an impossibility assumption in the PAC learning fram

    Read more →
  • Feature selection

    Feature selection

    In machine learning, feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons: simplification of models to make them easier to interpret, shorter training times, to avoid the curse of dimensionality, improve the compatibility of the data with a certain learning model class, to encode inherent symmetries present in the input space. The central premise when using feature selection is that data sometimes contains features that are redundant or irrelevant, and can thus be removed without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated. Feature extraction creates new features from functions of the original features, whereas feature selection finds a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (data points). == Introduction == A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods. Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem. Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the mutual information, the pointwise mutual information, Pearson product-moment correlation coefficient, Relief-based algorithms, and inter/intra class distance or the scores of significance tests for each class/feature combinations. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples; Elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect which scores all the features based on combinatorial analysis of regression coefficients. AEFS further extends LASSO to nonlinear scenario with autoencoders. These approaches tend to be between filters and wrappers in terms of computational complexity. In traditional regression analysis, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network. == Subset selection == Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model. Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc. Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected. Search approaches include: Exhaustive Best first Simulated annealing Genetic algorithm Greedy forward selection Greedy backward elimination Particle swarm optimization Targeted projection pursuit Scatter search Variable neighborhood search Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information; see here. Other available filter metrics include: Class separability Error probability Inter-class distance Probabilistic distance Entropy Consistency-based feature selection Correlation-based feature selection == Optimality criteria == The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include Akaike information criterion (AIC) and Mallows's Cp, which have a penalty of 2 for each added feature. AIC is based on information theory, and is effectively derived via the maximum entropy principle. Other criteria are Bayesian information criterion (BIC), which uses a penalty of log ⁡ n {\displaystyle {\sqrt {\log {n}}}} for each added feature, minimum description length (MDL) which asymptotically uses log ⁡ n {\displaystyle {\sqrt {\log {n}}}} , Bonferroni / RIC which use 2 log ⁡ p {\displaystyle {\sqrt {2\log {p}}}} , maximum dependency feature selection, and a variety of new criteria that are motivated by false discovery rate (FDR), which use something close to 2 log ⁡ p q {\displaystyle {\sqrt {2\log {\frac {p}{q}}}}} . A maximum entropy rate criterion may also be used to select the most relevant subset of features. == Structure learning == Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms

    Read more →
  • Drop shadow

    Drop shadow

    In graphic design and computer graphics, a drop shadow is a visual effect consisting of a drawing element which looks like the shadow of an object, giving the impression that the object is raised above the objects behind it. The drop shadow is often used for elements of a graphical user interface such as windows or menus, and for simple text. The text label for icons on desktops in many desktop environments has a drop shadow, as this effect effectively distinguishes the text from any colored background it may be in front of. A simple way of drawing a drop shadow of a rectangular object is to draw a gray or black area underneath and offset from the object. In general, a drop shadow is a copy in black or gray of the object, drawn in a slightly different position. Realism may be increased by: Darkening the colors of the pixels where the shadow casts instead of making them gray. This can be done with alpha blending the shadow with the area it is cast on. Softening the edges of the shadow. This can be done by adding Gaussian blur to the shadow's alpha channel before blending. Inset drop shadows are a type which draws the shadows inside the element. This allows the interface element to appear as if it is sunken into the interface. == Photo editing == In photo editing or photography post-production, a drop shadow may be added right beneath a model or product in the image. It is used to create contrast between the background and the subject. To add a drop shadow, retouchers use graphic editing tools like Adobe Photoshop. Drop shadows are often used as a visual effect in e-commerce. This is done to improve the presentation of product images and create depth in the image. == Use == Generally, window managers which are capable of compositing allow drop shadow effects, whereas incapable window managers do not. In some operating systems like macOS, drop shadow is used to differentiate between active and inactive windows. Websites are able to use drop shadow effects through the CSS properties box-shadow, text-shadow, and drop-shadow() filter function in filter. The first two are used for elements and text respectively, while the filter applies to the element's content, letting it support oddly shaped elements or transparent images.

    Read more →
  • Independent component analysis

    Independent component analysis

    In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that at most one subcomponent is Gaussian and that the subcomponents are statistically independent from each other. ICA was invented by Jeanny Hérault and Christian Jutten in 1985. ICA is a special case of blind source separation. A common example application of ICA is the "cocktail party problem" of listening in on one person's speech in a noisy room. == Introduction == Independent component analysis attempts to decompose a multivariate signal into independent non-Gaussian signals. As an example, sound is usually a signal that is composed of the numerical addition, at each time t, of signals from several sources. The question then is whether it is possible to separate these contributing sources from the observed total signal. When the statistical independence assumption is correct, blind ICA separation of a mixed signal gives very good results. It is also used for signals that are not supposed to be generated by mixing for analysis purposes. A simple application of ICA is the "cocktail party problem", where the underlying speech signals are separated from a sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. Note that a filtered and delayed signal is a copy of a dependent component, and thus the statistical independence assumption is not violated. Mixing weights for constructing the M {\textstyle M} observed signals from the N {\textstyle N} components can be placed in an M × N {\textstyle M\times N} matrix. An important thing to consider is that if N {\textstyle N} sources are present, at least N {\textstyle N} observations (e.g. microphones if the observed signal is audio) are needed to recover the original signals. When there are an equal number of observations and source signals, the mixing matrix is square ( M = N {\textstyle M=N} ). Other cases of underdetermined ( M < N {\textstyle M N {\textstyle M>N} ) have been investigated. The success of ICA separation of mixed signals relies on two assumptions and three effects of mixing source signals. Two assumptions: The source signals are independent of each other. The values in each source signal have non-Gaussian distributions. Three effects of mixing source signals: Independence: As per assumption 1, the source signals are independent; however, their signal mixtures are not. This is because the signal mixtures share the same source signals. Normality: According to the Central Limit Theorem, the distribution of a sum of independent random variables with finite variance tends towards a Gaussian distribution.Loosely speaking, a sum of two independent random variables usually has a distribution that is closer to Gaussian than any of the two original variables. Here we consider the value of each signal as the random variable. Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal. Those principles contribute to the basic establishment of ICA. If the signals extracted from a set of mixtures are independent and have non-Gaussian distributions or have low complexity, then they must be source signals. Another common example is image steganography, where ICA is used to embed one image within another. For instance, two grayscale images can be linearly combined to create mixed images in which the hidden content is visually imperceptible. ICA can then be used to recover the original source images from the mixtures. This technique underlies digital watermarking, which allows the embedding of ownership information into images, as well as more covert applications such as undetected information transmission. The method has even been linked to real-world cyberespionage cases. In such applications, ICA serves to unmix the data based on statistical independence, making it possible to extract hidden components that are not apparent in the observed data. Steganographic techniques, including those potentially involving ICA-based analysis, have been used in real-world cyberespionage cases. In 2010, the FBI uncovered a Russian spy network known as the "Illegals Program" (Operation Ghost Stories), where agents used custom-built steganography tools to conceal encrypted text messages within image files shared online. In another case, a former General Electric engineer, Xiaoqing Zheng, was convicted in 2022 for economic espionage. Zheng used steganography to exfiltrate sensitive turbine technology by embedding proprietary data within image files for transfer to entities in China. == Defining component independence == ICA finds the independent components (also called factors, latent variables or sources) by maximizing the statistical independence of the estimated components. We may choose one of many ways to define a proxy for independence, and this choice governs the form of the ICA algorithm. The two broadest definitions of independence for ICA are Minimization of mutual information Maximization of non-Gaussianity The Minimization-of-Mutual information (MMI) family of ICA algorithms uses measures like Kullback-Leibler Divergence and maximum entropy. The non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis and negentropy. Typical algorithms for ICA use centering (subtract the mean to create a zero mean signal), whitening (usually with the eigenvalue decomposition), and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. == Mathematical definitions == Linear independent component analysis can be divided into noiseless and noisy cases, where noiseless ICA is a special case of noisy ICA. Nonlinear ICA should be considered as a separate case. === General Derivation === In the classical ICA model, it is assumed that the observed data x i ∈ R m {\displaystyle \mathbf {x} _{i}\in \mathbb {R} ^{m}} at time t i {\displaystyle t_{i}} is generated from source signals s i ∈ R m {\displaystyle \mathbf {s} _{i}\in \mathbb {R} ^{m}} via a linear transformation x i = A s i {\displaystyle \mathbf {x} _{i}=A\mathbf {s} _{i}} , where A {\displaystyle A} is an unknown, invertible mixing matrix. To recover the source signals, the data is first centered (zero mean), and then whitened so that the transformed data has unit covariance. This whitening reduces the problem from estimating a general matrix A {\displaystyle A} to estimating an orthogonal matrix V {\displaystyle V} , significantly simplifying the search for independent components. If the covariance matrix of the centered data is Σ x = A A ⊤ {\displaystyle \Sigma _{x}=AA^{\top }} , then using the eigen-decomposition Σ x = Q D Q ⊤ {\displaystyle \Sigma _{x}=QDQ^{\top }} , the whitening transformation can be taken as D − 1 / 2 Q ⊤ {\displaystyle D^{-1/2}Q^{\top }} . This step ensures that the recovered sources are uncorrelated and of unit variance, leaving only the task of rotating the whitened data to maximize statistical independence. This general derivation underlies many ICA algorithms and is foundational in understanding the ICA model. ==== Reduced Mixing Problem ==== Independent component analysis (ICA) addresses the problem of recovering a set of unobserved source signals s i = ( s i 1 , s i 2 , … , s i m ) T {\displaystyle s_{i}=(s_{i1},s_{i2},\dots ,s_{im})^{T}} from observed mixed signals x i = ( x i 1 , x i 2 , … , x i m ) T {\displaystyle x_{i}=(x_{i1},x_{i2},\dots ,x_{im})^{T}} , based on the linear mixing model: x i = A s i , {\displaystyle x_{i}=A\,s_{i},} where the A {\displaystyle A} is an m × m {\displaystyle m\times m} invertible matrix called the mixing matrix, s i {\displaystyle s_{i}} represents the m‑dimensional vector containing the values of the sources at time t i {\displaystyle t_{i}} , and x i {\displaystyle x_{i}} is the corresponding vector of observed values at time t i {\displaystyle t_{i}} . The goal is to estimate both A {\displaystyle A} and the source signals { s i } {\displaystyle \{s_{i}\}} solely from the observed data { x i } {\displaystyle \{x_{i}\}} . After centering, the Gram matrix is computed as: ( X ∗ ) T X ∗ = Q D Q T , {\displaystyle (X^{})^{T}X^{}=Q\,D\,Q^{T},} where D is a diagonal matrix with positive entries (assuming X ∗ {\displaystyle X^{}} has maximum rank), and Q is an orthogonal matrix. Writing the SVD of the mixing matrix A = U Σ V T {\displaystyle A=U\Sigma V^{T}} and comparing with A A T = U Σ 2 U T {\displaystyle AA^{T}=U\Sigma ^{2}U^{T}} the mixing A has the form A = Q D 1 / 2 V T . {\displaystyle A=Q\,D^{1/2}\,V^{T}.} So, the normalized source values satisfy s i ∗ = V y i ∗ {\displaystyle s_{i}^{}=V\,y_{i}^{}} , where y i ∗ = D − 1 2 Q T x i ∗ . {\displaystyle y_{i}^{}=D^{-{\tfrac {1}{2}}}Q^{T}x_{i}^{}.} Thus, ICA reduces

    Read more →
  • Density-based clustering validation

    Density-based clustering validation

    Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like DBSCAN, Mean shift, and OPTICS. This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations. Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence. This metric was introduced in 2014 by David Moulavi and colleagues in their work. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable. The DBCV index has been employed for clustering analysis in bioinformatics, ecology, techno-economy, and health informatics , as well as in numerous other fields. == Definition == DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset X = x 1 , x 2 , . . . , x n {\displaystyle X={x_{1},x_{2},...,x_{n}}} , a density-based algorithm partitions it into K clusters C 1 , C 2 , . . . , C K {\displaystyle {C_{1},C_{2},...,C_{K}}} . Each point x i {\displaystyle x_{i}} belongs to a specific cluster, denoted as C c l u s t e r ( x i ) {\displaystyle C_{cluster(x_{i})}} A key concept in DBCV index is the notion of density-connected paths. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory. DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances: Within-cluster density distance measures how closely a point is related to other members of its cluster: a i = 1 | C c l u s t e r ( x i ) | − 1 ∑ x j ∈ C c l u s t e r ( x i ) , y ≠ x d d e n s i t y ( x j , x i ) {\displaystyle a_{i}={\frac {1}{|C_{cluster(x_{i})}|-1}}\sum _{x_{j}\in C_{cluster(x_{i})},y\neq x}d_{density}(x_{j},x_{i})} Nearest-cluster density distance quantifies how far a point is from the closest external cluster: b i = min C ≠ C cluster ( x i ) C ∈ { C 1 , … , C K } ( 1 | C | ∑ x j ∈ C d density ( x i , x j ) ) . {\displaystyle b_{i}=\min _{C\neq C_{{\text{cluster}}(x_{i})} \atop C\in \{C_{1},\dots ,C_{K}\}}\left({\frac {1}{|C|}}\sum _{x_{j}\in C}d_{\text{density}}(x_{i},x_{j})\right).} Using these measures, the DBCV index is computed as: D B C V = 1 n ∑ i = 1 n b i − a i max ( a i , b i ) {\displaystyle DBCV={\frac {1}{n}}\sum _{i=1}^{n}{\frac {b_{i}-a_{i}}{\max(a_{i},b_{i})}}} == Explanation == DBCV index values range between −1 and +1: +1: Strongly cohesive and well-separated clusters. 0: Ambiguous clustering structure. −1: Poorly formed clusters or incorrect assignments. By leveraging density-based distances instead of traditional Euclidean measures, DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.

    Read more →
  • Operational taxonomic unit

    Operational taxonomic unit

    An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "operational taxonomic unit" is simply the group of organisms currently being studied. In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy. Nowadays, however, the term is commonly used in a different context and refers to clusters of (uncultivated or unknown) organisms, grouped by DNA sequence similarity of a specific taxonomic marker gene (originally coined as mOTU; molecular OTU). In other words, OTUs are pragmatic proxies for "species" at different taxonomic levels, in the absence of traditional systems of biological classification as are available for macroscopic organisms. For several years, OTUs have been the most commonly used units of diversity, especially when analysing small subunit 16S (for prokaryotes) or 18S rRNA (for eukaryotes) marker gene sequence datasets. == Molecular OTU by clustering of marker gene sequences == In the approach represented by DNA barcoding, a particular locus is chosen to be used as the marker gene for classification. This locus should be universally present in the scope selected, variable enough to be different among close-related species, and be flanked by conservative sequences that allow for easy amplification and detection. There are databases containing sequences for such marker genes from many different species, allowing for comparison. (Sometimes only using one locus does not provide sufficient resolution, so multiple marker genes are used. This is the case for plants, where rbcL+matK is common.) Sequences obtained this way can be clustered according to their similarity to one another, and operational taxonomic units are defined based on the similarity threshold set by the researcher. The exact threshold depends on the taxa in question and the mutational rates of the selected locus in the taxon. 97–99% are commonly used, but "it is now recognized to be somewhat arbitrary as sequence variation within and among species varies across taxa". 100% similarity (fully identical) is also common, also known as single variants. It remains debatable how well this commonly used method recapitulates true microbial species phylogeny or ecology. Although OTUs can be calculated differently when using different algorithms or thresholds, research by Schmidt et al. (2014) demonstrated that 16S-derived microbial OTUs were generally ecologically consistent across habitats and several clustering approaches. The number of OTUs defined may be inflated due to errors in DNA sequencing. === OTU clustering approaches === There are three main approaches to clustering OTUs: De novo, for which the clustering is based on similarities between sequencing reads. Closed-reference, for which the clustering is performed against a reference database of sequences. Open-reference, where clustering is first performed against a reference database of sequences, then any remaining sequences that could not be mapped to the reference are clustered de novo. Using a reference provides taxonomic context for the OTUs found. Alternatively, taxonomic context can be found after the construction of clusters by comparing representative sequences from clusters against a reference database. There are also specialized classifiers for this purpose which are much faster than naive comparison using BLAST. === OTU clustering algorithms === Hierarchical clustering algorithms (HCA): uclust & cd-hit & ESPRIT Bayesian clustering: CROP == Molecular OTU by other methods == In addition to similarity-based grouping, marker gene sequences can be sorted into OTUs using molecular phylogeny, k-mer composition, or hybrid methods combining these methods with similarity. There are also Bayesian tree-less methods and machine learning approaches. Using phylogeny often involves manually assigning terminal clades or single nodes to an OTU, so this is usually only done for refinement. Genome skimming can be used to obtain high-copy DNA without the need to choose marker genes or to design PCR primers for the chosen genes. It can provide fairly good coverage of organelle DNA and repetitive elements such as ribosomal DNA, both of which can be used like marker genes in OTU analysis. Whole-genome sequencing is more expensive and involves the production and processing of more data. By considering the entire genome, many (sometimes over 100) marker genes can be used at the same time, producing highly resolved phylogenies that correctly identify problematic taxa. It is also possible to use entire genomes for OTU assignment. For example, genomes from different bacterial species almost always have an average nucleotide identity lower than 95%, a fact that can be used to define new OTUs (and likely new species).

    Read more →
  • Cloud-based integration

    Cloud-based integration

    Cloud-based integration is a form of systems integration business delivered as a cloud computing service that addresses data, process, service-oriented architecture (SOA) and application integration. == Description == Integration platform as a service (iPaaS) is a suite of cloud services enabling customers to develop, execute and govern integration flows between disparate applications. Under the cloud-based iPaaS integration model, customers drive the development and deployment of integrations without installing or managing any hardware or middleware. The iPaaS model allows businesses to achieve integration without big investment into skills or licensed middleware software. iPaaS used to be regarded primarily as an integration tool for cloud-based software applications, used mainly by small to mid-sized business. Over time, a hybrid type of iPaaS—hybrid-IT iPaaS—that connects cloud to on-premises, is becoming increasingly popular. Additionally, large enterprises are exploring new ways of integrating iPaaS into their existing IT infrastructures. Cloud integration was created to break down the data silos, improve connectivity and optimize the business process. Cloud integration has increased in popularity as the usage of Software as a Service solutions has grown. Prior to the emergence of cloud computing in the early 2000s, integration could be categorized as either internal or business to business (B2B). Internal integration requirements were serviced through an on-premises middleware platform and typically utilized a service bus to manage exchange of data between systems. B2B integration was serviced through EDI gateways or value-added network (VAN). The advent of SaaS applications created a new kind of demand which was met through cloud-based integration. Since their emergence, many such services have also developed the capability to integrate legacy or on-premises applications, as well as function as EDI gateways. The following essential features were proposed by one marketing company: Deployed on a multi-tenant, elastic cloud infrastructure Subscription model pricing (operating expense, not capital expenditure) No software development (required connectors should already be available) Users do not perform deployment or manage the platform itself Presence of integration management and monitoring features The emergence of this sector led to new cloud-based business process management tools that do not need to build integration layers - since those are now a separate service. Drivers of growth include the need to integrate mobile app capabilities with proliferating API publishing resources and the growth in demand for the Internet of things functionalities as more 'things' connect to the Internet.

    Read more →
  • Variational message passing

    Variational message passing

    Variational message passing (VMP) is an approximate inference technique for continuous- or discrete-valued Bayesian networks, with conjugate-exponential parents, developed by John Winn. VMP was developed as a means of generalizing the approximate variational methods used by such techniques as latent Dirichlet allocation, and works by updating an approximate distribution at each node through messages in the node's Markov blanket. == Likelihood lower bound == Given some set of hidden variables H {\displaystyle H} and observed variables V {\displaystyle V} , the goal of approximate inference is to maximize a lower-bound on the probability that a graphical model is in the configuration V {\displaystyle V} . Over some probability distribution Q {\displaystyle Q} (to be defined later), ln ⁡ P ( V ) = ∑ H Q ( H ) ln ⁡ P ( H , V ) P ( H | V ) = ∑ H Q ( H ) [ ln ⁡ P ( H , V ) Q ( H ) − ln ⁡ P ( H | V ) Q ( H ) ] {\displaystyle \ln P(V)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{P(H|V)}}=\sum _{H}Q(H){\Bigg [}\ln {\frac {P(H,V)}{Q(H)}}-\ln {\frac {P(H|V)}{Q(H)}}{\Bigg ]}} . So, if we define our lower bound to be L ( Q ) = ∑ H Q ( H ) ln ⁡ P ( H , V ) Q ( H ) {\displaystyle L(Q)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{Q(H)}}} , then the likelihood is simply this bound plus the relative entropy between P {\displaystyle P} and Q {\displaystyle Q} . Because the relative entropy is non-negative, the function L {\displaystyle L} defined above is indeed a lower bound of the log likelihood of our observation V {\displaystyle V} . The distribution Q {\displaystyle Q} will have a simpler character than that of P {\displaystyle P} because marginalizing over P {\displaystyle P} is intractable for all but the simplest of graphical models. In particular, VMP uses a factorized distribution Q ( H ) = ∏ i Q i ( H i ) , {\displaystyle Q(H)=\prod _{i}Q_{i}(H_{i}),} where H i {\displaystyle H_{i}} is a disjoint part of the graphical model. == Determining the update rule == The likelihood estimate needs to be as large as possible; because it's a lower bound, getting closer log ⁡ P {\displaystyle \log P} improves the approximation of the log likelihood. By substituting in the factorized version of Q {\displaystyle Q} , L ( Q ) {\displaystyle L(Q)} , parameterized over the hidden nodes H i {\displaystyle H_{i}} as above, is simply the negative relative entropy between Q j {\displaystyle Q_{j}} and Q j ∗ {\displaystyle Q_{j}^{}} plus other terms independent of Q j {\displaystyle Q_{j}} if Q j ∗ {\displaystyle Q_{j}^{}} is defined as Q j ∗ ( H j ) = 1 Z e E − j { ln ⁡ P ( H , V ) } {\displaystyle Q_{j}^{}(H_{j})={\frac {1}{Z}}e^{\mathbb {E} _{-j}\{\ln P(H,V)\}}} , where E − j { ln ⁡ P ( H , V ) } {\displaystyle \mathbb {E} _{-j}\{\ln P(H,V)\}} is the expectation over all distributions Q i {\displaystyle Q_{i}} except Q j {\displaystyle Q_{j}} . Thus, if we set Q j {\displaystyle Q_{j}} to be Q j ∗ {\displaystyle Q_{j}^{}} , the bound L {\displaystyle L} is maximized. == Messages in variational message passing == Parents send their children the expectation of their sufficient statistic while children send their parents their natural parameter, which also requires messages to be sent from the co-parents of the node. == Relationship to exponential families == Because all nodes in VMP come from exponential families and all parents of nodes are conjugate to their children nodes, the expectation of the sufficient statistic can be computed from the normalization factor. == VMP algorithm == The algorithm begins by computing the expected value of the sufficient statistics for that vector. Then, until the likelihood converges to a stable value (this is usually accomplished by setting a small threshold value and running the algorithm until it increases by less than that threshold value), do the following at each node: Get all messages from parents. Get all messages from children (this might require the children to get messages from the co-parents). Compute the expected value of the nodes sufficient statistics. == Constraints == Because every child must be conjugate to its parent, this has limited the types of distributions that can be used in the model. For example, the parents of a Gaussian distribution must be a Gaussian distribution (corresponding to the Mean) and a gamma distribution (corresponding to the precision, or one over σ {\displaystyle \sigma } in more common parameterizations). Discrete variables can have Dirichlet parents, and Poisson and exponential nodes must have gamma parents. More recently, VMP has been extended to handle models that violate this conditional conjugacy constraint. == Literature == John Winn; Christopher M. Bishop (2005). "Variational Message Passing" (PDF). Journal of Machine Learning Research. 6: 661–694. ISSN 1533-7928. Wikidata Q139488859. Beal, M.J. (2003). Variational Algorithms for Approximate Bayesian Inference (PDF) (PhD). Gatsby Computational Neuroscience Unit, University College London. Archived from the original (PDF) on 2005-04-28. Retrieved 2007-02-15.

    Read more →
  • Density-based clustering validation

    Density-based clustering validation

    Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like DBSCAN, Mean shift, and OPTICS. This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations. Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence. This metric was introduced in 2014 by David Moulavi and colleagues in their work. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable. The DBCV index has been employed for clustering analysis in bioinformatics, ecology, techno-economy, and health informatics , as well as in numerous other fields. == Definition == DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset X = x 1 , x 2 , . . . , x n {\displaystyle X={x_{1},x_{2},...,x_{n}}} , a density-based algorithm partitions it into K clusters C 1 , C 2 , . . . , C K {\displaystyle {C_{1},C_{2},...,C_{K}}} . Each point x i {\displaystyle x_{i}} belongs to a specific cluster, denoted as C c l u s t e r ( x i ) {\displaystyle C_{cluster(x_{i})}} A key concept in DBCV index is the notion of density-connected paths. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory. DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances: Within-cluster density distance measures how closely a point is related to other members of its cluster: a i = 1 | C c l u s t e r ( x i ) | − 1 ∑ x j ∈ C c l u s t e r ( x i ) , y ≠ x d d e n s i t y ( x j , x i ) {\displaystyle a_{i}={\frac {1}{|C_{cluster(x_{i})}|-1}}\sum _{x_{j}\in C_{cluster(x_{i})},y\neq x}d_{density}(x_{j},x_{i})} Nearest-cluster density distance quantifies how far a point is from the closest external cluster: b i = min C ≠ C cluster ( x i ) C ∈ { C 1 , … , C K } ( 1 | C | ∑ x j ∈ C d density ( x i , x j ) ) . {\displaystyle b_{i}=\min _{C\neq C_{{\text{cluster}}(x_{i})} \atop C\in \{C_{1},\dots ,C_{K}\}}\left({\frac {1}{|C|}}\sum _{x_{j}\in C}d_{\text{density}}(x_{i},x_{j})\right).} Using these measures, the DBCV index is computed as: D B C V = 1 n ∑ i = 1 n b i − a i max ( a i , b i ) {\displaystyle DBCV={\frac {1}{n}}\sum _{i=1}^{n}{\frac {b_{i}-a_{i}}{\max(a_{i},b_{i})}}} == Explanation == DBCV index values range between −1 and +1: +1: Strongly cohesive and well-separated clusters. 0: Ambiguous clustering structure. −1: Poorly formed clusters or incorrect assignments. By leveraging density-based distances instead of traditional Euclidean measures, DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.

    Read more →
  • Conference on Computer Vision and Pattern Recognition

    Conference on Computer Vision and Pattern Recognition

    The Conference on Computer Vision and Pattern Recognition is an annual conference on computer vision and pattern recognition. == Affiliations == The conference was first held in 1983 in Washington, DC, organized by Takeo Kanade and Dana H. Ballard. From 1985 to 2010 it was sponsored by the IEEE Computer Society. In 2011 it was also co-sponsored by University of Colorado Colorado Springs. Since 2012 it has been co-sponsored by the IEEE Computer Society and the Computer Vision Foundation, which provides open access to the conference papers. == Scope == The conference considers a wide range of topics related to computer vision and pattern recognition—basically any topic that is extracting structures or answers from images or video or applying mathematical methods to data to extract or recognize patterns. Common topics include object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning. The conference generally has less than 30% acceptance rates for all papers and less than 5% for oral presentations. It is managed by a rotating group of volunteers who are chosen in a public election at the Pattern Analysis and Machine Intelligence-Technical Community (PAMI-TC) meeting four years before the meeting. The conference uses a multi-tier double-blind peer review process. The program chairs, who cannot submit papers, select area chairs who manage the reviewers for their subset of submissions. == Location and time == The conference is usually held in June in North America. == Awards == === Best Paper Award === These awards are picked by committees delegated by the program chairs of the conference. === Longuet-Higgins Prize === The Longuet-Higgins Prize recognizes papers from ten years ago that have made a significant impact on computer vision research. === PAMI Young Researcher Award === The Pattern Analysis and Machine Intelligence Young Researcher Award is an award given by the Technical Committee on Pattern Analysis and Machine Intelligence of the IEEE Computer Society to a researcher within 7 years of completing their Ph.D. for outstanding early career research contributions. Candidates are nominated by the computer vision community, with winners selected by a committee of senior researchers in the field. This award was originally instituted in 2012 by the journal Image and Vision Computing, also presented at the conference, and the journal continues to sponsor the award. === PAMI Thomas S. Huang Memorial Prize === The Thomas Huang Memorial Prize was established at the 2020 conference and is awarded annually starting from 2021 to honor researchers who are recognized as examples in research, teaching/mentoring, and service to the computer vision community.

    Read more →
  • International Clinical Trials Registry Platform

    International Clinical Trials Registry Platform

    The International Clinical Trials Registry Platform (ICTRP) is a platform for the registration of clinical trials operated by the World Health Organization. The ICTRP combines data from multiple cooperating clinical trials registries to generate a global view of clinical trials worldwide, with a search portal that allows access to the entire dataset. It requires a minimum standard set of database fields, the WHO Trial Registration Data Set, to be present for a trial to be registered. All entries are given a Universal Trial Number (UTN) that identifies them uniquely. The organization has sought to assist various national governments in establishing their own clinical trials databases. It combines data from the following primary registries and data providers: Australian New Zealand Clinical Trials Registry (ANZCTR) Brazilian Clinical Trials Registry (ReBec) Chinese Clinical Trial Registry (ChiCTR) Clinical Research Information Service (CRiS), Republic of Korea ClinicalTrials.gov Clinical Trials Information System (CTIS), European Medicines Agency Clinical Trials Registry - India (CTRI) Cuban Public Registry of Clinical Trials (RPCEC) EU Clinical Trials Register (EU-CTR) German Clinical Trials Register (DRKS) Iranian Registry of Clinical Trials (IRCT) ISRCTN (UK) International Traditional Medicine Clinical Trial Registry (ITMCTR) Japan Registry of Clinical Trials (jRCT) Japan Primary Registries Network (JPRN) Lebanese Clinical Trials Registry (LBCTR) Overview of Medical Research in the Netherlands (OMON) Thai Clinical Trials Registry (TCTR) Pan African Clinical Trial Registry (PACTR) Peruvian Clinical Trial Registry (REPEC) Sri Lanka Clinical Trials Registry (SLCTR)

    Read more →
  • Sliced inverse regression

    Sliced inverse regression

    Sliced inverse regression (SIR) is a tool for dimensionality reduction in the field of multivariate statistics. In statistics, regression analysis is a method of studying the relationship between a response variable y and its input variable x _ {\displaystyle {\underline {x}}} , which is a p-dimensional vector. There are several approaches in the category of regression. For example, parametric methods include multiple linear regression, and non-parametric methods include local smoothing. As the number of observations needed to use local smoothing methods scales exponentially with high-dimensional data (as p grows), reducing the number of dimensions can make the operation computable. Dimensionality reduction aims to achieve this by showing only the most important dimension of the data. SIR uses the inverse regression curve, E ( x _ | y ) {\displaystyle E({\underline {x}}\,|\,y)} , to perform a weighted principal component analysis. == Model == Given a response variable Y {\displaystyle \,Y} and a (random) vector X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} of explanatory variables, SIR is based on the model Y = f ( β 1 ⊤ X , … , β k ⊤ X , ε ) ( 1 ) {\displaystyle Y=f(\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X,\varepsilon )\quad \quad \quad \quad \quad (1)} where β 1 , … , β k {\displaystyle \beta _{1},\ldots ,\beta _{k}} are unknown projection vectors, k {\displaystyle \,k} is an unknown number smaller than p {\displaystyle \,p} , f {\displaystyle \;f} is an unknown function on R k + 1 {\displaystyle \mathbb {R} ^{k+1}} as it only depends on k {\displaystyle \,k} arguments, and ε {\displaystyle \varepsilon } is a random variable representing error with E [ ε | X ] = 0 {\displaystyle E[\varepsilon |X]=0} and a finite variance of σ 2 {\displaystyle \sigma ^{2}} . The model describes an ideal solution, where Y {\displaystyle \,Y} depends on X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} only through a k {\displaystyle \,k} dimensional subspace; i.e., one can reduce the dimension of the explanatory variables from p {\displaystyle \,p} to a smaller number k {\displaystyle \,k} without losing any information. An equivalent version of ( 1 ) {\displaystyle \,(1)} is: the conditional distribution of Y {\displaystyle \,Y} given X {\displaystyle \,X} depends on X {\displaystyle \,X} only through the k {\displaystyle \,k} dimensional random vector ( β 1 ⊤ X , … , β k ⊤ X ) {\displaystyle (\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X)} . It is assumed that this reduced vector is as informative as the original X {\displaystyle \,X} in explaining Y {\displaystyle \,Y} . The unknown β i ′ s {\displaystyle \,\beta _{i}'s} are called the effective dimension reducing directions (EDR-directions). The space that is spanned by these vectors is denoted by the effective dimension reducing space (EDR-space). == Relevant linear algebra background == Given a _ 1 , … , a _ r ∈ R n {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}\in \mathbb {R} ^{n}} , then V := L ( a _ 1 , … , a _ r ) {\displaystyle V:=L({\underline {a}}_{1},\ldots ,{\underline {a}}_{r})} , the set of all linear combinations of these vectors is called a linear subspace and is therefore a vector space. The equation says that vectors a _ 1 , … , a _ r {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}} span V {\displaystyle \,V} , but the vectors that span space V {\displaystyle \,V} are not unique. The dimension of V ( ∈ R n ) {\displaystyle \,V(\in \mathbb {R} ^{n})} is equal to the maximum number of linearly independent vectors in V {\displaystyle \,V} . A set of n {\displaystyle \,n} linear independent vectors of R n {\displaystyle \mathbb {R} ^{n}} makes up a basis of R n {\displaystyle \mathbb {R} ^{n}} . The dimension of a vector space is unique, but the basis itself is not. Several bases can span the same space. Dependent vectors can still span a space, but the linear combinations of the latter are only suitable to a set of vectors lying on a straight line. == Inverse regression == Computing the inverse regression curve (IR) means instead of looking for E [ Y | X = x ] {\displaystyle \,E[Y|X=x]} , which is a curve in R p {\displaystyle \mathbb {R} ^{p}} it is actually E [ X | Y = y ] {\displaystyle \,E[X|Y=y]} , which is also a curve in R p {\displaystyle \mathbb {R} ^{p}} , but consisting of p {\displaystyle \,p} one-dimensional regressions. The center of the inverse regression curve is located at E [ E [ X | Y ] ] = E [ X ] {\displaystyle \,E[E[X|Y]]=E[X]} . Therefore, the centered inverse regression curve is E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} which is a p {\displaystyle \,p} dimensional curve in R p {\displaystyle \mathbb {R} ^{p}} . == Inverse regression versus dimension reduction == The centered inverse regression curve lies on a k {\displaystyle \,k} -dimensional subspace spanned by Σ x x β i ′ s {\displaystyle \,\Sigma _{xx}\beta _{i}\,'s} . This is a connection between the model and inverse regression. Given this condition and ( 1 ) {\displaystyle \,(1)} , the centered inverse regression curve E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} is contained in the linear subspace spanned by Σ x x β k ( k = 1 , … , K ) {\displaystyle \,\Sigma _{xx}\beta _{k}(k=1,\ldots ,K)} , where Σ x x = C o v ( X ) {\displaystyle \,\Sigma _{xx}=Cov(X)} . == Estimation of the EDR-directions == After having had a look at all the theoretical properties, the aim now is to estimate the EDR-directions. For that purpose, weighted principal component analyses are needed. If the sample means m ^ h ′ s {\displaystyle \,{\hat {m}}_{h}\,'s} , X {\displaystyle \,X} would have been standardized to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} . Corresponding to the theorem above, the IR-curve m 1 ( y ) = E [ Z | Y = y ] {\displaystyle \,m_{1}(y)=E[Z|Y=y]} lies in the space spanned by ( η 1 , … , η k ) {\displaystyle \,(\eta _{1},\ldots ,\eta _{k})} , where η i = Σ x x 1 / 2 β i {\displaystyle \,\eta _{i}=\Sigma _{xx}^{1/2}\beta _{i}} . As a consequence, the covariance matrix c o v [ E [ Z | Y ] ] {\displaystyle \,cov[E[Z|Y]]} is degenerate in any direction orthogonal to the η i ′ s {\displaystyle \,\eta _{i}\,'s} . Therefore, the eigenvectors η k ( k = 1 , … , K ) {\displaystyle \,\eta _{k}(k=1,\ldots ,K)} associated with the largest K {\displaystyle \,K} eigenvalues are the standardized EDR-directions. == Algorithm == === SIR algorithm === The algorithm from Li, K-C. (1991) to estimate the EDR-directions via SIR is as follows. 1. Let Σ x x {\displaystyle \,\Sigma _{xx}} be the covariance matrix of X {\displaystyle \,X} . Standardize X {\displaystyle \,X} to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} ( 1 ) {\displaystyle \,(1)} can also be rewritten as Y = f ( η 1 ⊤ Z , … , η k ⊤ Z , ε ) {\displaystyle Y=f(\eta _{1}^{\top }Z,\ldots ,\eta _{k}^{\top }Z,\varepsilon )} where η k = β k Σ x x 1 / 2 ∀ k {\displaystyle \,\eta _{k}=\beta _{k}\Sigma _{xx}^{1/2}\quad \forall \;k} .) 2. Divide the range of y i {\displaystyle \,y_{i}} into S {\displaystyle \,S} non-overlapping slices H s ( s = 1 , … , S ) . n s {\displaystyle \,H_{s}(s=1,\ldots ,S).\;n_{s}} is the number of observations within each slice and I H s {\displaystyle \,I_{H_{s}}} is the indicator function for the slice: n s = ∑ i = 1 n I H s ( y i ) {\displaystyle n_{s}=\sum _{i=1}^{n}I_{H_{s}}(y_{i})} 3. Compute the mean of z i {\displaystyle \,z_{i}} over all slices, which is a crude estimate m ^ 1 {\displaystyle \,{\hat {m}}_{1}} of the inverse regression curve m 1 {\displaystyle \,m_{1}} : z ¯ s = n s − 1 ∑ i = 1 n z i I H s ( y i ) {\displaystyle \,{\bar {z}}_{s}=n_{s}^{-1}\sum _{i=1}^{n}z_{i}I_{H_{s}}(y_{i})} 4. Calculate the estimate for C o v { m 1 ( y ) } {\displaystyle \,Cov\{m_{1}(y)\}} : V ^ = n − 1 ∑ i = 1 S n s z ¯ s z ¯ s ⊤ {\displaystyle \,{\hat {V}}=n^{-1}\sum _{i=1}^{S}n_{s}{\bar {z}}_{s}{\bar {z}}_{s}^{\top }} 5. Identify the eigenvalues λ ^ i {\displaystyle \,{\hat {\lambda }}_{i}} and the eigenvectors η ^ i {\displaystyle \,{\hat {\eta }}_{i}} of V ^ {\displaystyle \,{\hat {V}}} , which are the standardized EDR-directions. 6. Transform the standardized EDR-directions back to the original scale. The estimates for the EDR-directions are given by: β ^ i = Σ ^ x x − 1 / 2 η ^ i {\displaystyle \,{\hat {\beta }}_{i}={\hat {\Sigma }}_{xx}^{-1/2}{\hat {\eta }}_{i}} (which are not necessarily orthogonal.)

    Read more →
  • FastICA

    FastICA

    FastICA is an efficient and popular algorithm for independent component analysis invented by Aapo Hyvärinen at Helsinki University of Technology. Like most ICA algorithms, FastICA seeks an orthogonal rotation of prewhitened data, through a fixed-point iteration scheme, that maximizes a measure of non-Gaussianity of the rotated components. Non-gaussianity serves as a proxy for statistical independence, which is a very strong condition and requires infinite data to verify. FastICA can also be alternatively derived as an approximative Newton iteration. == Algorithm == === Prewhitening the data === Let the X := ( x i j ) ∈ R N × M {\displaystyle \mathbf {X} :=(x_{ij})\in \mathbb {R} ^{N\times M}} denote the input data matrix, M {\displaystyle M} the number of columns corresponding with the number of samples of mixed signals and N {\displaystyle N} the number of rows corresponding with the number of independent source signals. The input data matrix X {\displaystyle \mathbf {X} } must be prewhitened, or centered and whitened, before applying the FastICA algorithm to it. Centering the data entails demeaning each component of the input data X {\displaystyle \mathbf {X} } , that is, for each i = 1 , … , N {\displaystyle i=1,\ldots ,N} and j = 1 , … , M {\displaystyle j=1,\ldots ,M} . After centering, each row of X {\displaystyle \mathbf {X} } has an expected value of 0 {\displaystyle 0} . Whitening the data requires a linear transformation L : R N × M → R N × M {\displaystyle \mathbf {L} :\mathbb {R} ^{N\times M}\to \mathbb {R} ^{N\times M}} of the centered data so that the components of L ( X ) {\displaystyle \mathbf {L} (\mathbf {X} )} are uncorrelated and have variance one. More precisely, if X {\displaystyle \mathbf {X} } is a centered data matrix, the covariance of L x := L ( X ) {\displaystyle \mathbf {L} _{\mathbf {x} }:=\mathbf {L} (\mathbf {X} )} is the ( N × N ) {\displaystyle (N\times N)} -dimensional identity matrix, that is, A common method for whitening is by performing an eigenvalue decomposition on the covariance matrix of the centered data X {\displaystyle \mathbf {X} } , E { X X T } = E D E T {\displaystyle E\left\{\mathbf {X} \mathbf {X} ^{T}\right\}=\mathbf {E} \mathbf {D} \mathbf {E} ^{T}} , where E {\displaystyle \mathbf {E} } is the matrix of eigenvectors and D {\displaystyle \mathbf {D} } is the diagonal matrix of eigenvalues. The whitened data matrix is defined thus by === Single component extraction === The iterative algorithm finds the direction for the weight vector w ∈ R N {\displaystyle \mathbf {w} \in \mathbb {R} ^{N}} that maximizes a measure of non-Gaussianity of the projection w T X {\displaystyle \mathbf {w} ^{T}\mathbf {X} } , with X ∈ R N × M {\displaystyle \mathbf {X} \in \mathbb {R} ^{N\times M}} denoting a prewhitened data matrix as described above. Note that w {\displaystyle \mathbf {w} } is a column vector. To measure non-Gaussianity, FastICA relies on a nonquadratic nonlinear function f ( u ) {\displaystyle f(u)} , its first derivative g ( u ) {\displaystyle g(u)} , and its second derivative g ′ ( u ) {\displaystyle g^{\prime }(u)} . Hyvärinen states that the functions are useful for general purposes, while may be highly robust. The steps for extracting the weight vector w {\displaystyle \mathbf {w} } for single component in FastICA are the following: Randomize the initial weight vector w {\displaystyle \mathbf {w} } Let w + ← E { X g ( w T X ) T } − E { g ′ ( w T X ) } w {\displaystyle \mathbf {w} ^{+}\leftarrow E\left\{\mathbf {X} g(\mathbf {w} ^{T}\mathbf {X} )^{T}\right\}-E\left\{g'(\mathbf {w} ^{T}\mathbf {X} )\right\}\mathbf {w} } , where E { . . . } {\displaystyle E\left\{...\right\}} means averaging over all column-vectors of matrix X {\displaystyle \mathbf {X} } Let w ← w + / ‖ w + ‖ {\displaystyle \mathbf {w} \leftarrow \mathbf {w} ^{+}/\|\mathbf {w} ^{+}\|} If not converged, go back to 2 === Multiple component extraction === The single unit iterative algorithm estimates only one weight vector which extracts a single component. Estimating additional components that are mutually "independent" requires repeating the algorithm to obtain linearly independent projection vectors - note that the notion of independence here refers to maximizing non-Gaussianity in the estimated components. Hyvärinen provides several ways of extracting multiple components with the simplest being the following. Here, 1 M {\displaystyle \mathbf {1_{M}} } is a column vector of 1's of dimension M {\displaystyle M} . Algorithm FastICA Input: C {\displaystyle C} Number of desired components Input: X ∈ R N × M {\displaystyle \mathbf {X} \in \mathbb {R} ^{N\times M}} Prewhitened matrix, where each column represents an N {\displaystyle N} -dimensional sample, where C <= N {\displaystyle C<=N} Output: W ∈ R N × C {\displaystyle \mathbf {W} \in \mathbb {R} ^{N\times C}} Un-mixing matrix where each column projects X {\displaystyle \mathbf {X} } onto independent component. Output: S ∈ R C × M {\displaystyle \mathbf {S} \in \mathbb {R} ^{C\times M}} Independent components matrix, with M {\displaystyle M} columns representing a sample with C {\displaystyle C} dimensions. for p in 1 to C: w p ← {\displaystyle \mathbf {w_{p}} \leftarrow } Random vector of length N while w p {\displaystyle \mathbf {w_{p}} } changes w p ← 1 M X g ( w p T X ) T − 1 M g ′ ( w p T X ) 1 M w p {\displaystyle \mathbf {w_{p}} \leftarrow {\frac {1}{M}}\mathbf {X} g(\mathbf {w_{p}} ^{T}\mathbf {X} )^{T}-{\frac {1}{M}}g'(\mathbf {w_{p}} ^{T}\mathbf {X} )\mathbf {1_{M}} \mathbf {w_{p}} } w p ← w p − ∑ j = 1 p − 1 ( w p T w j ) w j {\displaystyle \mathbf {w_{p}} \leftarrow \mathbf {w_{p}} -\sum _{j=1}^{p-1}(\mathbf {w_{p}} ^{T}\mathbf {w_{j}} )\mathbf {w_{j}} } w p ← w p ‖ w p ‖ {\displaystyle \mathbf {w_{p}} \leftarrow {\frac {\mathbf {w_{p}} }{\|\mathbf {w_{p}} \|}}} output W ← [ w 1 , … , w C ] {\displaystyle \mathbf {W} \leftarrow {\begin{bmatrix}\mathbf {w_{1}} ,\dots ,\mathbf {w_{C}} \end{bmatrix}}} output S ← W T X {\displaystyle \mathbf {S} \leftarrow \mathbf {W^{T}} \mathbf {X} }

    Read more →