AI For Business Georgia Tech

AI For Business Georgia Tech — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Socially assistive robot


    A socially assistive robot (SAR) aids users through social engagement and support rather than through physical tasks and interactions.

    == Background ==

    The field of socially assistive robotics emerged in the early 2000s, following the emergence of the field of social robots. In contrast to social robots, SARs aid users with specific goals related to behavior change rather than serving as purely social entities. The term "socially assistive robot" was initially defined by Maja Matarić and David Feil-Seifer in 2005. Since its inception, the field has gained substantial recognition, featuring numerous research projects, a wealth of global research publications, startup companies, and a growing array of products on the consumer market. The COVID-19 pandemic has underscored the immense potential of socially assistive robots, particularly in addressing the needs of large user populations, including children engaged in remote learning, elderly individuals grappling with loneliness, and those affected by social isolation and its associated negative consequences.

    == Characteristics of interaction ==

    SARs rely on artificial intelligence (AI) to generate real-time, responsive, natural, and meaningful robot behaviors during interactions with humans. The robots employ various forms of communication, such as facial expressions, gestures, body movements, and speech. In contrast to robots intended for physical tasks, SARs are designed to support and motivate users to perform their own tasks. The tasks a user engages in can be physical (e.g., rehabilitation exercises for post-stroke users), cognitive (e.g., dementia screening for elderly users), or social (e.g., turn-taking for users with autism spectrum disorders). This complex interaction involves detecting and interpreting the user's movement, behavior, intent, goals, speech, and preferences.
    Machine learning and robot learning techniques are frequently employed to enhance the robot's understanding of the user, predict user preferences, and provide effective assistance.

    The effectiveness of socially assistive robots is assessed based on objective measurements of user performance and improvement resulting from the robot's assistance and support. Unlike other branches of robotics, where effectiveness depends on the robot's physical task completion, SAR measures the success of the robot based on the user's progress and achievements. This evaluation is carried out using quantitative objective metrics, such as time spent on tasks, accuracy, retention, and verbalization, as well as quantitative subjective metrics, such as user survey tools.

    SAR is based on the large body of evidence showing that users tend to respond more positively to interactions with physical robots than to interactions with screens. Interaction with physical robots also encourages users to learn and retain more information than screen-based interactions. This insight explains why physical robots are more effective in SAR applications than interactions solely involving screens, tablets, or computers.

    == Uses and applications ==

    SARs have been developed and validated in a wide array of applications, including healthcare, elder care, education, and training. For example, SARs have been developed to support children on the autism spectrum in acquiring and practicing social and cognitive skills, to motivate and coach stroke patients throughout their rehabilitation exercises, to monitor individuals' health (e.g., fall detection), and to encourage elderly users to be more physically and socially active. There is a concern that technophobia and lack of trust in robots will pose a barrier to the effectiveness of SARs in older adults.

  • Accumulated local effects


    Accumulated local effects (ALE) is a machine learning interpretability method.

    == Concepts ==

    ALE uses a conditional feature distribution as an input and generates augmented data, creating more realistic data than a marginal distribution. It ignores far out-of-distribution (outlier) values. Unlike partial dependence plots and marginal plots, ALE is not defeated in the presence of correlated predictors. Rather than averaging the model's predictions over the augmented data, it averages the differences between predictions.

    == Example ==

    Given a model that predicts house prices based on distance from the city center and size of the building area, ALE compares the differences in predictions for houses of different sizes. The result separates the impact of size from otherwise correlated features.

    == Limitations ==

    Defining evaluation windows is subjective. Very high correlations between features can still defeat the technique. ALE requires more observations, and more uniformly distributed ones, than partial dependence plots (PDP) so that the conditional distribution can be reliably determined. The technique may produce inadequate results if the data is highly sparse, which is more common with high-dimensional data (curse of dimensionality).
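The house-price example can be sketched in a few lines of plain Python. This is a minimal illustration of first-order ALE for a single feature, not a reference implementation: the function `ale_1d`, the model `toy_price_model`, its coefficients, and the synthetic `houses` data are all hypothetical, and the windows are simple quantile bins.

```python
from bisect import bisect_left


def ale_1d(predict, rows, feature, n_bins=4):
    """Return (bin_edges, centered ALE values at those edges) for one feature."""
    values = sorted(row[feature] for row in rows)
    # Quantile-based edges, so each window holds a similar number of observations.
    edges = sorted({values[round(q * (len(values) - 1) / n_bins)]
                    for q in range(n_bins + 1)})
    if len(edges) < 2:
        raise ValueError("feature needs at least two distinct values")
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for row in rows:
        # Window k covers (edges[k], edges[k + 1]]; the minimum joins window 0.
        k = max(bisect_left(edges, row[feature]) - 1, 0)
        low, high = list(row), list(row)
        low[feature], high[feature] = edges[k], edges[k + 1]
        # Local effect: prediction difference across the window, other features fixed.
        sums[k] += predict(high) - predict(low)
        counts[k] += 1
    # Accumulate the average local effects window by window.
    ale = [0.0]
    for s, c in zip(sums, counts):
        ale.append(ale[-1] + s / c)
    # Center so the count-weighted mean effect over the data is zero.
    mean = sum(c * (ale[k] + ale[k + 1]) / 2 for k, c in enumerate(counts)) / len(rows)
    return edges, [a - mean for a in ale]


def toy_price_model(row):
    """Hypothetical linear price model: row = [distance_km, size_m2]."""
    distance_km, size_m2 = row
    return 500.0 - 10.0 * distance_km + 3.0 * size_m2


# Synthetic, correlated data: houses farther from the center tend to be larger.
houses = [[float(i), 50.0 + 4.0 * i + (i % 3)] for i in range(20)]
edges, ale = ale_1d(toy_price_model, houses, feature=1)
```

For this linear toy model, the ALE curve for size rises by 3 per unit of size between consecutive edges, which is the model's size coefficient, even though size and distance are strongly correlated in the data.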

  • Relationship square


    In statistics, the relationship square is a graphical representation for use in the factorial analysis of an individuals × variables table. This representation complements the classical representations provided by principal component analysis (PCA) or multiple correspondence analysis (MCA), namely those of individuals, of quantitative variables (correlation circle) and of the categories of qualitative variables (at the centroid of the individuals who possess them). It is especially important in factor analysis of mixed data (FAMD) and in multiple factor analysis (MFA).

    == Definition of relationship square in the MCA frame ==

    The first interest of the relationship square is to represent the variables themselves, not their categories, which is all the more valuable as there are many variables. For this, we calculate for each qualitative variable $j$ and each factor $F_s$ (the rank-$s$ factor $F_s$ is the vector of coordinates of the individuals along the axis of rank $s$; in PCA, $F_s$ is called the principal component of rank $s$) the square of the correlation ratio between $F_s$ and the variable $j$, usually denoted $\eta^2(j, F_s)$. Thus, to each factorial plane, we can associate a representation of the qualitative variables themselves. Their coordinates being between 0 and 1, the variables appear in the square having as vertices the points (0,0), (0,1), (1,0) and (1,1).

    == Example in MCA ==

    Six individuals $(i_1, \ldots, i_6)$ are described by three variables $(q_1, q_2, q_3)$ having respectively 3, 2 and 3 categories. For example, the individual $i_1$ possesses the category $a$ of $q_1$, $d$ of $q_2$ and $f$ of $q_3$. Applied to these data, the MCA function included in the R package FactoMineR produces the classical graph in Figure 1. The relationship square (Figure 2) makes the classic factorial plane easier to read. It indicates that:

      • The first factor is related to the three variables, but especially to $q_3$ (which has a very high coordinate along the first axis) and then to $q_2$.
      • The second factor is related only to $q_1$ and $q_3$ (and not to $q_2$, which has a coordinate along axis 2 equal to 0), and that in a strong and equal manner.

    All this is visible on the classic graphic, but not so clearly. The role of the relationship square is first to assist in reading a conventional graphic. This is valuable when the variables are numerous and possess numerous coordinates.

    == Extensions ==

    This representation may be supplemented with those of quantitative variables, the coordinates of the latter being the squares of correlation coefficients (and not of correlation ratios). Thus, the second advantage of the relationship square lies in the ability to represent quantitative and qualitative variables simultaneously. The relationship square can be constructed from any factorial analysis of an individuals × variables table. In particular, it is (or should be) used systematically: in multiple correspondence analysis (MCA); in principal component analysis (PCA) when there are many supplementary variables; in factor analysis of mixed data (FAMD). An extension of this graphic to groups of variables (how can a group of variables be represented by a single point?) is used in multiple factor analysis (MFA).

    == History ==

    The idea of representing the qualitative variables themselves by a point (and not the categories) is due to Brigitte Escofier. The graphic as it is used now was introduced by Brigitte Escofier and Jérôme Pagès in the framework of multiple factor analysis.

    == Conclusion ==

    In MCA, the relationship square provides a synthetic view of the connections between mixed variables, all the more valuable as there are many variables having many categories. This representation can be useful in any factorial analysis when there are numerous mixed variables, active and/or supplementary.
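The coordinate of a qualitative variable in the relationship square is a squared correlation ratio, which is the share of a factor's variance explained by the variable's categories. The sketch below computes it in plain Python. The function name and the factor scores are hypothetical, since the article's data table and MCA output are not reproduced here; in practice the scores would come from an MCA or FAMD fit.

```python
def correlation_ratio_squared(categories, scores):
    """Squared correlation ratio: between-category variance / total variance."""
    n = len(scores)
    mean = sum(scores) / n
    total = sum((s - mean) ** 2 for s in scores)
    groups = {}
    for category, score in zip(categories, scores):
        groups.setdefault(category, []).append(score)
    between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    return between / total if total else 0.0


# Made-up factor scores for six individuals (not the article's example).
q = ["a", "a", "b", "b", "c", "c"]
factor_1 = [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0]   # fully separated by category
factor_2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # unrelated to category

# Coordinates of variable q in the relationship square for the plane (F1, F2).
point = (correlation_ratio_squared(q, factor_1),
         correlation_ratio_squared(q, factor_2))
```

With these made-up scores the variable lands at the vertex (1, 0): it is entirely tied to the first factor and not at all to the second.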

  • International Conference on Acoustics, Speech, and Signal Processing


    ICASSP, the International Conference on Acoustics, Speech, and Signal Processing, is an annual flagship conference organized by the IEEE Signal Processing Society. Ei Compendex has indexed all papers included in its proceedings. The first ICASSP was held in 1976 in Philadelphia, Pennsylvania, based on the success of a conference in Massachusetts four years earlier that had focused specifically on speech signals. As ranked by Google Scholar's h-index metric in 2016, ICASSP has the highest h-index of any conference in the signal processing field. The Brazilian ministry of education gave the conference an 'A1' rating based on its h-index.

  • Mistral Vibe


    Mistral Vibe, or Vibe (known as Le Chat until May 2026), is a generative artificial intelligence chatbot developed in France by Mistral AI. Mistral Vibe is available on iOS and Android. Its services are operated on a freemium model.

    == History ==

    In February 2024, Mistral AI released Le Chat. In January 2025, Mistral AI made a content deal with Agence France-Presse (AFP) that lets Le Chat query AFP's entire archive dating back to 1983. On 6 February 2025, a mobile app for Le Chat was released for iOS and Android, and a subscription tier, Pro, was introduced at a cost of $14.99 per month. In July 2025, Mistral AI released Voxtral, an open-source language model that understands and generates audio. Mistral introduced a voice mode for chatting that uses Voxtral, as well as projects, which allow grouping chats and files. In September 2025, Le Chat introduced the capability to remember previous conversations. In May 2026, Mistral AI announced the rebrand from Le Chat to Mistral Vibe and introduced new features at the same time.

  • Ordinal regression


    In statistics, ordinal regression, also called ordinal classification, is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. It can be considered an intermediate problem between regression and classification. Examples of ordinal regression are ordered logit and ordered probit. Ordinal regression turns up often in the social sciences, for example in the modeling of human levels of preference (on a scale from, say, 1–5 for "very poor" through "excellent"), as well as in information retrieval. In machine learning, ordinal regression may also be called ranking learning.

    == Linear models for ordinal regression ==

    Ordinal regression can be performed using a generalized linear model (GLM) that fits both a coefficient vector and a set of thresholds to a dataset. Suppose one has a set of observations, represented by length-$p$ vectors $\mathbf{x}_1$ through $\mathbf{x}_n$, with associated responses $y_1$ through $y_n$, where each $y_i$ is an ordinal variable on a scale $1, \dots, K$. For simplicity, and without loss of generality, we assume $y$ is a non-decreasing vector, that is, $y_i \leq y_{i+1}$. To this data, one fits a length-$p$ coefficient vector $\mathbf{w}$ and a set of thresholds $\theta_1, \dots, \theta_{K-1}$ with the property that $\theta_1 < \theta_2 < \dots < \theta_{K-1}$. This set of thresholds divides the real number line into $K$ disjoint segments, corresponding to the $K$ response levels. The model can now be formulated as

    $$\Pr(y \leq i \mid \mathbf{x}) = \sigma(\theta_i - \mathbf{w} \cdot \mathbf{x}),$$

    that is, the cumulative probability of the response $y$ being at most $i$ is given by a function $\sigma$ (the inverse link function) applied to a linear function of $\mathbf{x}$.
    Several choices exist for $\sigma$; the logistic function

    $$\sigma(\theta_i - \mathbf{w} \cdot \mathbf{x}) = \frac{1}{1 + e^{-(\theta_i - \mathbf{w} \cdot \mathbf{x})}}$$

    gives the ordered logit model, while using the CDF of the standard normal distribution gives the ordered probit model. A third option is to use an exponential function

    $$\sigma(\theta_i - \mathbf{w} \cdot \mathbf{x}) = 1 - \exp(-\exp(\theta_i - \mathbf{w} \cdot \mathbf{x}))$$

    which gives the proportional hazards model.

    === Latent variable model ===

    The probit version of the above model can be justified by assuming the existence of a real-valued latent variable (unobserved quantity) $y^*$, determined by

    $$y^* = \mathbf{w} \cdot \mathbf{x} + \varepsilon$$

    where $\varepsilon$ is normally distributed with zero mean and unit variance, conditioned on $\mathbf{x}$. The response variable $y$ results from an "incomplete measurement" of $y^*$, where one only determines the interval into which $y^*$ falls:

    $$y = \begin{cases} 1 & \text{if } y^* \leq \theta_1, \\ 2 & \text{if } \theta_1 < y^* \leq \theta_2, \\ 3 & \text{if } \theta_2 < y^* \leq \theta_3, \\ \vdots \\ K & \text{if } \theta_{K-1} < y^*. \end{cases}$$
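As a small illustration of the ordered logit model, the sketch below turns the cumulative probabilities $\Pr(y \leq i \mid \mathbf{x}) = \sigma(\theta_i - \mathbf{w} \cdot \mathbf{x})$ into one probability per response level by differencing. The function name and all numeric values are assumptions chosen for illustration; fitting $\mathbf{w}$ and the thresholds is not shown.

```python
import math


def ordered_logit_probs(x, w, thresholds):
    """Return [Pr(y = 1 | x), ..., Pr(y = K | x)] for K = len(thresholds) + 1."""
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    score = sum(wi * xi for wi, xi in zip(w, x))  # w . x
    # Cumulative probabilities Pr(y <= i | x) = sigma(theta_i - w . x); Pr(y <= K) = 1.
    cumulative = [1.0 / (1.0 + math.exp(-(t - score))) for t in thresholds] + [1.0]
    # Differences of consecutive cumulative probabilities give the level probabilities.
    return [cumulative[0]] + [cumulative[i] - cumulative[i - 1]
                              for i in range(1, len(cumulative))]


# Illustrative values only: two features, K = 4 response levels.
probs = ordered_logit_probs(x=[1.0, 2.0], w=[0.8, -0.5], thresholds=[-1.0, 0.5, 2.0])
```

Because the thresholds are strictly increasing and the logistic function is increasing, every level receives a positive probability and the probabilities sum to 1.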

  • Softmax function


    The softmax function, also known as softargmax or normalized exponential function, converts a tuple of $K$ real numbers into a probability distribution over $K$ possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

    == Definition ==

    The softmax function takes as input a tuple $\mathbf{z}$ of $K$ real numbers and normalizes it into a probability distribution consisting of $K$ probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval $(0, 1)$, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, larger input components correspond to larger probabilities.

    Formally, the standard (unit) softmax function $\sigma : \mathbb{R}^K \to (0,1)^K$, where $K > 1$, takes a tuple $\mathbf{z} = (z_1, \dotsc, z_K) \in \mathbb{R}^K$ and computes each component of the vector $\sigma(\mathbf{z}) \in (0,1)^K$ as

    $$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.$$

    In words, the softmax applies the standard exponential function to each element $z_i$ of the input tuple $\mathbf{z}$ (consisting of $K$ real numbers), and normalizes these values by dividing by the sum of all these exponentials.
    The normalization ensures that the sum of the components of the output vector $\sigma(\mathbf{z})$ is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input tuple. For example, the standard softmax of $(1, 2, 8)$ is approximately $(0.001, 0.002, 0.997)$, which amounts to assigning almost all of the total unit weight in the result to the position of the tuple's maximal element (of 8).

    In general, instead of $e$ a different base $b > 0$ can be used. As above, if $b > 1$ then larger input components will result in larger output probabilities, and increasing the value of $b$ will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if $0 < b < 1$ then smaller input components will result in larger output probabilities, and decreasing the value of $b$ will create probability distributions that are more concentrated around the positions of the smallest input values. Writing $b = e^{\beta}$ or $b = e^{-\beta}$ (for real $\beta$) yields the expressions

    $$\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^{K} e^{\beta z_j}} \quad \text{or} \quad \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^{K} e^{-\beta z_j}} \quad \text{for } i = 1, \dotsc, K.$$

    A value proportional to the reciprocal of $\beta$ is sometimes referred to as the temperature: $\beta = 1/kT$, where $k$ is typically 1 or the Boltzmann constant and $T$ is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.
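The definition and the effect of the temperature can be checked numerically with a short sketch. The function name and the `beta` parameter are illustrative choices, with $k = 1$ so that $\beta = 1/T$; subtracting the largest scaled input before exponentiating is a common numerical safeguard and is not part of the definition.

```python
import math


def softmax(z, beta=1.0):
    """Softmax with inverse temperature beta (beta = 1 is the standard softmax)."""
    scaled = [beta * v for v in z]
    shift = max(scaled)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


standard = softmax([1, 2, 8])
hot = softmax([1, 2, 8], beta=0.1)    # temperature T = 10: closer to uniform
cold = softmax([1, 2, 8], beta=10.0)  # temperature T = 0.1: sharper
```

Rounded to three decimals, `standard` reproduces the (0.001, 0.002, 0.997) example, while `hot` spreads the weight more evenly and `cold` puts almost all of it on the largest input.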
    In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter $\beta$ (or $T$) is varied. The softmax function is a multiple-variable generalization of the logistic function.

    == Interpretations ==

    === Smooth arg max ===

    The softmax function is a smooth approximation to the arg max function: the function whose value is the index of a tuple's largest element. The name "softmax" may be misleading. Softmax is not a smooth maximum (that is, a smooth approximation to the maximum function). The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning. This section uses the term "softargmax" for clarity.

    Formally, instead of considering the arg max as a function with categorical output $1, \dots, n$ (corresponding to the index), consider the arg max function with a one-hot representation of the output (assuming there is a unique maximum):

    $$\operatorname{arg\,max}(z_1, \dots, z_n) = (y_1, \dots, y_n) = (0, \dots, 0, 1, 0, \dots, 0),$$

    where the output coordinate $y_i = 1$ if and only if $i$ is the arg max of $(z_1, \dots, z_n)$, meaning $z_i$ is the unique maximum value of $(z_1, \dots, z_n)$. For example, in this encoding $\operatorname{arg\,max}(1, 5, 10) = (0, 0, 1)$, since the third argument is the maximum.

    This can be generalized to multiple arg max values (multiple equal $z_i$ being the maximum) by dividing the 1 between all max args; formally $1/k$, where $k$ is the number of arguments attaining the maximum.
    For example, $\operatorname{arg\,max}(1, 5, 5) = (0, 1/2, 1/2)$, since the second and third arguments are both the maximum. In case all arguments are equal, this is simply $\operatorname{arg\,max}(z, \dots, z) = (1/n, \dots, 1/n)$. Points $\mathbf{z}$ with multiple arg max values are singular points (or singularities, and form the singular set); these are the points where arg max is discontinuous (with a jump discontinuity). Points with a single arg max are known as non-singular or regular points.

    With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as $\beta \to \infty$, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input $\mathbf{z}$, as $\beta \to \infty$,

    $$\sigma_\beta(\mathbf{z}) \to \operatorname{arg\,max}(\mathbf{z}).$$

    However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output.
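The difference between pointwise and uniform convergence can be seen numerically. In this sketch (the function name and the chosen values of $\beta$ are illustrative), an input with a clear gap between its coordinates is already close to its arg max at moderate $\beta$, while an input close to the singular set needs a far larger $\beta$.

```python
import math


def softargmax(z, beta):
    """Softargmax with inverse temperature beta, computed stably."""
    scaled = [beta * v for v in z]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


betas = (1.0, 1e4, 1e6)
# Weight on the second (larger) coordinate as beta grows.
clear_gap = [softargmax([1.0, 2.0], b)[1] for b in betas]
near_tie = [softargmax([1.0, 1.0001], b)[1] for b in betas]
# An exact tie stays at (1/2, 1/2) for every beta.
exact_tie = softargmax([1.0, 1.0], 1e6)
```

At $\beta = 10^4$ the clear-gap input has essentially all of its weight on the larger coordinate, whereas the near-tie input has only about 0.73 there and approaches 1 only around $\beta = 10^6$.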
    For example, $\sigma_\beta(1, 1.0001) \to (0, 1)$, but $\sigma_\beta(1, 0.9999) \to (1, 0)$, and $\sigma_\beta(1, 1) = (1/2, 1/2)$ for all $\beta$: the closer the points are to the singular set $(x, x)$, the slower they converge. However, softargmax does converge compactly on the non-singular set.

    Conversely, as $\beta \to -\infty$, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".

    It is also the case that, for any fixed $\beta$, if one input $z_i$ is much larger than the others relative to the temperature, $T = 1/\beta$, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

    $$\sigma(0, 10) := \sigma_1(0, 10) = \left(\frac{1}{1 + e^{10}}, \frac{e^{10}}{1 + e^{10}}\right) \approx (0.00005, 0.99995)$$

  • Variational autoencoder


    In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling in 2013. It is part of the families of probabilistic graphical models and variational Bayesian methods.

    In addition to being seen as an autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of variational Bayesian methods, connecting a neural encoder network to its decoder through a probabilistic latent space (for example, as a multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution. Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together using the reparameterization trick, although the variance of the noise model can be learned separately.

    Although this type of model was initially designed for unsupervised learning, its effectiveness has been proven for semi-supervised learning and supervised learning.

    == Overview of architecture and operation ==

    A variational autoencoder is a generative model with a prior and noise distribution respectively. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually computationally intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. In that way, the same parameters are reused for multiple data points, which can result in massive memory savings.

    The first neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder. The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance; however, this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent.

    To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as the p-distribution. For example, a standard VAE task such as ImageNet is typically assumed to have Gaussian noise; however, tasks such as binarized MNIST require Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value. More recent approaches replace the Kullback–Leibler divergence (KL-D) with various statistical distances; see "Statistical distance VAE variants" below.
    == Formulation ==

    From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data $x$ under a chosen parameterized probability distribution $p_\theta(x) = p(x|\theta)$. This distribution is usually chosen to be a Gaussian $N(x|\mu, \sigma)$, which is parameterized by $\mu$ and $\sigma$ respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions where a prior is assumed over the latents $z$ result in intractable integrals. Let us find $p_\theta(x)$ by marginalizing over $z$:

    $$p_\theta(x) = \int_z p_\theta(x, z)\,dz,$$

    where $p_\theta(x, z)$ represents the joint distribution under $p_\theta$ of the observable data $x$ and its latent representation or encoding $z$. According to the chain rule, the equation can be rewritten as

    $$p_\theta(x) = \int_z p_\theta(x|z)\,p_\theta(z)\,dz.$$

    In the vanilla variational autoencoder, $z$ is usually taken to be a finite-dimensional vector of real numbers, and $p_\theta(x|z)$ to be a Gaussian distribution. Then $p_\theta(x)$ is a mixture of Gaussian distributions.
    It is now possible to define the set of relationships between the input data and its latent representation as the prior $p_\theta(z)$, the likelihood $p_\theta(x|z)$, and the posterior $p_\theta(z|x)$.

    Unfortunately, the computation of $p_\theta(z|x)$ is expensive and in most cases intractable. To make the computation feasible, it is necessary to introduce a further function that approximates the posterior distribution,

    $$q_\phi(z|x) \approx p_\theta(z|x),$$

    with $\phi$ defined as the set of real values that parametrize $q$. This is sometimes called amortized inference, since by "investing" in finding a good $q_\phi$, one can later infer $z$ from $x$ quickly without doing any integrals.

    In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution $p_\theta(x|z)$ is computed by the probabilistic decoder, and the approximated posterior distribution $q_\phi(z|x)$ is computed by the probabilistic encoder. Parametrize the encoder as $E_\phi$, and the decoder as $D_\theta$.

    == Evidence lower bound (ELBO) ==

    Like many deep learning approaches that use gradient-based optimization, VAEs require a differentiable loss function to update the network weights through backpropagation. For variational autoencoders, the idea is to jointly optimize the generative model parameters $\theta$ to reduce the reconstruction error between the input and the output, and $\phi$ to make $q_\phi(z|x)$ as close as possible to $p_\theta(z|x)$.
    As reconstruction loss, mean squared error and cross entropy are often used. The Kullback–Leibler divergence $D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x))$ can be used as a loss function to squeeze $q_\phi(z|x)$ under $p_\theta(z|x)$. This divergence loss expands to

    $$\begin{aligned} D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x)) &= \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\ &= \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)\,p_\theta(x)}{p_\theta(x, z)}\right] \\ &= \ln p_\theta(x) + \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)}{p_\theta(x, z)}\right]. \end{aligned}$$

    Now, define the evidence lower bound (ELBO):

    $$L_{\theta,\phi}(x) := \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot|x)).$$

    Maximizing the ELBO means finding

    $$\theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{argmax}}\, L_{\theta,\phi}(x)$$
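The identity $L_{\theta,\phi}(x) = \ln p_\theta(x) - D_{KL}$ can be checked on a toy model where every quantity is available in closed form. The sketch below is an assumption-laden stand-in for a real VAE: there are no neural networks, the prior is $z \sim N(0, 1)$, the likelihood is $x|z \sim N(z, 1)$, and the encoder is replaced by a fixed Gaussian $q(z|x)$ with hand-picked mean and standard deviation. Under these assumptions the evidence is $p(x) = N(0, 2)$ and the exact posterior is $N(x/2, 1/2)$. All function names are hypothetical.

```python
import math
import random


def log_normal_pdf(v, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2


def elbo_estimate(x, q_mean, q_std, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_q[ln p(x, z) - ln q(z | x)] for the toy model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Reparameterized sample: z = mean + std * eps with eps ~ N(0, 1).
        z = q_mean + q_std * rng.gauss(0.0, 1.0)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        total += log_joint - log_normal_pdf(z, q_mean, q_std)
    return total / n_samples


x = 2.0
log_evidence = log_normal_pdf(x, 0.0, math.sqrt(2.0))  # ln p(x), known in closed form here
tight = elbo_estimate(x, q_mean=x / 2, q_std=math.sqrt(0.5))  # q equals the true posterior
loose = elbo_estimate(x, q_mean=0.0, q_std=1.0)               # q equals the prior
```

When $q$ is the exact posterior the KL term vanishes and the estimate matches $\ln p(x)$; with the prior as $q$ the estimate falls below $\ln p(x)$ by roughly the KL divergence between the two Gaussians, which is about 1.15 for $x = 2$.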

  • IT baseline protection

    IT baseline protection

    The IT baseline protection (German: IT-Grundschutz) approach from the German Federal Office for Information Security (BSI) is a methodology to identify and implement computer security measures in an organization. The aim is the achievement of an adequate and appropriate level of security for IT systems. To reach this goal the BSI recommends "well-proven technical, organizational, personnel, and infrastructural safeguards". Organizations and federal agencies show their systematic approach to secure their IT systems (e.g. Information Security Management System) by obtaining an ISO/IEC 27001 Certificate on the basis of IT-Grundschutz. == Overview baseline security == The term baseline security signifies standard security measures for typical IT systems. It is used in various contexts with somewhat different meanings. For example: Microsoft Baseline Security Analyzer: Software tool focused on Microsoft operating system and services security Cisco security baseline: Vendor recommendation focused on network and network device security controls Nortel baseline security: Set of requirements and best practices with a focus on network operators ISO/IEC 13335-3 defines a baseline approach to risk management. This standard has been replaced by ISO/IEC 27005, but the baseline approach was not taken over yet into the 2700x series. There are numerous internal baseline security policies for organizations, The German BSI has a comprehensive baseline security standard, that is compliant with the ISO/IEC 27000-series == BSI IT baseline protection == The foundation of an IT baseline protection concept is initially not a detailed risk analysis. It proceeds from overall hazards. Consequently, sophisticated classification according to damage extent and probability of occurrence is ignored. Three protection needs categories are established. With their help, the protection needs of the object under investigation can be determined. 
Based on these, appropriate personnel, technical, organizational and infrastructural security measures are selected from the IT Baseline Protection Catalogs. The BSI's IT Baseline Protection Catalogs offer a "cookbook recipe" for a normal level of protection. Besides probability of occurrence and potential extent of damage, implementation costs are also considered. Because the Baseline Protection Catalogs start from overall hazards, costly security analyses requiring expert knowledge can be dispensed with. A relative layperson can identify the measures to be taken and implement them in cooperation with professionals. The BSI grants a baseline protection certificate as confirmation of the successful implementation of baseline protection. In stages 1 and 2, this is based on self-declaration. In stage 3, an independent, BSI-licensed auditor completes an audit. The certification process has been internationalized since 2006. ISO/IEC 27001 certification can occur simultaneously with IT baseline protection certification. (The ISO/IEC 27001 standard is the successor of BS 7799-2.) This process is based on the new BSI security standards and takes account of a development that has been under way for some time. Corporations having themselves certified under the BS 7799-2 standard are obliged to carry out a risk assessment. To make this easier, most resort to the protection needs analysis pursuant to the IT Baseline Protection Catalogs. The advantage is not only conformity with the strict BSI requirements, but also attainment of BS 7799-2 certification. Beyond this, the BSI offers a few aids such as the policy template and the GSTOOL.
One data protection component is available, which was produced in cooperation with the German Federal Commissioner for Data Protection and Freedom of Information and the state data protection authorities and integrated into the IT Baseline Protection Catalog. This component is not considered, however, in the certification process. == Baseline protection process == The following steps are taken pursuant to the baseline protection process during structure analysis and protection needs analysis: The IT network is defined. IT structure analysis is carried out. Protection needs determination is carried out. A baseline security check is carried out. IT baseline protection measures are implemented. Creation occurs in the following steps: IT structure analysis (survey) Assessment of protection needs Selection of actions Running comparison of nominal and actual. === IT structure analysis === An IT network includes the totality of infrastructural, organizational, personnel, and technical components serving the fulfillment of a task in a particular information processing application area. An IT network can thereby encompass the entire IT character of an institution or individual division, which is partitioned by organizational structures as, for example, a departmental network, or as shared IT applications, for example, a personnel information system. It is necessary to analyze and document the information technological structure in question to generate an IT security concept and especially to apply the IT Baseline Protection Catalogs. Due to today's usually heavily networked IT systems, a network topology plan offers a starting point for the analysis. The following aspects must be taken into consideration: The available infrastructure, The organizational and personnel framework for the IT network, Networked and non-networked IT systems employed in the IT network. The communications connections between IT systems and externally, IT applications run within the IT network. 
=== Protection needs determination === The purpose of the protection needs determination is to investigate what protection is sufficient and appropriate for the information and information technology in use. To this end, the damage to each application and the processed information that could result from a breach of confidentiality, integrity or availability is considered. A realistic assessment of the possible follow-on damage is important in this context. A division into the three protection needs categories "low to medium", "high" and "very high" has proved useful. "Public", "internal" and "secret" are often used for confidentiality. === Modelling === Heavily networked IT systems typically characterize information technology in government and business these days. As a rule, therefore, it is advantageous to consider the entire IT system, and not just individual systems, within the scope of an IT security analysis and concept. To keep this task manageable, it makes sense to logically partition the entire IT system into parts and to consider each part, or even an IT network, separately. Detailed documentation of its structure is a prerequisite for the use of the IT Baseline Protection Catalogs on an IT network. This can be achieved, for example, via the IT structure analysis described above. The components of the IT Baseline Protection Catalogs must ultimately be mapped onto the components of the IT network in question in a modelling step. === Baseline security check === The baseline security check is an organisational instrument offering a quick overview of the prevailing IT security level. With the help of interviews, the status quo of an existing IT network (as modelled by IT baseline protection) is investigated relative to the security measures implemented from the IT Baseline Protection Catalogs. The result is a catalog in which the implementation status "dispensable", "yes", "partly", or "no" is entered for each relevant measure.
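The result catalog of a baseline security check can be pictured as a simple table of measures and statuses. The following is a minimal sketch under stated assumptions: the measure identifiers and titles are hypothetical placeholders, not entries from the real IT Baseline Protection Catalogs, and `open_measures` is an illustrative helper name.

```python
from dataclasses import dataclass

# The four implementation statuses named in the text.
STATUSES = ("dispensable", "yes", "partly", "no")


@dataclass
class CheckEntry:
    measure_id: str  # hypothetical identifier, not a real catalog number
    title: str
    status: str

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def open_measures(catalog):
    """Nominal vs. actual comparison: keep measures that are missing or only partly implemented."""
    return [entry for entry in catalog if entry.status in ("partly", "no")]


# Hypothetical interview results for one modelled IT network.
catalog = [
    CheckEntry("X.1", "Documented backup concept", "yes"),
    CheckEntry("X.2", "Regular restore tests", "partly"),
    CheckEntry("X.3", "Locked server room", "no"),
    CheckEntry("X.4", "Fax machine access rules", "dispensable"),
]

for entry in open_measures(catalog):
    print(entry.measure_id, entry.title, entry.status)
```

Measures marked "dispensable" are treated here as needing no action, which is one plausible reading of that status.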
By identifying measures that are not yet, or only partially, implemented, improvement options for the security of the information technology in question are highlighted. The baseline security check gives information about measures that are still missing (nominal vs. actual comparison). From this follows what remains to be done to achieve baseline protection. Not all measures suggested by this baseline check need to be implemented; particular circumstances must be taken into account. It could be that several more or less unimportant applications, each with lesser protection needs, are running on a server. In their totality, however, these applications are to be provided with a higher level of protection. This is called the cumulation effect. The applications running on a server determine its need for protection. Several IT applications can run on an IT system. When this occurs, the application with the greatest need for protection determines the IT system's protection category. Conversely, it is conceivable that an IT application with great protection needs does not automatically transfer this to the IT system. This may happen because the IT system is configured redundantly, or because only an inconsequential part is running on it. This is called the distribution effect. This is the case, fo

  • Count sketch

    Count sketch

    Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms. It was invented by Moses Charikar, Kevin Chen and Martin Farach-Colton in an effort to speed up the AMS Sketch by Alon, Matias and Szegedy for approximating the frequency moments of streams (these calculations require counting the number of occurrences of the distinct elements of the stream). The sketch is nearly identical to the feature hashing algorithm by John Moody, but differs in its use of hash functions with low dependence, which makes it more practical. In order to still have a high probability of success, the median trick is used to aggregate multiple count sketches, rather than the mean. These properties allow its use in explicit kernel methods and bilinear pooling in neural networks, and make it a cornerstone of many numerical linear algebra algorithms. == Intuitive explanation == The inventors of this data structure offer the following iterative explanation of its operation: at the simplest level, the output of a single hash function $s$ mapping stream elements $q$ into {+1, -1} feeds a single up/down counter $C$. After a single pass over the data, the frequency $n(q)$ of a stream element $q$ can be approximated, although extremely poorly, by the expected value $\mathbf{E}[C\cdot s(q)]$; a straightforward way to improve the variance of the previous estimate is to use an array of different hash functions $s_i$, each connected to its own counter $C_i$.
For each $i$, the identity $\mathbf{E}[C_i\cdot s_i(q)]=n(q)$ still holds, so averaging across the $i$ range will tighten the approximation; the previous construct still has a major deficiency: if a lower-frequency-but-still-important output element $a$ exhibits a hash collision with a high-frequency element, even for one of the $s_i$ hashes, the estimate of $n(a)$ can be significantly affected. Avoiding this requires reducing how often any two distinct elements update the same counter. This is achieved by replacing each $C_i$ in the previous construct with an array of $m$ counters (making the counter set into a two-dimensional matrix $C_{i,j}$), with the index $j$ of the particular counter to be incremented or decremented selected via another set of hash functions $h_i$ that map element $q$ into the range {1..m}. Since $\mathbf{E}[C_{i,h_i(q)}\cdot s_i(q)]=n(q)$, averaging across all values of $i$ will work. == Mathematical definition == 1. For constants $w$ and $t$ (to be defined later) independently choose $d=2t+1$ random hash functions $h_1,\dots,h_d$ and $s_1,\dots,s_d$ such that $h_i:[n]\to[w]$ and $s_i:[n]\to\{\pm 1\}$. It is necessary that the hash families from which $h_i$ and $s_i$ are chosen be pairwise independent. 2. For each item $q_i$ in the stream, add $s_j(q_i)$ to the $h_j(q_i)$th bucket of the $j$th hash.
At the end of this process, one has $wd$ sums $(C_{ij})$ where $$C_{i,j}=\sum_{h_i(k)=j}s_i(k).$$ To estimate the count of $q$, one computes the following value: $$r_q=\operatorname{median}_{i=1}^{d}\, s_i(q)\cdot C_{i,h_i(q)}.$$ The values $s_i(q)\cdot C_{i,h_i(q)}$ are unbiased estimates of how many times $q$ has appeared in the stream. The estimate $r_q$ has variance $O(\min\{m_1^2/w^2,\, m_2^2/w\})$, where $m_1$ is the length of the stream and $m_2^2$ is $\sum_q\left(\sum_i [q_i=q]\right)^2$. Furthermore, with probability $1-e^{-O(t)}$, $r_q$ is never more than $2m_2/\sqrt{w}$ off from the true value. === Vector formulation === Alternatively, count sketch can be seen as a linear mapping with a non-linear reconstruction function. Let $M^{(i)}\in\{-1,0,1\}^{w\times n}$, $i\in[d]$, be a collection of $d=2t+1$ matrices, defined by $M^{(i)}_{h_i(j),j}=s_i(j)$ for $j\in[n]$ and 0 everywhere else. Then a vector $v\in\mathbb{R}^n$ is sketched by $C^{(i)}=M^{(i)}v\in\mathbb{R}^w$. To reconstruct $v$ we take $v_j^{*}=\operatorname{median}_i\, C^{(i)}_{h_i(j)}\, s_i(j)$.
This gives the same guarantees as stated above, if we take $m_1=\|v\|_1$ and $m_2=\|v\|_2$. == Relation to Tensor sketch == The count sketch projection of the outer product of two vectors is equivalent to the convolution of two component count sketches. The count sketch computes a vector convolution $C^{(1)}x\ast C^{(2)}x^T$, where $C^{(1)}$ and $C^{(2)}$ are independent count sketch matrices. Pham and Pagh show that this equals $C(x\otimes x^T)$, a count sketch $C$ of the outer product of vectors, where $\otimes$ denotes the Kronecker product. The fast Fourier transform can be used to do fast convolution of count sketches. By using the face-splitting product such structures can be computed much faster than normal matrices.
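The update and median-estimate steps from the mathematical definition can be sketched in a few lines. This is a minimal illustration, not a reference implementation. It assumes stream items are non-negative integers, and it uses `(a*q + b) mod p` hashes as a common, only approximately pairwise-independent choice. The class name `CountSketch` and its parameters are hypothetical.

```python
import random
from statistics import median


class CountSketch:
    """d rows of w counters; row i uses a bucket hash h_i and a sign hash s_i."""

    P = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus

    def __init__(self, d, w, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        # (a, b) coefficients of the assumed hash family q -> (a*q + b) mod P.
        self.h_params = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(d)]
        self.s_params = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, i, q):
        a, b = self.h_params[i]
        return ((a * q + b) % self.P) % self.w

    def _s(self, i, q):
        a, b = self.s_params[i]
        return 1 if ((a * q + b) % self.P) % 2 == 0 else -1

    def add(self, q, count=1):
        # Step 2 of the definition: add s_i(q) to bucket h_i(q) in every row i.
        for i in range(self.d):
            self.C[i][self._h(i, q)] += count * self._s(i, q)

    def estimate(self, q):
        # r_q = median over rows i of s_i(q) * C[i][h_i(q)].
        return median(self._s(i, q) * self.C[i][self._h(i, q)] for i in range(self.d))


sketch = CountSketch(d=5, w=64, seed=1)
sketch.add(42, 5)
print(sketch.estimate(42))  # a lone item cannot collide with anything, so this prints 5
```

With many distinct items the estimate is only approximate. Taking the median over an odd number of rows `d` is what limits the influence of rows where a heavy item happens to collide.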

  • Calibration (statistics)

    Calibration (statistics)

    There are two main uses of the term calibration in statistics that denote special types of statistical inference problems. Calibration can mean a reverse process to regression, where instead of a future dependent variable being predicted from known explanatory variables, a known observation of the dependent variables is used to predict a corresponding explanatory variable; procedures in statistical classification to determine class membership probabilities which assess the uncertainty of a given new observation belonging to each of the already established classes. In addition, calibration is used in statistics with the usual general meaning of calibration. For example, model calibration can be also used to refer to Bayesian inference about the value of a model's parameters, given some data set, or more generally to any type of fitting of a statistical model. As Philip Dawid puts it, "a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent." == In classification == Calibration in classification means transforming classifier scores into class membership probabilities. An overview of calibration methods for two-class and multi-class classification tasks is given by Gebel (2009). A classifier might separate the classes well, but be poorly calibrated, meaning that the estimated class probabilities are far from the true class probabilities. In this case, a calibration step may help improve the estimated probabilities. A variety of metrics exist that are aimed to measure the extent to which a classifier produces well-calibrated probabilities. Foundational work includes the Expected Calibration Error (ECE). 
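As a rough illustration of how a calibration metric of this kind works, the sketch below computes one common equal-width-bin variant of the Expected Calibration Error for binary outcomes. It compares the mean predicted probability in each bin with the observed frequency of the event. The bin count, the function name and the example forecasts are assumptions for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average over bins of |mean predicted probability - observed frequency|."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_p - freq)
    return ece


# A forecaster in Dawid's sense: of ten events given probability 0.3, three occur.
well_calibrated = expected_calibration_error([0.3] * 10, [1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# A forecaster who separates outcomes perfectly but is under-confident:
# 0.6 for everything that happens, 0.4 for everything that does not.
under_confident = expected_calibration_error([0.6] * 5 + [0.4] * 5, [1] * 5 + [0] * 5)

print(well_calibrated, under_confident)
```

The first forecaster scores essentially zero. The second scores about 0.4 despite ranking every event correctly, which shows that good class separation does not imply good calibration.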
Into the 2020s, variants include the Adaptive Calibration Error (ACE) and the Test-based Calibration Error (TCE), which address limitations of the ECE metric that may arise when classifier scores concentrate on a narrow subset of the [0,1] range. A 2020s advancement in calibration assessment is the introduction of the Estimated Calibration Index (ECI). The ECI extends the concepts of the Expected Calibration Error (ECE) to provide a more nuanced measure of a model's calibration, particularly addressing overconfidence and underconfidence tendencies. Originally formulated for binary settings, the ECI has been adapted for multiclass settings, offering both local and global insights into model calibration. This framework aims to overcome some of the theoretical and interpretative limitations of existing calibration metrics. Through a series of experiments, Famiglini et al. demonstrate the framework's effectiveness in delivering a more accurate understanding of model calibration levels and discuss strategies for mitigating biases in calibration assessment. An online tool has been proposed to compute both ECE and ECI. The following univariate calibration methods exist for transforming classifier scores into class membership probabilities in the two-class case: assignment value approach, see Garczarek (2002); Bayes approach, see Bennett (2002); isotonic regression, see Zadrozny and Elkan (2002); Platt scaling (a form of logistic regression), see Lewis and Gale (1994) and Platt (1999); Bayesian Binning into Quantiles (BBQ) calibration, see Naeini, Cooper, Hauskrecht (2015); beta calibration, see Kull, Filho, Flach (2017). === In probability prediction and forecasting === In prediction and forecasting, a Brier score is sometimes used to assess the prediction accuracy of a set of predictions, specifically whether the magnitude of the assigned probabilities tracks the relative frequency of the observed outcomes. Philip E.
Tetlock employs the term "calibration" in this sense in his 2015 book Superforecasting. This differs from accuracy and precision. For example, as expressed by Daniel Kahneman, "if you give all events that happen a probability of .6 and all the events that don't happen a probability of .4, your discrimination is perfect but your calibration is miserable". In meteorology, in particular as concerns weather forecasting, a related mode of assessment is known as forecast skill. == In regression == The calibration problem in regression is the use of known data on the observed relationship between a dependent variable and an independent variable to make estimates of other values of the independent variable from new observations of the dependent variable. This can be known as "inverse regression"; there is also sliced inverse regression. The following multivariate calibration methods exist for transforming classifier scores into class membership probabilities when there are more than two classes: reduction to binary tasks and subsequent pairwise coupling, see Hastie and Tibshirani (1998); Dirichlet calibration, see Gebel (2009). === Example === One example is that of dating objects, using observable evidence such as tree rings for dendrochronology or carbon-14 for radiometric dating. The observation is caused by the age of the object being dated, rather than the reverse, and the aim is to use the method for estimating dates based on new observations. The problem is whether the model used for relating known ages with observations should aim to minimise the error in the observation, or minimise the error in the date. The two approaches will produce different results, and the difference will increase if the model is then used for extrapolation at some distance from the known results.
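The difference between the two approaches in the dating example can be seen with a small least-squares sketch. The ages and measurements below are made-up illustrative numbers, and the function names are hypothetical. The "classical" estimator fits observation on age (minimising error in the observation) and inverts the line, while the "inverse" estimator fits age on observation directly (minimising error in the date).

```python
def fit_line(u, v):
    """Ordinary least squares for v ~ intercept + slope * u; returns (intercept, slope)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sxx = sum((a - mean_u) ** 2 for a in u)
    slope = sxy / sxx
    return mean_v - slope * mean_u, slope


# Hypothetical calibration data: known ages and a noisy measurement for each object.
ages = [100, 200, 300, 400, 500]
observations = [12, 19, 33, 38, 52]

a, b = fit_line(ages, observations)   # observation = a + b * age
c, d = fit_line(observations, ages)   # age = c + d * observation


def classical_estimate(y0):
    """Invert the observation-on-age line to date a new observation y0."""
    return (y0 - a) / b


def inverse_estimate(y0):
    """Read the age directly off the age-on-observation line."""
    return c + d * y0


def gap(y0):
    return abs(classical_estimate(y0) - inverse_estimate(y0))


for y0 in (35, 100):
    print(y0, classical_estimate(y0), inverse_estimate(y0), gap(y0))
```

Both fitted lines pass through the point of means, so the two estimates agree there. The gap between them widens as the new observation moves away from the calibration data, which is the extrapolation effect described above.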

  • Vapnik–Chervonenkis theory

    Vapnik–Chervonenkis theory

    Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view. == Introduction == VC theory covers at least four parts (as explained in The Nature of Statistical Learning Theory): Theory of consistency of learning processes What are (necessary and sufficient) conditions for consistency of a learning process based on the empirical risk minimization principle? Nonasymptotic theory of the rate of convergence of learning processes How fast is the rate of convergence of the learning process? Theory of controlling the generalization ability of learning processes How can one control the rate of convergence (the generalization ability) of the learning process? Theory of constructing learning machines How can one construct algorithms that can control the generalization ability? VC Theory is a major subbranch of statistical learning theory. One of its main applications in statistical learning theory is to provide generalization conditions for learning algorithms. From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization. In addition, VC theory and VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of the VC theory, and are employed in proving generalization. Several techniques will be introduced that are widely used in the empirical process and VC theory. The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics. == Overview of VC theory in empirical processes == === Background on empirical processes === Let ( X , A ) {\displaystyle ({\mathcal {X}},{\mathcal {A}})} be a measurable space. 
For any measure $Q$ on $(\mathcal{X},\mathcal{A})$, and any measurable function $f:\mathcal{X}\to\mathbf{R}$, define $$Qf=\int f\,dQ.$$ Measurability issues will be ignored here. Let $\mathcal{F}$ be a class of measurable functions $f:\mathcal{X}\to\mathbf{R}$ and define: $$\|Q\|_{\mathcal{F}}=\sup\{|Qf| \,:\, f\in\mathcal{F}\}.$$ Let $X_1,\ldots,X_n$ be independent, identically distributed random elements of $(\mathcal{X},\mathcal{A})$. Then define the empirical measure $$\mathbb{P}_n=n^{-1}\sum_{i=1}^{n}\delta_{X_i},$$ where $\delta$ here stands for the Dirac measure. The empirical measure induces a map $\mathcal{F}\to\mathbf{R}$ given by: $$f\mapsto\mathbb{P}_n f=\frac{1}{n}\left(f(X_1)+\dots+f(X_n)\right)$$ Now suppose $P$ is the underlying true distribution of the data, which is unknown. Empirical process theory aims at identifying classes $\mathcal{F}$ for which statements such as the following hold: uniform law of large numbers: $$\|\mathbb{P}_n-P\|_{\mathcal{F}}\underset{n}{\to}0,$$ that is, as $n\to\infty$, $$\left|\frac{1}{n}\left(f(X_1)+\dots+f(X_n)\right)-\int f\,dP\right|\to 0$$ uniformly for all $f\in\mathcal{F}$.
uniform central limit theorem: $$\mathbb{G}_n=\sqrt{n}\,(\mathbb{P}_n-P)\rightsquigarrow\mathbb{G},\quad\text{in }\ell^{\infty}(\mathcal{F})$$ In the former case $\mathcal{F}$ is called a Glivenko–Cantelli class, and in the latter case (under the assumption $\forall x,\ \sup\nolimits_{f\in\mathcal{F}}|f(x)-Pf|<\infty$) the class $\mathcal{F}$ is called Donsker or $P$-Donsker. A Donsker class is Glivenko–Cantelli in probability by an application of Slutsky's theorem. These statements are true for a single $f$, by standard LLN and CLT arguments under regularity conditions; the difficulty in empirical process theory arises because joint statements are being made for all $f\in\mathcal{F}$. Intuitively then, the set $\mathcal{F}$ cannot be too large, and as it turns out, the geometry of $\mathcal{F}$ plays a very important role. One way of measuring the size of the function set $\mathcal{F}$ is to use the so-called covering numbers. The covering number $N(\varepsilon,\mathcal{F},\|\cdot\|)$ is the minimal number of balls $\{g:\|g-f\|<\varepsilon\}$ needed to cover the set $\mathcal{F}$ (here it is assumed that there is an underlying norm on $\mathcal{F}$). The entropy is the logarithm of the covering number. Two sufficient conditions are provided below, under which it can be proved that the set $\mathcal{F}$ is Glivenko–Cantelli or Donsker. A class $\mathcal{F}$ is $P$-Glivenko–Cantelli if it is $P$-measurable with envelope $F$ such that $P^{\ast}F<\infty$ and satisfies: $$\forall\varepsilon>0\quad\sup\nolimits_{Q}N(\varepsilon\|F\|_{Q},\mathcal{F},L_1(Q))<\infty.$$ The next condition is a version of Dudley's theorem. If $\mathcal{F}$ is a class of functions such that $$\int_0^{\infty}\sup\nolimits_{Q}\sqrt{\log N\left(\varepsilon\|F\|_{Q,2},\mathcal{F},L_2(Q)\right)}\,d\varepsilon<\infty$$ then $\mathcal{F}$ is $P$-Donsker for every probability measure $P$ such that $P^{\ast}F^2<\infty$. In the last integral, the notation means $\|f\|_{Q,2}=\left(\int|f|^2\,dQ\right)^{\frac{1}{2}}$. === Symmetrization === The majority of the arguments about how to bound the empirical process rely on symmetrization, maximal and concentration inequalities, and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions (including the proof of the VC inequality which is discussed in the next section), it is presented here. Consider the empirical process: $$f\mapsto(\mathbb{P}_n-P)f=\frac{1}{n}\sum_{i=1}^{n}\left(f(X_i)-Pf\right)$$ It turns out that there is a connection between the empirical process and the following symmetrized process: $$f\mapsto\mathbb{P}_n^0 f=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(X_i)$$ The symmetrized process is a Rademacher process, conditionally on the data $X_i$. Therefore, it is a sub-Gaussian process by Hoeffding's inequality. Lemma (Symmetrization).
For every nondecreasing, convex $\Phi:\mathbf{R}\to\mathbf{R}$ and class of measurable functions $\mathcal{F}$, $$\mathbb{E}\,\Phi\left(\|\mathbb{P}_n-P\|_{\mathcal{F}}\right)\leq\mathbb{E}\,\Phi\left(2\left\|\mathbb{P}_n^0\right\|_{\mathcal{F}}\right)$$ The proof of the Symmetrization lemma relies on introducing independent copies of the original variables $X_i$ (sometimes referred to as a ghost sample) and replacing the inner expectation of the LHS by these copies. After an application of Jensen's inequality, different signs can be introduced (hence the name symmetrization) without changing the expectation. The proof can be found below because of its instructive nature. The same proof method can be used to prove the Glivenko–Cantelli theorem. A typical way of proving empirical CLTs first uses symmetrization to pass from the empirical process to $\mathbb{P}_n^0$ and then argues conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties. === VC Connection === It turns out that there is a fascinating connection between certain combinatorial properties of the set $\mathcal{F}$ and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik–Chervonenkis classes of sets, or VC sets for short. Consider a collection $\mathcal{C}$ of subsets of the sample space $\mathcal{X}$

  • Pocket (service)

    Pocket (service)

    Pocket, formerly known as Read It Later, was a social bookmarking service for storing, sharing and discovering web bookmarks, first released in 2007. Mozilla, the developer of Pocket, announced in May 2025 that it was discontinuing the service and would shut it down in July of that year. == History == Pocket was introduced in August 2007 as a Mozilla Firefox browser extension named Read It Later by Nathan (Nate) Weiner. Once his product was used by millions of people, he moved his office to Silicon Valley and four other people joined the Read It Later team. Weiner's intention was for the application to be like a TiVo directory for web content and to give users access to that content on any device. Read It Later obtained venture capital investments of US$2.5 million in 2011 and $5.0 million in 2012. The 2011 funding came from Foundation Capital, Baseline Ventures, Google Ventures, Founder Collective and unnamed angel investors. The company rejected an acquisition offer by Evernote after showing concerns that Evernote intended to shut down the Read It Later service and amalgamate its functionality into Evernote's main service. Initially, the Read It Later app was available in a free version and a paid version that included additional features. After the rebranding to Pocket, all paid features were made available in a free and advertisement-free app. In May 2014, a paid subscription service called Pocket Premium was introduced, adding server-side storage of articles and more powerful search tools. In June 2015, Pocket was included in Firefox, via a toolbar button and link to a user's Pocket list in the bookmark's menu. The integration was controversial, as users displayed concerns for the direct integration of a proprietary service into an open source application, and that it could not be completely disabled without editing advanced settings, unlike other third-party extensions. 
A Mozilla spokesperson stated that the feature was meant to leverage the service's popularity among Firefox users and clarified that all code related to the integration was open source. The spokesperson added that "[Mozilla had] gotten lots of positive feedback about the integration from users". On February 27, 2017, Pocket announced that it had been acquired by Mozilla Corporation, the commercial arm of Firefox's non-profit development group. Mozilla staff stated that Pocket would continue to operate as an independent subsidiary but that it would be leveraged as part of an ongoing "Context Graph" project. There were plans to open-source the server-side code of Pocket, though only parts of the project had been open-sourced as of 2024. On May 22, 2025, Mozilla announced that it would shut down Pocket on July 8, 2025. Exports of user data would be available until October 8, 2025, when accounts would be deleted. The email newsletter Pocket Hits was rebranded as Ten Tabs on June 12 as part of the closure, with it being changed to release only on weekdays. == Functions == The application allows the user to save an article or web page to remote servers for later reading. The article is sent to the user's Pocket list (synced to all of their devices) for offline reading. Pocket makes the article more readable by removing clutter and enabling the user to add tags and adjust text settings. == User base == The application had 17 million users and 1 billion saves, as of September 2015. Pocket was listed among Time magazine's 50 Best Android Applications for 2013. == Reception == Kent German of CNET said that "Read It Later is oh so incredibly useful for saving all the articles and news stories I find while commuting or waiting in line." Erez Zukerman of PC World said that supporting the developer is enough reason to buy what he deemed a "handy app". 
Bill Barol of Forbes said that although Read It Later works less well than Instapaper, "it makes my beloved Instapaper look and feel a little stodgy." In 2015, Pocket was awarded a Material Design Award for Adaptive Layout by Google for their Android application.

  • Blockmodeling

    Blockmodeling is a set of approaches, or a coherent framework, used for analyzing social structure and for setting procedures for partitioning (clustering) a social network's units (nodes, vertices, actors) based on specific patterns, which form a distinctive structure through interconnectivity. It is primarily used in statistics, machine learning and network science. As an empirical procedure, blockmodeling assumes that all the units in a specific network can be grouped together to the extent to which they are equivalent. Equivalence can be structural, regular or generalized. Using blockmodeling, a network can be analyzed using newly created blockmodels, which transform a large and complex network into a smaller and more comprehensible one. At the same time, blockmodeling is used to operationalize social roles. While some contend that blockmodeling is just a clustering method, Bonacich and McConaghy state that "it is a theoretically grounded and algebraic approach to the analysis of the structure of relations". Blockmodeling's unique ability lies in the fact that it considers the structure not just as a set of direct relations, but also takes into account all other possible compound relations that are based on the direct ones. The principles of blockmodeling were first introduced by Francois Lorrain and Harrison C. White in 1971. Blockmodeling is considered "an important set of network analytic tools" as it deals with the delineation of role structures (the well-defined places in social structures, also known as positions) and the discerning of the fundamental structure of social networks. According to Batagelj, the primary "goal of blockmodeling is to reduce a large, potentially incoherent network to a smaller comprehensible structure that can be interpreted more readily". Blockmodeling was at first used for analysis in sociometry and psychometrics, but has since spread to other sciences.
== Definition == A network as a system is composed of (or defined by) two different sets: one set of units (nodes, vertices, actors) and one set of links between the units. Using both sets, it is possible to create a graph describing the structure of the network. During blockmodeling, the researcher is faced with two problems: how to partition the units (i.e., how to determine the clusters (or classes) that then form vertices in a blockmodel) and then how to determine the links in the blockmodel (and at the same time the values of these links). In the social sciences, the networks are usually social networks, composed of several individuals (units) and selected social relationships among them (links). Real-world networks can be large and complex; blockmodeling is used to simplify them into smaller structures that can be easier to interpret. Specifically, blockmodeling partitions the units into clusters and then determines the ties among the clusters. At the same time, blockmodeling can be used to explain the social roles existing in the network, as it is assumed that the created cluster of units mimics (or is closely associated with) the units' social roles. Blockmodeling can thus be defined as a set of approaches for partitioning units into clusters (also known as positions) and links into blocks, which are further defined by the newly obtained clusters. A block (also blockmodel) is defined as a submatrix that shows interconnectivity (links) between nodes present in the same or different clusters. Each of these positions in the cluster is defined by a set of (in)direct ties to and from other social positions. These links (connections) can be directed or undirected; there can be multiple links between the same pair of objects, or the links can have weights on them. If a network has no multiple links, it is called a simple network. A matrix representation of a graph is composed of ordered units, in rows and columns, based on their names.
The ordered units with similar patterns of links are partitioned together in the same clusters. Clusters are then arranged together so that units from the same clusters are placed next to each other, thus preserving interconnectivity. In the next step, the units (from the same clusters) are transformed into a blockmodel. With this, several blockmodels are usually formed, one being a core cluster and the others being cohesive; a core cluster is always connected to cohesive ones, while cohesive ones cannot be linked together. Clustering of nodes is based on equivalence, such as structural or regular equivalence. The primary objective of the matrix form is to visually present relations between the persons included in the cluster. These ties are coded dichotomously (as present or absent), and the rows in the matrix form indicate the source of the ties, while the columns represent the destination of the ties. There are two basic approaches to equivalence: the equivalent units have the same connection pattern to the same neighbors, or these units have the same or a similar connection pattern to different neighbors. If the units are connected to the rest of the network in identical ways, then they are structurally equivalent. Units can also be regularly equivalent, when they are equivalently connected to equivalent others. With blockmodeling, it is necessary to consider the issue of results being affected by measurement errors in the initial stage of acquiring the data. == Different approaches == Different kinds of networks call for different blockmodeling approaches. Networks can be one–mode or two–mode. In the former, all units are of the same type and any unit can be connected to any other unit, while in the latter the units are connected only to units of a different type. Regarding relationships between units, they can be single–relational or multi–relational networks.
Furthermore, the networks can be temporal or multilevel, and also binary (only 0 and 1), signed (allowing negative ties) or valued (allowing other values). Different approaches to blockmodeling can be grouped into two main classes: deterministic blockmodeling and stochastic blockmodeling approaches. Deterministic blockmodeling is then further divided into direct and indirect blockmodeling approaches. Among direct blockmodeling approaches are structural equivalence and regular equivalence. Structural equivalence is a state in which units are connected to the rest of the network in an identical way, while regular equivalence occurs when units are equally related to equivalent others (units do not necessarily share neighbors, but have neighbors that are themselves similar). Indirect blockmodeling approaches, where partitioning is dealt with as a traditional cluster analysis problem (measuring (dis)similarity results in a (dis)similarity matrix), are: conventional blockmodeling; generalized blockmodeling (generalized blockmodeling of binary networks, generalized blockmodeling of valued networks and generalized homogeneity blockmodeling); and prespecified blockmodeling. According to Brusco and Steinley (2011), blockmodeling can be categorized along a number of dimensions: deterministic or stochastic blockmodeling, one–mode or two–mode networks, signed or unsigned networks, exploratory or confirmatory blockmodeling. == Blockmodels == Blockmodels (sometimes also block models) are structures in which: vertices (e.g., units, nodes) are assembled within a cluster, with each cluster identified as a vertex, so that a graph can be constructed from such vertices; and all the links (ties) in a block are combined and represented as a single link between positions, so that one tie is constructed for each block. When there are no ties in a block, there will be no ties between the two positions that define the block.
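The reduction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy network `adj` is invented, the helper names `profile_clusters` and `image_matrix` are hypothetical, and structural equivalence is simplified to "identical row and column in the adjacency matrix", which ignores the usual special handling of ties between the two units being compared.

```python
# Toy directed, binary, one-mode network (invented for illustration).
# adj[i][j] == 1 means there is a tie from unit i to unit j.
adj = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]


def profile_clusters(adj):
    """Group units whose outgoing (row) and incoming (column) ties are identical.

    Simplifying assumption: exact profile matching stands in for
    structural equivalence.
    """
    n = len(adj)
    groups = {}
    for i in range(n):
        row = tuple(adj[i])
        col = tuple(adj[r][i] for r in range(n))
        groups.setdefault((row, col), []).append(i)
    # Dicts preserve insertion order, so clusters appear by first member.
    return list(groups.values())


def image_matrix(adj, clusters):
    """Build one tie per block: 1 if the block contains any tie, else 0."""
    return [
        [
            int(any(adj[i][j] for i in src for j in dst if i != j))
            for dst in clusters
        ]
        for src in clusters
    ]


clusters = profile_clusters(adj)
image = image_matrix(adj, clusters)
print(clusters)
print(image)
```

Here each cluster becomes one position, and a block with no ties yields no tie between the corresponding positions. Real data rarely contains exactly equivalent units, which is why the approaches listed above optimize a criterion function instead of requiring exact matches.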
Computer programs can partition the social network according to pre-set conditions. When empirical blocks can be reasonably approximated in terms of ideal blocks, such blockmodels can be reduced to a blockimage, which is a representation of the original network, capturing its underlying 'functional anatomy'. Thus, blockmodels can "permit the data to characterize their own structure", and at the same time not seek to manifest a preconceived structure imposed by the researcher. Blockmodels can be created indirectly or directly, based on the construction of the criterion function. Indirect construction refers to a function based on a "compatible (dis)similarity measure between pairs of units", while the direct construction is "a function measuring the fit of real blocks induced by a given clustering to the corresponding ideal blocks with perfect relations within each cluster and between clusters according to the considered types of connections (equivalence)". === Types === Blockmodels can be specified regarding the intuition, substance or the insight into the nature of the studied network; this can result in such models as follows: parent-child role systems, organizational hierarchies, systems of

  • Sparse PCA

    Sparse principal component analysis (SPCA or sparse PCA) is a technique used in statistical analysis and, in particular, in the analysis of multivariate data sets. It extends the classic method of principal component analysis (PCA) for the reduction of dimensionality of data by introducing sparsity structures to the input variables. A particular disadvantage of ordinary PCA is that the principal components are usually linear combinations of all input variables. SPCA overcomes this disadvantage by finding components that are linear combinations of just a few input variables (SPCs). This means that some of the coefficients of the linear combinations defining the SPCs, called loadings, are equal to zero. The number of nonzero loadings is called the cardinality of the SPC. == Mathematical formulation == Consider a data matrix, $X$, where each of the $p$ columns represents an input variable, and each of the $n$ rows represents an independent sample from the data population. One assumes each column of $X$ has mean zero; otherwise one can subtract the column-wise mean from each element of $X$. Let $\Sigma = \frac{1}{n-1} X^{\top} X$ be the empirical covariance matrix of $X$, which has dimension $p \times p$. Given an integer $k$ with $1 \leq k \leq p$, the sparse PCA problem can be formulated as maximizing the variance along a direction represented by a vector $v \in \mathbb{R}^{p}$ while constraining its cardinality:
$$\begin{aligned}\max \quad & v^{T}\Sigma v\\ \text{subject to}\quad & \left\Vert v\right\Vert_{2}=1\\ & \left\Vert v\right\Vert_{0}\leq k.\end{aligned} \qquad \text{(Eq. 1)}$$
The first constraint specifies that $v$ is a unit vector.
In the second constraint, $\left\Vert v\right\Vert_{0}$ represents the $\ell_{0}$ pseudo-norm of $v$, which is defined as the number of its non-zero components. So the second constraint specifies that the number of non-zero components in $v$ is less than or equal to $k$, which is typically an integer that is much smaller than the dimension $p$. The optimal value of Eq. 1 is known as the $k$-sparse largest eigenvalue. If one takes $k=p$, the problem reduces to ordinary PCA, and the optimal value becomes the largest eigenvalue of the covariance matrix $\Sigma$. After finding the optimal solution $v$, one deflates $\Sigma$ to obtain a new matrix
$$\Sigma_{1}=\Sigma-(v^{T}\Sigma v)\,vv^{T},$$
and iterates this process to obtain further principal components. However, unlike PCA, sparse PCA cannot guarantee that different principal components are orthogonal. In order to achieve orthogonality, additional constraints must be enforced. The following equivalent definition is in matrix form. Letting $V$ be a $p \times p$ symmetric matrix, one can rewrite the sparse PCA problem as
$$\begin{aligned}\max \quad & Tr(\Sigma V)\\ \text{subject to}\quad & Tr(V)=1\\ & \Vert V\Vert_{0}\leq k^{2}\\ & Rank(V)=1,\ V\succeq 0.\end{aligned} \qquad \text{(Eq. 2)}$$
$Tr$ is the matrix trace, and $\Vert V\Vert_{0}$ represents the number of non-zero elements in the matrix $V$. The last line specifies that $V$ has matrix rank one and is positive semidefinite. The last line means that one has $V=vv^{T}$, so Eq. 2 is equivalent to Eq. 1. Moreover, the rank constraint in this formulation is actually redundant, and therefore sparse PCA can be cast as the following mixed-integer semidefinite program:
$$\begin{aligned}\max \quad & Tr(\Sigma V)\\ \text{subject to}\quad & Tr(V)=1\\ & \vert V_{i,i}\vert \leq z_{i},\ \forall i\in \{1,\ldots,p\},\\ & \vert V_{i,j}\vert \leq \tfrac{1}{2}z_{i},\ \forall i,j\in \{1,\ldots,p\}: i\neq j,\\ & V\succeq 0,\ z\in \{0,1\}^{p},\ \sum_{i}z_{i}\leq k.\end{aligned} \qquad \text{(Eq. 3)}$$
Because of the cardinality constraint, the maximization problem is hard to solve exactly, especially when the dimension $p$ is high. In fact, the sparse PCA problem in Eq. 1 is NP-hard in the strong sense. == Computational considerations == As with most sparse problems, variable selection in SPCA is a computationally intractable non-convex NP-hard problem; therefore, greedy sub-optimal algorithms are often employed to find solutions. Note also that SPCA introduces hyperparameters quantifying how strongly large parameter values are penalized. These might need tuning to achieve satisfactory performance, thereby adding to the total computational cost. == Algorithms for SPCA == Several alternative approaches to Eq. 1 have been proposed, including a regression framework, a penalized matrix decomposition framework, a convex relaxation/semidefinite programming framework, a generalized power method framework, an alternating maximization framework, forward-backward greedy search and exact methods using branch-and-bound techniques, a certifiably optimal branch-and-bound approach, a Bayesian formulation framework, and a certifiably optimal mixed-integer semidefinite branch-and-cut approach. The methodological and theoretical developments of sparse PCA, as well as its applications in scientific studies, were recently reviewed in a survey paper. === Notes on Semidefinite Programming Relaxation === It has been proposed that sparse PCA can be approximated by semidefinite programming (SDP).
If one drops the rank constraint and relaxes the cardinality constraint by a 1-norm convex constraint, one gets a semidefinite programming relaxation, which can be solved efficiently in polynomial time:
$$\begin{aligned}\max \quad & Tr(\Sigma V)\\ \text{subject to}\quad & Tr(V)=1\\ & \mathbf{1}^{T}|V|\mathbf{1}\leq k\\ & V\succeq 0.\end{aligned} \qquad \text{(Eq. 4)}$$
In the second constraint, $\mathbf{1}$ is a $p \times 1$ vector of ones, and $|V|$ is the matrix whose elements are the absolute values of the elements of $V$. The optimal solution $V$ to the relaxed problem Eq. 4 is not guaranteed to have rank one. In that case, $V$ can be truncated to retain only the dominant eigenvector. While the semidefinite program does not scale beyond 300 covariates, it has been shown that a second-order cone relaxation of the semidefinite relaxation is almost as tight and successfully solves problems with thousands of covariates. == Applications == === Financial Data Analysis === When ordinary PCA is applied to a dataset where each input variable represents a different asset, it may generate principal components that are weighted combinations of all the assets. In contrast, sparse PCA would produce principal components that are weighted combinations of only a few input assets, so one can easily interpret their meaning. Furthermore, if one uses a trading strategy based on these principal components, fewer assets imply lower transaction costs. === Biology === Consider a dataset where each input variable corresponds to a specific gene. Sparse PCA can produce a principal component that involves only a few genes, so researchers can focus on these specific genes for further analysis.
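To make the contrast with ordinary PCA concrete, the following sketch solves the cardinality-constrained problem of Eq. 1 by exhaustive search over supports and then applies the deflation step from the mathematical formulation. It is only an illustration under stated assumptions: the data are synthetic, the function names `sparse_pc_exhaustive` and `deflate` are hypothetical, and enumeration is feasible only for small $p$, since the problem is NP-hard in general. The search visits supports of size exactly $k$, which suffices because the largest eigenvalue of a principal submatrix cannot decrease when the support grows.

```python
from itertools import combinations

import numpy as np


def sparse_pc_exhaustive(sigma, k):
    """Solve Eq. 1 exactly by trying every support of size k (small p only)."""
    p = sigma.shape[0]
    best_value, best_v = -np.inf, None
    for support in combinations(range(p), k):
        idx = np.array(support)
        sub = sigma[np.ix_(idx, idx)]          # principal submatrix on the support
        eigvals, eigvecs = np.linalg.eigh(sub)  # eigenvalues in ascending order
        if eigvals[-1] > best_value:
            best_value = eigvals[-1]
            best_v = np.zeros(p)
            best_v[idx] = eigvecs[:, -1]        # unit vector, zero off the support
    return best_value, best_v


def deflate(sigma, v):
    """Deflation step: Sigma_1 = Sigma - (v^T Sigma v) v v^T."""
    return sigma - (v @ sigma @ v) * np.outer(v, v)


# Synthetic data (an assumption for illustration): n = 50 samples, p = 6 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
X = X - X.mean(axis=0)                 # center each column
sigma = X.T @ X / (X.shape[0] - 1)     # empirical covariance matrix

k = 2
val1, v1 = sparse_pc_exhaustive(sigma, k)
val2, v2 = sparse_pc_exhaustive(deflate(sigma, v1), k)

print("k-sparse largest eigenvalue:", val1)
print("largest eigenvalue (ordinary PCA):", np.linalg.eigvalsh(sigma)[-1])
print("non-zero loadings of first sparse component:", np.flatnonzero(v1))
print("non-zero loadings of second sparse component:", np.flatnonzero(v2))
```

The first sparse component uses at most `k` variables, at the cost of explaining no more variance than the leading ordinary principal component. The two sparse components returned here are not guaranteed to be orthogonal.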
=== High-dimensional Hypothesis Testing === Contemporary datasets often have the number of input variables ($p$) comparable with or even much larger than the number of samples ($n$). It has been shown that if $p/n$ does not converge to zero, the classical PCA is not consistent. In other words, if we let $k=p$ in Eq. 1, then the optimal value does not converge to the largest eigenvalue of the data population when the sample size $n \rightarrow \infty$, and the optimal solution does not converge to the direction of maximum variance. But sparse PCA can retain consistency even if $p \gg n$. The $k$-sparse largest eigenvalue (the optimal value of Eq. 1) can be used to discriminate an isometric model, where every direction has the same variance, from a spiked covariance model in a high-dimensional setting. Consider a hypothesis test where the null hypothesis specifies that the data $X$ are generated from a multivariate normal distribution with mean 0 and covariance equal to an identity matrix, and the alternative hypothesis specifies that the data $X$ are generated from a spiked model with signal strength $\theta$:
$$H_{0}: X\sim N(0,I_{p}),\quad H_{1}: X\sim N(0,I_{p}+\theta vv^{T}),$$
where $v \in \mathbb{R}^{p}$
