AI Email Response Generator Free

AI Email Response Generator Free — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Spotify Kids

    Spotify Kids

    Spotify Kids is a Swedish kid-friendly Music streaming service developed by Spotify. It offers curated content for children, including music, audiobooks, lullabies, and bedtime stories, while providing their parents with parental controls. The service is only available to subscribers to Spotify's Premium Family subscription plan. == Function == Spotify Kids is a Swedish Kid-friendly Music Streaming Service that allows children to browse Spotify with parental controls. Using the app, parents can view their children's listening history, block specific songs, and share playlists with their children. The app also includes sing-along songs, playlists designed for young children, and curated audiobooks, lullabies, and bedtime stories. Access is included in Spotify's Premium Family subscription plan, and is exclusive to subscribers to the plan. Users can configure the app for a specific age group upon first launch. The playlists on Spotify Kids are curated by groups including Discovery Kids, Nickelodeon, Universal Pictures, and The Walt Disney Company. All content on the Spotify Kids app is curated by editors. As of March 2021, there were roughly 8,000 songs available on the platform. The design of the Spotify Kids app is colorful, and user interface varies depending on the age group for which the app is configured. Spotify Kids is designed to comply with consent and data collection regulations for apps used by children. TechCrunch explains that it is "designed on a grand scale to drive subscriptions to Spotify's top-tier $14.99-per-month Premium Family Plan." == Release == After being beta tested in Ireland in October 2019, it was released as a beta across the United Kingdom on February 11, 2020. It was later released in Sweden, Denmark, Australia, New Zealand, Mexico, Argentina, and Brazil. On March 31, 2021, it was made available in France, Canada, and the United States.

    Read more →
  • Variable kernel density estimation

    Variable kernel density estimation

    In statistics, adaptive or "variable-bandwidth" kernel density estimation is a form of kernel density estimation in which the size of the kernels used in the estimate are varied depending upon either the location of the samples or the location of the test point. It is a particularly effective technique when the sample space is multi-dimensional. == Rationale == Given a set of samples, { x → i } {\displaystyle \lbrace {\vec {x}}_{i}\rbrace } , we wish to estimate the density, P ( x → ) {\displaystyle P({\vec {x}})} , at a test point, x → {\displaystyle {\vec {x}}} : P ( x → ) ≈ W n h D {\displaystyle P({\vec {x}})\approx {\frac {W}{nh^{D}}}} W = ∑ i = 1 n w i {\displaystyle W=\sum _{i=1}^{n}w_{i}} w i = K ( x → − x → i h ) {\displaystyle w_{i}=K\left({\frac {{\vec {x}}-{\vec {x}}_{i}}{h}}\right)} where n is the number of samples, K is the "kernel", h is its width and D is the number of dimensions in x → {\displaystyle {\vec {x}}} . The kernel can be thought of as a simple, linear filter. Using a fixed filter width may mean that in regions of low density, all samples will fall in the tails of the filter with very low weighting, while regions of high density will find an excessive number of samples in the central region with weighting close to unity. To fix this problem, we vary the width of the kernel in different regions of the sample space. There are two methods of doing this: balloon and pointwise estimation. In a balloon estimator, the kernel width is varied depending on the location of the test point. In a pointwise estimator, the kernel width is varied depending on the location of the sample. For multivariate estimators, the parameter, h, can be generalized to vary not just the size, but also the shape of the kernel. This more complicated approach will not be covered here. == Balloon estimators == A common method of varying the kernel width is to make it inversely proportional to the density at the test point: h = k [ n P ( x → ) ] 1 / D {\displaystyle h={\frac {k}{\left[nP({\vec {x}})\right]^{1/D}}}} where k is a constant. If we back-substitute the estimated PDF, and assuming a Gaussian kernel function, we can show that W is a constant: W = k D ( 2 π ) D / 2 {\displaystyle W=k^{D}(2\pi )^{D/2}} A similar derivation holds for any kernel whose normalising function is of the order hD, although with a different constant factor in place of the (2 π)D/2 term. This produces a generalization of the k-nearest neighbour algorithm. That is, a uniform kernel function will return the KNN technique. There are two components to the error: a variance term and a bias term. The variance term is given as: e 1 = P ∫ K 2 n h D {\displaystyle e_{1}={\frac {P\int K^{2}}{nh^{D}}}} . The bias term is found by evaluating the approximated function in the limit as the kernel width becomes much larger than the sample spacing. By using a Taylor expansion for the real function, the bias term drops out: e 2 = h 2 n ∇ 2 P {\displaystyle e_{2}={\frac {h^{2}}{n}}\nabla ^{2}P} An optimal kernel width that minimizes the error of each estimate can thus be derived. == Use for statistical classification == The method is particularly effective when applied to statistical classification. There are two ways we can proceed: the first is to compute the PDFs of each class separately, using different bandwidth parameters, and then compare them as in Taylor. Alternatively, we can divide up the sum based on the class of each sample: P ( j , x → ) ≈ 1 n ∑ i = 1 , c i = j n w i {\displaystyle P(j,{\vec {x}})\approx {\frac {1}{n}}\sum _{i=1,c_{i}=j}^{n}w_{i}} where ci is the class of the ith sample. The class of the test point may be estimated through maximum likelihood.

    Read more →
  • Correspondence analysis

    Correspondence analysis

    Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. In a manner similar to principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form. Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table. As such it is a technique from the field of multivariate ordination. Since the variant of CA described here can be applied either with a focus on the rows or on the columns it should in fact be called simple (symmetric) correspondence analysis. It is traditionally applied to the contingency table of a pair of nominal variables where each cell contains either a count or a zero value. If more than two categorical variables are to be summarized, a variant called multiple correspondence analysis should be chosen instead. CA may also be applied to binary data given the presence/absence coding represents simplified count data i.e. a 1 describes a positive count and 0 stands for a count of zero. Depending on the scores used CA preserves the chi-square distance between either the rows or the columns of the table. Because CA is a descriptive technique, it can be applied to tables regardless of a significant chi-squared test. Although the χ 2 {\displaystyle \chi ^{2}} statistic used in inferential statistics and the chi-square distance are computationally related they should not be confused since the latter works as a multivariate statistical distance measure in CA while the χ 2 {\displaystyle \chi ^{2}} statistic is in fact a scalar not a metric. == Details == Like principal components analysis, correspondence analysis creates orthogonal components (or axes) and, for each item in a table i.e. for each row, a set of scores (sometimes called factor scores, see Factor analysis). Correspondence analysis is performed on the data table, conceived as matrix C of size m × n where m is the number of rows and n is the number of columns. In the following mathematical description of the method capital letters in italics refer to a matrix while letters in italics refer to vectors. Understanding the following computations requires knowledge of matrix algebra. === Preprocessing === Before proceeding to the central computational step of the algorithm, the values in matrix C have to be transformed. First compute a set of weights for the columns and the rows (sometimes called masses), where row and column weights are given by the row and column vectors, respectively: w m = 1 n C C 1 , w n = 1 n C 1 T C . {\displaystyle w_{m}={\frac {1}{n_{C}}}C\mathbf {1} ,\quad w_{n}={\frac {1}{n_{C}}}\mathbf {1} ^{T}C.} Here n C = ∑ i = 1 n ∑ j = 1 m C i j {\displaystyle n_{C}=\sum _{i=1}^{n}\sum _{j=1}^{m}C_{ij}} is the sum of all cell values in matrix C, or short the sum of C, and 1 {\displaystyle \mathbf {1} } is a column vector of ones with the appropriate dimension. Put in simple words, w m {\displaystyle w_{m}} is just a vector whose elements are the row sums of C divided by the sum of C, and w n {\displaystyle w_{n}} is a vector whose elements are the column sums of C divided by the sum of C. The weights are transformed into diagonal matrices W m = diag ⁡ ( 1 / w m ) {\displaystyle W_{m}=\operatorname {diag} (1/{\sqrt {w_{m}}})} and W n = diag ⁡ ( 1 / w n ) {\displaystyle W_{n}=\operatorname {diag} (1/{\sqrt {w_{n}}})} where the diagonal elements of W n {\displaystyle W_{n}} are 1 / w n {\displaystyle 1/{\sqrt {w_{n}}}} and those of W m {\displaystyle W_{m}} are 1 / w m {\displaystyle 1/{\sqrt {w_{m}}}} respectively i.e. the vector elements are the inverses of the square roots of the masses. The off-diagonal elements are all 0. Next, compute matrix P {\displaystyle P} by dividing C {\displaystyle C} by its sum P = 1 n C C . {\displaystyle P={\frac {1}{n_{C}}}C.} In simple words, Matrix P {\displaystyle P} is just the data matrix (contingency table or binary table) transformed into portions i.e. each cell value is just the cell portion of the sum of the whole table. Finally, compute matrix S {\displaystyle S} , sometimes called the matrix of standardized residuals, by matrix multiplication as S = W m ( P − w m w n ) W n {\displaystyle S=W_{m}(P-w_{m}w_{n})W_{n}} Note, the vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are combined in an outer product resulting in a matrix of the same dimensions as P {\displaystyle P} . In words the formula reads: matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is subtracted from matrix P {\displaystyle P} and the resulting matrix is scaled (weighted) by the diagonal matrices W m {\displaystyle W_{m}} and W n {\displaystyle W_{n}} . Multiplying the resulting matrix by the diagonal matrices is equivalent to multiply the i-th row (or column) of it by the i-th element of the diagonal of W m {\displaystyle W_{m}} or W n {\displaystyle W_{n}} , respectively. === Interpretation of preprocessing === The vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are the row and column masses or the marginal probabilities for the rows and columns, respectively. Subtracting matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} from matrix P {\displaystyle P} is the matrix algebra version of double centering the data. Multiplying this difference by the diagonal weighting matrices results in a matrix containing weighted deviations from the origin of a vector space. This origin is defined by matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} . In fact matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is identical with the matrix of expected frequencies in the chi-squared test. Therefore S {\displaystyle S} is computationally related to the independence model used in that test. But since CA is not an inferential method the term independence model is inappropriate here. === Orthogonal components === The table S {\displaystyle S} is then decomposed by a singular value decomposition as S = U Σ V ∗ {\displaystyle S=U\Sigma V^{}\,} where U {\displaystyle U} and V {\displaystyle V} are the left and right singular vectors of S {\displaystyle S} and Σ {\displaystyle \Sigma } is a square diagonal matrix with the singular values σ i {\displaystyle \sigma _{i}} of S {\displaystyle S} on the diagonal. Σ {\displaystyle \Sigma } is of dimension p ≤ ( min ( m , n ) − 1 ) {\displaystyle p\leq (\min(m,n)-1)} hence U {\displaystyle U} is of dimension m×p and V {\displaystyle V} is of n×p. As orthonormal vectors U {\displaystyle U} and V {\displaystyle V} fulfill U ∗ U = V ∗ V = I {\displaystyle U^{}U=V^{}V=I} . In other words, the multivariate information that is contained in C {\displaystyle C} as well as in S {\displaystyle S} is now distributed across two (coordinate) matrices U {\displaystyle U} and V {\displaystyle V} and a diagonal (scaling) matrix Σ {\displaystyle \Sigma } . The vector space defined by them has as number of dimensions p, that is the smaller of the two values, number of rows and number of columns, minus 1. === Inertia === While a principal component analysis may be said to decompose the (co)variance, and hence its measure of success is the amount of (co-)variance covered by the first few PCA axes - measured in eigenvalue -, a CA works with a weighted (co-)variance which is called inertia. The sum of the squared singular values is the total inertia I {\displaystyle \mathrm {I} } of the data table, computed as I = ∑ i = 1 p σ i 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{p}\sigma _{i}^{2}.} The total inertia I {\displaystyle \mathrm {I} } of the data table can also computed directly from S {\displaystyle S} as I = ∑ i = 1 n ∑ j = 1 m s i j 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{n}\sum _{j=1}^{m}s_{ij}^{2}.} The amount of inertia covered by the i-th set of singular vectors is ι i {\displaystyle \iota _{i}} , the principal inertia. The higher the portion of inertia covered by the first few singular vectors i.e. the larger the sum of the principal inertiae in comparison to the total inertia, the more successful a CA is. Therefore, all principal inertia values are expressed as portion ϵ i {\displaystyle \epsilon _{i}} of the total inertia ϵ i = σ i 2 / ∑ i = 1 p σ i 2 {\displaystyle \epsilon _{i}=\sigma _{i}^{2}/\sum _{i=1}^{p}\sigma _{i}^{2}} and are presented in the form of a scree plot. In fact a scree plot is just a bar plot of all principal inertia portions ϵ i {\displaystyle \epsilon _{i}} . === Coordinates === To transform the singular vectors to coordinates which preserve the chi-square distances between rows or columns an additional weighting step is necessary. The resulting coordinates are called principal coordinates in CA text books. If principal coordinates are used for

    Read more →
  • VIGRA

    VIGRA

    VIGRA is the abbreviation for "Vision with Generic Algorithms". It is a free open-source computer vision library which focuses on customizable algorithms and data structures. VIGRA component can be easily adapted to specific needs of target application without compromising execution speed, by using template techniques similar to those in the C++ Standard Template Library. == Features == VIGRA is cross-platform, with working builds on Microsoft Windows, Mac OS X, Linux, and OpenBSD. Since version 1.7.1, VIGRA provides Python bindings based on numpy framework. == History == VIGRA was originally designed and implemented by scientists at University of Hamburg faculty of computer science; its core maintainers are now working at Heidelberg Collaboratory for Image Processing (HCI) University of Heidelberg. In the meantime, many developers have contributed to the project. == Application == CellCognition and ilastik uses VIGRA computer vision library. OpenOffice.org uses VIGRA as part of its headless software rendering backend; LibreOffice does so until version 5.2.

    Read more →
  • Edits (app)

    Edits (app)

    Edits is an American photo and short form video editing software service owned by Meta Platforms. It allows users to create videos and edit them by using features like green screens, and AI animation, and also provides real-time statistics to Instagram creators to track their accounts. Accounts directly from Instagram can be imported, and videos can be exported vice-versa. It is available solely on iOS and Android. On Apple, it supports over 32 different languages, including French, Spanish, and Chinese. It has been noted by critics as a direct competitor for apps like CapCut, owned by Chinese brand ByteDance. The Instagram head, Adam Mosseri, also acknowledged these similarities. Launched on April 22 for both iOS and Android. It received over 5M+ users on Apple and Android combined in its first 4 days since its launch. == History == On January 19, 2025, following the ban of all ByteDance Apps from the Google Play Store, and App Store, Instagram head Adam Mosseri announced on Threads that they would be launching the app in February for iOS, followed by an Android counterpart. He said the app is working with select people to test its features. In a separate post, he emphasized that the app is "more for creators than casual video makers". == Features == Edits contains many similar features to other competition of video editors like KineMaster, Inshot, and CapCut. When creating a video, users have the option to export in resolution of HD, 4K, and 2K, along with having HDR and SDR support. Like many traditional video editing software, it includes a timeline, and basic undo-redo buttons. On the bottom bar, 7 tabs for editing exist, namely the Split, Volume, Adjust, Speed, Delete, Filters, Green Screen, Voice FX, Extract Audio, Mirror, Slip, Replace and Duplicate bars. Basic features, like splitting, and adjusting speed and volume of clips are present, along with more advanced Green Screens, and AI features. Being a mobile video editor app, Edits also has drag-and-drop features to ease customer usage. Users have the ability to record videos directly within the app. This feature allows users to create content without needing extra software or devices. They can choose from several focal lengths, which affect how close or wide the shot appears. The app also supports different frame rates. Users have the ability to record videos directly within the app. This feature allows users to create content without needing extra software or devices. Once users are done filming your clips, they can simply transfer them into a project to start editing immediately. Upcoming features for the app include Keyframes, AI-powered modification, Collaboration, and Enhanced creativity. == Reception == Since its release, it received over 5 million downloads in 4 days. Critically, the app received great rankings from many. From users, the app received an average of 4.45 stars over Google Play Store and App Store in the first few days, with Google Play Store receiving the least stars. As in reviews, it was received mixed by the public. Many people praised the smoothness and intuivity of the app. "The app is more than just a basic editor, offering a full suite of creative tools, including a dedicated tab for inspiration and trending audio, as well as a tab for managing drafts," said a blogger. Some users were disappointed with the range of editing tools, some users have noted that it could benefit from more transition options between clips. Some even reported crashing between clips.

    Read more →
  • Exploratory blockmodeling

    Exploratory blockmodeling

    Exploratory blockmodeling is an (inductive) approach (or a group of approaches) in blockmodeling regarding the specification of an ideal blockmodel. This approach, also known as hypotheses-generating, is the simplest approach, as it "merely involves the definition of the block types permitted as well as of the number of clusters." With this approach, researcher usually defines the best possible blockmodel, which then represent the base for the analysis of the whole network. This approach is usually based on: previous analyses and theoretical considerations, using stricker blockmodel and block types, where the structural equivalence is stricker than the regular equivalence and using smaller number of classes. The opposite approach is called a confirmatory blockmodeling.

    Read more →
  • Mixture model

    Mixture model

    In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are used for clustering, under the name model-based clustering, and also for density estimation. Mixture models should not be confused with models for compositional data, i.e., data whose components are constrained to sum to a constant value (1, 100%, etc.). However, compositional models can be thought of as mixture models, where members of the population are sampled at random. Conversely, mixture models can be thought of as compositional models, where the total size reading population has been normalized to 1. == Structure == === General mixture model === A typical finite-dimensional mixture model is a hierarchical model consisting of the following components: N random variables that are observed, each distributed according to a mixture of K components, with the components belonging to the same parametric family of distributions (e.g., all normal, all Zipfian, etc.) but with different parameters. However, it is also possible to have a finite mixture model where each component belongs to a different parametric family of distributions, for example, a mixture of a multivariate normal distribution and a generalized hyperbolic distribution. N random latent variables specifying the identity of the mixture component of each observation, each distributed according to a K-dimensional categorical distribution A set of K mixture weights, which are probabilities that sum to 1. A set of K parameters, each specifying the parameter of the corresponding mixture component. In many cases, each "parameter" is actually a set of parameters. For example, if the mixture components are Gaussian distributions, there will be a mean and variance for each component. If the mixture components are categorical distributions (e.g., when each observation is a token from a finite alphabet of size V), there will be a vector of V probabilities summing to 1. In addition, in a Bayesian setting, the mixture weights and parameters will themselves be random variables, and prior distributions will be placed over the variables. In such a case, the weights are typically viewed as a K-dimensional random vector drawn from a Dirichlet distribution (the conjugate prior of the categorical distribution), and the parameters will be distributed according to their respective conjugate priors. Mathematically, a basic parametric mixture model can be described as follows: K = number of mixture components N = number of observations θ i = 1 … K = parameter of distribution of observation associated with component i ϕ i = 1 … K = mixture weight, i.e., prior probability of a particular component i ϕ = K -dimensional vector composed of all the individual ϕ 1 … K ; must sum to 1 z i = 1 … N = component of observation i x i = 1 … N = observation i F ( x | θ ) = probability distribution of an observation, parametrized on θ z i = 1 … N ∼ Categorical ⁡ ( ϕ ) x i = 1 … N | z i = 1 … N ∼ F ( θ z i ) {\displaystyle {\begin{array}{lcl}K&=&{\text{number of mixture components}}\\N&=&{\text{number of observations}}\\\theta _{i=1\dots K}&=&{\text{parameter of distribution of observation associated with component }}i\\\phi _{i=1\dots K}&=&{\text{mixture weight, i.e., prior probability of a particular component }}i\\{\boldsymbol {\phi }}&=&K{\text{-dimensional vector composed of all the individual }}\phi _{1\dots K}{\text{; must sum to 1}}\\z_{i=1\dots N}&=&{\text{component of observation }}i\\x_{i=1\dots N}&=&{\text{observation }}i\\F(x|\theta )&=&{\text{probability distribution of an observation, parametrized on }}\theta \\z_{i=1\dots N}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}|z_{i=1\dots N}&\sim &F(\theta _{z_{i}})\end{array}}} In a Bayesian setting, all parameters are associated with random variables, as follows: K , N = as above θ i = 1 … K , ϕ i = 1 … K , ϕ = as above z i = 1 … N , x i = 1 … N , F ( x | θ ) = as above α = shared hyperparameter for component parameters β = shared hyperparameter for mixture weights H ( θ | α ) = prior probability distribution of component parameters, parametrized on α θ i = 1 … K ∼ H ( θ | α ) ϕ ∼ S y m m e t r i c - D i r i c h l e t K ⁡ ( β ) z i = 1 … N | ϕ ∼ Categorical ⁡ ( ϕ ) x i = 1 … N | z i = 1 … N , θ i = 1 … K ∼ F ( θ z i ) {\displaystyle {\begin{array}{lcl}K,N&=&{\text{as above}}\\\theta _{i=1\dots K},\phi _{i=1\dots K},{\boldsymbol {\phi }}&=&{\text{as above}}\\z_{i=1\dots N},x_{i=1\dots N},F(x|\theta )&=&{\text{as above}}\\\alpha &=&{\text{shared hyperparameter for component parameters}}\\\beta &=&{\text{shared hyperparameter for mixture weights}}\\H(\theta |\alpha )&=&{\text{prior probability distribution of component parameters, parametrized on }}\alpha \\\theta _{i=1\dots K}&\sim &H(\theta |\alpha )\\{\boldsymbol {\phi }}&\sim &\operatorname {Symmetric-Dirichlet} _{K}(\beta )\\z_{i=1\dots N}|{\boldsymbol {\phi }}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}|z_{i=1\dots N},\theta _{i=1\dots K}&\sim &F(\theta _{z_{i}})\end{array}}} This characterization uses F and H to describe arbitrary distributions over observations and parameters, respectively. Typically H will be the conjugate prior of F. The two most common choices of F are Gaussian aka "normal" (for real-valued observations) and categorical (for discrete observations). Other common possibilities for the distribution of the mixture components are: Binomial distribution, for the number of "positive occurrences" (e.g., successes, yes votes, etc.) given a fixed number of total occurrences Multinomial distribution, similar to the binomial distribution, but for counts of multi-way occurrences (e.g., yes/no/maybe in a survey) Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs Poisson distribution, for the number of occurrences of an event in a given period of time, for an event that is characterized by a fixed rate of occurrence Exponential distribution, for the time before the next event occurs, for an event that is characterized by a fixed rate of occurrence Log-normal distribution, for positive real numbers that are assumed to grow exponentially, such as incomes or prices Multivariate normal distribution (aka multivariate Gaussian distribution), for vectors of correlated outcomes that are individually Gaussian-distributed Multivariate Student's t-distribution, for vectors of heavy-tailed correlated outcomes A vector of Bernoulli-distributed values, corresponding, e.g., to a black-and-white image, with each value representing a pixel; see the handwriting-recognition example below === Specific examples === ==== Gaussian mixture model ==== A typical non-Bayesian Gaussian mixture model looks like this: K , N = as above ϕ i = 1 … K , ϕ = as above z i = 1 … N , x i = 1 … N = as above θ i = 1 … K = { μ i = 1 … K , σ i = 1 … K 2 } μ i = 1 … K = mean of component i σ i = 1 … K 2 = variance of component i z i = 1 … N ∼ Categorical ⁡ ( ϕ ) x i = 1 … N ∼ N ( μ z i , σ z i 2 ) {\displaystyle {\begin{array}{lcl}K,N&=&{\text{as above}}\\\phi _{i=1\dots K},{\boldsymbol {\phi }}&=&{\text{as above}}\\z_{i=1\dots N},x_{i=1\dots N}&=&{\text{as above}}\\\theta _{i=1\dots K}&=&\{\mu _{i=1\dots K},\sigma _{i=1\dots K}^{2}\}\\\mu _{i=1\dots K}&=&{\text{mean of component }}i\\\sigma _{i=1\dots K}^{2}&=&{\text{variance of component }}i\\z_{i=1\dots N}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}&\sim &{\mathcal {N}}(\mu _{z_{i}},\sigma _{z_{i}}^{2})\end{array}}} A Bayesian version of a Gaussian mixture model is as follows: K , N = as above ϕ i = 1 … K , ϕ = as above z i = 1 … N , x i = 1 … N = as above θ i = 1 … K = { μ i = 1 … K , σ i = 1 … K 2 } μ i = 1 … K = mean of component i σ i = 1 … K 2 = variance of component i μ 0 , λ , ν , σ 0 2 = shared hyperparameters μ i = 1 … K ∼ N ( μ 0 , λ σ i 2 ) σ i = 1 … K 2 ∼ I n v e r s e - G a m m a ⁡ ( ν , σ 0 2 ) ϕ ∼ S y m m e t r i c - D i r i c h l e t K ⁡ ( β ) z i = 1 … N ∼ Categorical ⁡ ( ϕ ) x i = 1 … N ∼ N ( μ z i , σ z i 2 ) {\displaystyle {\begin{array}{lcl}K,N&=&{\text{as above}}\\\phi _{i=1\dots K},{\boldsymbol {\phi }}&=&{\text{as above}}\\z_{i=1\dots N},x_{i=1\dots N}&=&{\text{as above}}\\\theta _{i=1\

    Read more →
  • Latent and observable variables

    Latent and observable variables

    In statistics, latent variables (from Latin: present participle of lateo 'lie hidden') are variables that can only be inferred indirectly through a mathematical model from other observable variables that can be directly observed or measured. Such latent variable models are used in many disciplines, including engineering, medicine, ecology, physics, machine learning/artificial intelligence, natural language processing, bioinformatics, chemometrics, demography, economics, management, political science, psychology and the social sciences. Latent variables may correspond to aspects of physical reality. These could in principle be measured, but may not be for practical reasons. Among the earliest expressions of this idea is Francis Bacon's polemic the Novum Organum, itself a challenge to the more traditional logic expressed in Aristotle's Organon: But the latent process of which we speak, is far from being obvious to men’s minds, beset as they now are. For we mean not the measures, symptoms, or degrees of any process which can be exhibited in the bodies themselves, but simply a continued process, which, for the most part, escapes the observation of the senses. In this situation, the term hidden variables is commonly used, reflecting the fact that the variables are meaningful, but not observable. Other latent variables correspond to abstract concepts, like categories, behavioral or mental states, or data structures. The terms hypothetical variables or hypothetical constructs may be used in these situations. The use of latent variables can serve to reduce the dimensionality of data. Many observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data. In this sense, they serve a function similar to that of scientific theories. At the same time, latent variables link observable "sub-symbolic" data in the real world to symbolic data in the modeled world. == Examples == === Psychology === Latent variables, as created by factor analytic methods, generally represent "shared" variance, or the degree to which variables "move" together. Variables that have no correlation cannot result in a latent construct based on the common factor model. The "Big Five personality traits" have been inferred using factor analysis. extraversion spatial ability wisdom: “Two of the more predominant means of assessing wisdom include wisdom-related performance and latent variable measures.” Spearman's g, or the general intelligence factor in psychometrics === Economics === Examples of latent variables from the field of economics include quality of life, business confidence, morale, happiness and conservatism: these are all variables which cannot be measured directly. However, by linking these latent variables to other, observable variables, the values of the latent variables can be inferred from measurements of the observable variables. Quality of life is a latent variable which cannot be measured directly, so observable variables are used to infer quality of life. Observable variables to measure quality of life include wealth, employment, environment, physical and mental health, education, recreation and leisure time, and social belonging. === Medicine === Latent-variable methodology is used in many branches of medicine. A class of problems that naturally lend themselves to latent variables approaches are longitudinal studies where the time scale (e.g. age of participant or time since study baseline) is not synchronized with the trait being studied. For such studies, an unobserved time scale that is synchronized with the trait being studied can be modeled as a transformation of the observed time scale using latent variables. Examples of this include disease progression modeling and modeling of growth (see box). == Inferring latent variables == There exists a range of different model classes and methodology that make use of latent variables and allow inference in the presence of latent variables. Models include: linear mixed-effects models and nonlinear mixed-effects models Hidden Markov models Factor analysis Item response theory Analysis and inference methods include: Principal component analysis Instrumented principal component analysis Partial least squares regression Latent semantic analysis and probabilistic latent semantic analysis EM algorithms Metropolis–Hastings algorithm === Bayesian algorithms and methods === Bayesian statistics is often used for inferring latent variables. Latent Dirichlet allocation The Chinese restaurant process is often used to provide a prior distribution over assignments of objects to latent categories. The Indian buffet process is often used to provide a prior distribution over assignments of latent binary features to objects.

    Read more →
  • Web container

    Web container

    A web container (also known as a servlet container; and compare "webcontainer") is the component of a web server that interacts with Jakarta Servlets. A web container is responsible for managing the lifecycle of servlets, mapping a URL to a particular servlet and ensuring that the URL requester has the correct access-rights. A web container handles requests to servlets, Jakarta Server Pages (JSP) files, and other types of files that include server-side code. The Web container creates servlet instances, loads and unloads servlets, creates and manages request and response objects, and performs other servlet-management tasks. A web container implements the web component contract of the Jakarta EE architecture. This architecture specifies a runtime environment for additional web components, including security, concurrency, lifecycle management, transaction, deployment, and other services. == List of Servlet containers == The following is a list of notable applications which implement the Jakarta Servlet specification from Eclipse Foundation, divided depending on whether they are directly sold or not. === Open source Web containers === Apache Tomcat (formerly Jakarta Tomcat) is an open source web container available under the Apache Software License. Apache Tomcat 6 and above are operable as general application container (prior versions were web containers only) Apache Geronimo is a full Java EE 6 implementation by Apache Software Foundation. Enhydra, from Lutris Technologies. GlassFish from Eclipse Foundation (an application server, but includes a web container). Jetty, from the Eclipse Foundation. Also supports SPDY and WebSocket protocols. Open Liberty, from IBM, is a fully compliant Jakarta EE server Virgo from Eclipse Foundation provides modular, OSGi based web containers implemented using embedded Tomcat and Jetty. Virgo is available under the Eclipse Public License. WildFly (formerly JBoss Application Server) is a full Java EE implementation by Red Hat, division JBoss. === Commercial Web containers === iPlanet Web Server, from Oracle. JBoss Enterprise Application Platform from Red Hat, division JBoss is subscription-based/open-source Jakarta EE-based application server. WebLogic Application Server, from Oracle Corporation (formerly developed by BEA Systems). Orion Application Server, from IronFlare. Resin Pro, from Caucho Technology. IBM WebSphere Application Server. SAP NetWeaver.

    Read more →
  • Non-negative matrix factorization

    Non-negative matrix factorization

    Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. NMF finds applications in such fields as astronomy, computer vision, document clustering, missing data imputation, chemometrics, audio signal processing, recommender systems, and bioinformatics. == History == In chemometrics non-negative matrix factorization has a long history under the name "self modeling curve resolution". In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Also early work on non-negative matrix factorizations was performed by a Finnish group of researchers in the 1990s under the name positive matrix factorization. It became more widely known as non-negative matrix factorization after Lee and Seung investigated the properties of the algorithm and published some simple and useful algorithms for two types of factorizations. == Background == Let matrix V be the product of the matrices W and H, V = W H . {\displaystyle \mathbf {V} =\mathbf {W} \mathbf {H} \,.} Matrix multiplication can be implemented as computing the column vectors of V as linear combinations of the column vectors in W using coefficients supplied by columns of H. That is, each column of V can be computed as follows: v i = W h i , {\displaystyle \mathbf {v} _{i}=\mathbf {W} \mathbf {h} _{i}\,,} where vi is the i-th column vector of the product matrix V and hi is the i-th column vector of the matrix H. When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix and it is this property that forms the basis of NMF. NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix then p can be significantly less than both m and n. Here is an example based on a text-mining application: Let the input matrix (the matrix to be factored) be V with 10000 rows and 500 columns where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words. It follows that a column vector v in V represents a document. Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns. The product of W and H is a matrix with 10000 rows and 500 columns, the same shape as the input matrix V and, if the factorization worked, it is a reasonable approximation to the input matrix V. From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H. This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. NMF generates these features. It is useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words where each word's cell value defines the word's rank in the feature: The higher a word's cell value the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document with a cell value defining the document's rank for a feature. We can now reconstruct a document (column vector) from our input matrix by a linear combination of our features (column vectors in W) where each feature is weighted by the feature's cell value from the document's column in H. == Clustering property == NMF has an inherent clustering property, i.e., it automatically clusters the columns of input data V = ( v 1 , … , v n ) {\displaystyle \mathbf {V} =(v_{1},\dots ,v_{n})} . More specifically, the approximation of V {\displaystyle \mathbf {V} } by V ≃ W H {\displaystyle \mathbf {V} \simeq \mathbf {W} \mathbf {H} } is achieved by finding W {\displaystyle W} and H {\displaystyle H} that minimize the error function (using the Frobenius norm) ‖ V − W H ‖ F , {\displaystyle \left\|V-WH\right\|_{F},} subject to W ≥ 0 , H ≥ 0. {\displaystyle W\geq 0,H\geq 0.} , If we furthermore impose an orthogonality constraint on H {\displaystyle \mathbf {H} } , i.e. H H T = I {\displaystyle \mathbf {H} \mathbf {H} ^{T}=I} , then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed H {\displaystyle H} gives the cluster membership, i.e., if H k j > H i j {\displaystyle \mathbf {H} _{kj}>\mathbf {H} _{ij}} for all i ≠ k, this suggests that the input data v j {\displaystyle v_{j}} belongs to k {\displaystyle k} -th cluster. The computed W {\displaystyle W} gives the cluster centroids, i.e., the k {\displaystyle k} -th column gives the cluster centroid of k {\displaystyle k} -th cluster. This centroid's representation can be significantly enhanced by convex NMF. When the orthogonality constraint H H T = I {\displaystyle \mathbf {H} \mathbf {H} ^{T}=I} is not explicitly imposed, the orthogonality holds to a large extent, and the clustering property holds too. When the error function to be used is Kullback–Leibler divergence, NMF is identical to the probabilistic latent semantic analysis (PLSA), a popular document clustering method. == Types == === Approximate non-negative matrix factorization === Usually the number of columns of W and the number of rows of H in NMF are selected so the product WH will become an approximation to V. The full decomposition of V then amounts to the two non-negative matrices W and H as well as a residual U, such that: V = WH + U. The elements of the residual matrix can either be negative or positive. When W and H are smaller than V they become easier to store and manipulate. Another reason for factorizing V into smaller matrices W and H, is that if one's goal is to approximately represent the elements of V by significantly less data, then one has to infer some latent structure in the data. === Convex non-negative matrix factorization === In standard NMF, matrix factor W ∈ R+m × k, i.e., W can be anything in that space. Convex NMF restricts the columns of W to convex combinations of the input data vectors ( v 1 , … , v n ) {\displaystyle (v_{1},\dots ,v_{n})} . This greatly improves the quality of data representation of W. Furthermore, the resulting matrix factor H becomes more sparse and orthogonal. === Nonnegative rank factorization === In case the nonnegative rank of V is equal to its actual rank, V = WH is called a nonnegative rank factorization (NRF). The problem of finding the NRF of V, if it exists, is known to be NP-hard. === Different cost functions and regularizations === There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH and possibly by regularization of the W and/or H matrices. Two simple divergence functions studied by Lee and Seung are the squared error (or Frobenius norm) and an extension of the Kullback–Leibler divergence to positive matrices (the original Kullback–Leibler divergence is defined on probability distributions). Each divergence leads to a different NMF algorithm, usually minimizing the divergence using iterative update rules. The factorization problem in the squared error version of NMF may be stated as: Given a matrix V {\displaystyle \mathbf {V} } find nonnegative matrices W and H that minimize the function F ( W , H ) = ‖ V − W H ‖ F 2 {\displaystyle F(\mathbf {W} ,\mathbf {H} )=\left\|\mathbf {V} -\mathbf {WH} \right\|_{F}^{2}} Another type of NMF for images is based on the total variation norm. When L1 regularization (akin to Lasso) is added to NMF with the mean squared error cost function, the resulting problem may be called non-negative sparse coding due to the similarity to the sparse coding problem, although it may also still be referred to as NMF. === Online NMF === Many standard NMF algorithms analyze all the data together; i.e., the whole matrix is available from the start. This may be unsatisfactory in applications where there are too many data to fit into memory or where the data are provided in streaming fashion. One such use is for collaborative filtering in recommendation systems, where there may be many users and many items to recommend, and it would be inefficient to recalculate everything when one user or one item is added to the system. The cost function for o

    Read more →
  • U-matrix

    U-matrix

    The U-matrix (unified distance matrix) is a representation of a self-organizing map (SOM) where the Euclidean distance between the codebook vectors of neighboring neurons is depicted in a grayscale image. This image is used to visualize the data in a high-dimensional space using a 2D image. == Construction procedure == Once the SOM is trained using the input data, the final map is not expected to have any twists. If the map is twist-free, the distance between the codebook vectors of neighboring neurons gives an approximation of the distance between different parts of the underlying data. When such distances are depicted in a grayscale image, light colors depict closely spaced node codebook vectors and darker colors indicate more widely separated node codebook vectors. Thus, groups of light colors can be considered as clusters, and the dark parts as the boundaries between the clusters. This representation can help to visualize the clusters in the high-dimensional spaces, or to automatically recognize them using relatively simple image processing techniques.

    Read more →
  • Random forest

    Random forest

    Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees. Random forests correct for decision trees' habit of overfitting to their training set. The first algorithm for random decision forests was created in 1995 by Tin Kam Ho using the random subspace method, which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg. An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.). The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho and later independently by Amit and Geman in order to construct a collection of decision trees with controlled variance. == History == The general method of random decision forests was first proposed by Salzberg and Heath in 1993, with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995. Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination. The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho was also influential in the design of random forests. This method grows a forest of trees, and introduces variation among the trees by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure, rather than a deterministic optimization was first introduced by Thomas G. Dietterich. The proper introduction of random forests was made in a paper by Leo Breiman, that has become one of the world's most cited papers. This paper describes a method of building a forest of uncorrelated trees using a CART like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular: Using out-of-bag error as an estimate of the generalization error. Measuring variable importance through permutation. The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation. == Algorithm == === Preliminaries: decision tree learning === Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate". In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model. === Bagging === The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples: After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x': f ^ = 1 B ∑ b = 1 B f b ( x ′ ) {\displaystyle {\hat {f}}={\frac {1}{B}}\sum _{b=1}^{B}f_{b}(x')} or by taking the plurality vote in the case of classification trees. This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets. Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x′: σ = ∑ b = 1 B ( f b ( x ′ ) − f ^ ) 2 B − 1 . {\displaystyle \sigma ={\sqrt {\frac {\sum _{b=1}^{B}(f_{b}(x')-{\hat {f}})^{2}}{B-1}}}.} The number B of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. B can be optimized using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit. === From bagging to random forests === The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho. Typically, for a classification problem with p {\displaystyle p} features, p {\displaystyle {\sqrt {p}}} (rounded down) features are used in each split. For regression problems the inventors recommend p / 3 {\displaystyle p/3} (rounded down) with a minimum node size of 5 as the default. In practice, the best values for these parameters should be tuned on a case-to-case basis for every problem. === ExtraTrees === Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: for each feature under consideration, a number of random cut-points are selected, instead of computing the locally optimal cut-point (based on, e.g., information gain or the Gini impurity). The values are chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly chosen splits, the split that yields the highest score is chosen to split the node. Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are p {\displaystyle {\sqrt {p}}} for classification and p {\displaystyle p} for regression, where p {\displaystyle p} is the number of features in the model. === Random forests for high-dimensional data === The basic random forest procedure may

    Read more →
  • Core FTP

    Core FTP

    Core FTP LE is a freeware secure FTP client for Windows, developed by CoreFTP.com. Features include FTP, SSL/TLS, SFTP via SSH, and HTTP/HTTPS support. Secure FTP clients encrypt account information and data transferred across the internet, protecting data from being seen, or sniffed across networks. Core FTP is a traditional FTP client with local files displayed on the left, remote files on the right. Core FTP Server is a secure FTP server for Windows, developed by CoreFTP.com, starting in 2010. == Licensing == CoreFTP LE is free for personal, educational, non-profit, and business use.

    Read more →
  • Elastic net regularization

    Elastic net regularization

    In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. Nevertheless, elastic net regularization is typically more accurate than both methods with regard to reconstruction. == Specification == The elastic net method overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method which uses a penalty function based on ‖ β ‖ 1 = ∑ j = 1 p | β j | . {\displaystyle \|\beta \|_{1}=\textstyle \sum _{j=1}^{p}|\beta _{j}|.} Use of this penalty function has several limitations. For example, in the "large p, small n" case (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates. Also if there is a group of highly correlated variables, then the LASSO tends to select one variable from a group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part ( ‖ β ‖ 2 {\displaystyle \|\beta \|^{2}} ) to the penalty, which when used alone is ridge regression (known also as Tikhonov regularization). The estimates from the elastic net method are defined by β ^ ≡ argmin β ( ‖ y − X β ‖ 2 + λ 2 ‖ β ‖ 2 + λ 1 ‖ β ‖ 1 ) . {\displaystyle {\hat {\beta }}\equiv {\underset {\beta }{\operatorname {argmin} }}(\|y-X\beta \|^{2}+\lambda _{2}\|\beta \|^{2}+\lambda _{1}\|\beta \|_{1}).} The quadratic penalty term makes the loss function strongly convex, and it therefore has a unique minimum. The elastic net method includes the LASSO and ridge regression: in other words, each of them is a special case where λ 1 = λ , λ 2 = 0 {\displaystyle \lambda _{1}=\lambda ,\lambda _{2}=0} or λ 1 = 0 , λ 2 = λ {\displaystyle \lambda _{1}=0,\lambda _{2}=\lambda } . Meanwhile, the naive version of elastic net method finds an estimator in a two-stage procedure : first for each fixed λ 2 {\displaystyle \lambda _{2}} it finds the ridge regression coefficients, and then does a LASSO type shrinkage. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To improve the prediction performance, sometimes the coefficients of the naive version of elastic net is rescaled by multiplying the estimated coefficients by ( 1 + λ 2 ) {\displaystyle (1+\lambda _{2})} . Examples of where the elastic net method has been applied are: Support vector machine Metric learning Portfolio optimization Cancer prognosis == Reduction to support vector machine == It was proven in 2014 that the elastic net can be reduced to the linear support vector machine. A similar reduction was previously proven for the LASSO in 2014. The authors showed that for every instance of the elastic net, an artificial binary classification problem can be constructed such that the hyper-plane solution of a linear support vector machine (SVM) is identical to the solution β {\displaystyle \beta } (after re-scaling). The reduction immediately enables the use of highly optimized SVM solvers for elastic net problems. It also enables the use of GPU acceleration, which is often already used for large-scale SVM solvers. The reduction is a simple transformation of the original data and regularization constants X ∈ R n × p , y ∈ R n , λ 1 ≥ 0 , λ 2 ≥ 0 {\displaystyle X\in {\mathbb {R} }^{n\times p},y\in {\mathbb {R} }^{n},\lambda _{1}\geq 0,\lambda _{2}\geq 0} into new artificial data instances and a regularization constant that specify a binary classification problem and the SVM regularization constant X 2 ∈ R 2 p × n , y 2 ∈ { − 1 , 1 } 2 p , C ≥ 0. {\displaystyle X_{2}\in {\mathbb {R} }^{2p\times n},y_{2}\in \{-1,1\}^{2p},C\geq 0.} Here, y 2 {\displaystyle y_{2}} consists of binary labels − 1 , 1 {\displaystyle {-1,1}} . When 2 p > n {\displaystyle 2p>n} it is typically faster to solve the linear SVM in the primal, whereas otherwise the dual formulation is faster. Some authors have referred to the transformation as Support Vector Elastic Net (SVEN), and provided the following MATLAB pseudo-code: == Software == "Glmnet: Lasso and elastic-net regularized generalized linear models" is a software which is implemented as an R source package and as a MATLAB toolbox. This includes fast algorithms for estimation of generalized linear models with ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two penalties (the elastic net) using cyclical coordinate descent, computed along a regularization path. JMP Pro 11 includes elastic net regularization, using the Generalized Regression personality with Fit Model. "pensim: Simulation of high-dimensional data and parallelized repeated penalized regression" implements an alternate, parallelised "2D" tuning method of the ℓ parameters, a method claimed to result in improved prediction accuracy. scikit-learn includes linear regression and logistic regression with elastic net regularization. SVEN, a Matlab implementation of Support Vector Elastic Net. This solver reduces the Elastic Net problem to an instance of SVM binary classification and uses a Matlab SVM solver to find the solution. Because SVM is easily parallelizable, the code can be faster than Glmnet on modern hardware. SpaSM, a Matlab implementation of sparse regression, classification and principal component analysis, including elastic net regularized regression. Apache Spark provides support for Elastic Net Regression in its MLlib machine learning library. The method is available as a parameter of the more general LinearRegression class. SAS (software) The SAS procedure Glmselect and SAS Viya procedure Regselect support the use of elastic net regularization for model selection.

    Read more →
  • Fitness approximation

    Fitness approximation

    Fitness approximation aims to approximate the objective or fitness functions in evolutionary optimization by building up machine learning models based on data collected from numerical simulations or physical experiments. The machine learning models for fitness approximation are also known as meta-models or surrogates, and evolutionary optimization based on approximated fitness evaluations are also known as surrogate-assisted evolutionary approximation. Fitness approximation in evolutionary optimization can be seen as a sub-area of data-driven evolutionary optimization. == Approximate models in function optimization == === Motivation === In many real-world optimization problems including engineering problems, the number of fitness function evaluations needed to obtain a good solution dominates the optimization cost. In order to obtain efficient optimization algorithms, it is crucial to use prior information gained during the optimization process. Conceptually, a natural approach to utilizing the known prior information is building a model of the fitness function to assist in the selection of candidate solutions for evaluation. A variety of techniques for constructing such a model, often also referred to as surrogates, metamodels or approximation models – for computationally expensive optimization problems have been considered. === Approaches === Common approaches to constructing approximate models based on learning and interpolation from known fitness values of a small population include: Low-degree polynomials and regression models Fourier surrogate modeling Artificial neural networks including Multilayer perceptrons Radial basis function network Support vector machines Due to the limited number of training samples and high dimensionality encountered in engineering design optimization, constructing a globally valid approximate model remains difficult. As a result, evolutionary algorithms using such approximate fitness functions may converge to local optima. Therefore, it can be beneficial to selectively use the original fitness function together with the approximate model.

    Read more →