Unfold (app)

Unfold (app)

Unfold is a mobile application that allows users to create social media content using a variety of templates and other tools. It was founded in 2018 by Alfonso Cobo and Andy McCune. It enables users to add photos, video, and text with a variety of tools. In 2019, Unfold was acquired by Squarespace. == History == In January 2017, Alfonso Cobo was studying at Parsons School of Design when he realized there was no software or app that could create a portfolio of his work on an iPad. Cobo created an app called Portfolio, a basic version of a portfolio layout app, and the first one to exist for iPad. He launched it in 2017. After launching the first version of Portfolio, Cobo realized the more popular market and use case was on mobile. Around that time, Instagram was launching Stories. As a result, Cobo pivoted the app away from portfolios and instead focused on an app to showcase one's stories. Cobo later contacted Andy McCune, founder of social media account Earth, to collaborate with Unfold. Unfold also partnered with various companies to create custom templates. These include Equinox, Tommy Hilfiger, NARS, Billboard Music Awards, and Product Red. Unfold also launched a collection of Product Red templates to help eliminate HIV/AIDS in several African countries. In 2019, Squarespace acquired Unfold. The Unfold app has been downloaded over 60 million times and has been used to create over 1 billion Instagram stories. == Features == With Unfold, users can utilize hundreds of templates to make social content for social media platforms such as Instagram, Snapchat, and Facebook. The free app offers users basic templates and standard fonts, filters, and stickers, and there are also premium templates available for a monthly subscription. With Unfold+ and Unfold Pro (previously Unfold for Brands), users can access premium templates and tools, as well as upload custom brand assets and fonts. In 2020, Unfold launched Bio Sites, which allows users to link to multiple sites and platforms.

Comparison gallery of image scaling algorithms

This gallery shows the results of numerous image scaling algorithms. == Scaling methods == An image size can be changed in several ways. Consider resizing a 160x160 pixel photo to the following 40x40 pixel thumbnail and then scaling the thumbnail to a 160x160 pixel image. Also consider doubling the size of the following image containing text. == Examples of enlarged images == Below are examples of various images enlarged 4x using each scaling algorithm.

Mean squared error

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the true value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk (the average loss on an observed data set), as an estimate of the true MSE (the true risk: the average loss on the actual population distribution). The MSE is a measure of the quality of an estimator. As it is derived from the square of Euclidean distance, it is always a positive value that decreases as the error approaches zero. The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator (how widely spread the estimates are from one data sample to another) and its bias (how far off the average estimated value is from the true value). For an unbiased estimator, the MSE is the variance of the estimator. Like the variance, MSE has the same units of measurement as the square of the quantity being estimated. In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error. == Definition and basic properties == The MSE either assesses the quality of a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable), or of an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled). In the context of prediction, understanding the prediction interval can also be useful as it provides a range within which a future observation will fall, with a certain probability. The definition of an MSE differs according to whether one is describing a predictor or an estimator. === Predictor === If a vector of n {\displaystyle n} predictions is generated from a sample of n {\displaystyle n} data points on all variables, and Y {\displaystyle Y} is the vector of observed values of the variable being predicted, with Y ^ {\displaystyle {\hat {Y}}} being the predicted values (e.g. as from a least-squares fit), then the within-sample MSE of the predictor is computed as MSE = 1 n ∑ i = 1 n ( Y i − Y i ^ ) 2 {\displaystyle \operatorname {MSE} ={\frac {1}{n}}\sum _{i=1}^{n}\left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} In other words, the MSE is the mean ( 1 n ∑ i = 1 n ) {\textstyle \left({\frac {1}{n}}\sum _{i=1}^{n}\right)} of the squares of the errors ( Y i − Y i ^ ) 2 {\textstyle \left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} . This is an easily computable quantity for a particular sample (and hence is sample-dependent). In matrix notation, MSE = 1 n ∑ i = 1 n ( e i ) 2 = 1 n e T e {\displaystyle \operatorname {MSE} ={\frac {1}{n}}\sum _{i=1}^{n}(e_{i})^{2}={\frac {1}{n}}\mathbf {e} ^{\mathsf {T}}\mathbf {e} } where e i {\displaystyle e_{i}} is Y i − Y i ^ {\displaystyle Y_{i}-{\hat {Y_{i}}}} and e {\displaystyle \mathbf {e} } is a n × 1 {\displaystyle n\times 1} column vector. The MSE can also be computed on q data points that were not used in estimating the model, either because they were held back for this purpose, or because these data have been newly obtained. Within this process, known as cross-validation, the MSE is often called the test MSE, and is computed as MSE = 1 q ∑ i = n + 1 n + q ( Y i − Y i ^ ) 2 {\displaystyle \operatorname {MSE} ={\frac {1}{q}}\sum _{i=n+1}^{n+q}\left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} === Estimator === The MSE of an estimator θ ^ {\displaystyle {\hat {\theta }}} with respect to an unknown parameter θ {\displaystyle \theta } is defined as MSE ⁡ ( θ ^ ) = E θ ⁡ [ ( θ ^ − θ ) 2 ] . {\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {E} _{\theta }\left[({\hat {\theta }}-\theta )^{2}\right].} This definition depends on the unknown parameter, therefore the MSE is a priori property of an estimator. The MSE could be a function of unknown parameters, in which case any estimator of the MSE based on estimates of these parameters would be a function of the data (and thus a random variable). If the estimator θ ^ {\displaystyle {\hat {\theta }}} is derived as a sample statistic and is used to estimate some population parameter, then the expectation is with respect to the sampling distribution of the sample statistic. The MSE can be written as the sum of the variance of the estimator and the squared bias of the estimator, providing a useful way to calculate the MSE and implying that in the case of unbiased estimators, the MSE and variance are equivalent. MSE ⁡ ( θ ^ ) = Var θ ⁡ ( θ ^ ) + Bias ⁡ ( θ ^ , θ ) 2 . {\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {Var} _{\theta }({\hat {\theta }})+\operatorname {Bias} ({\hat {\theta }},\theta )^{2}.} ==== Proof of variance and bias relationship ==== MSE ⁡ ( θ ^ ) = E θ ⁡ [ ( θ ^ − θ ) 2 ] = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] + E θ ⁡ [ θ ^ ] − θ ) 2 ] = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 + 2 ( θ ^ − E θ ⁡ [ θ ^ ] ) ( E θ ⁡ [ θ ^ ] − θ ) + ( E θ ⁡ [ θ ^ ] − θ ) 2 ] = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 ] + E θ ⁡ [ 2 ( θ ^ − E θ ⁡ [ θ ^ ] ) ( E θ ⁡ [ θ ^ ] − θ ) ] + E θ ⁡ [ ( E θ ⁡ [ θ ^ ] − θ ) 2 ] = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 ] + 2 ( E θ ⁡ [ θ ^ ] − θ ) E θ ⁡ [ θ ^ − E θ ⁡ [ θ ^ ] ] + ( E θ ⁡ [ θ ^ ] − θ ) 2 E θ ⁡ [ θ ^ ] − θ = constant = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 ] + 2 ( E θ ⁡ [ θ ^ ] − θ ) ( E θ ⁡ [ θ ^ ] − E θ ⁡ [ θ ^ ] ) + ( E θ ⁡ [ θ ^ ] − θ ) 2 E θ ⁡ [ θ ^ ] = constant = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 ] + ( E θ ⁡ [ θ ^ ] − θ ) 2 = Var θ ⁡ ( θ ^ ) + Bias θ ⁡ ( θ ^ , θ ) 2 {\displaystyle {\begin{aligned}\operatorname {MSE} ({\hat {\theta }})&=\operatorname {E} _{\theta }\left[({\hat {\theta }}-\theta )^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]+\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}+2\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+\operatorname {E} _{\theta }\left[2\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\right]+\operatorname {E} _{\theta }\left[\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+2\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\operatorname {E} _{\theta }\left[{\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right]+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}&&\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta ={\text{constant}}\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+2\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}&&\operatorname {E} _{\theta }[{\hat {\theta }}]={\text{constant}}\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\\&=\operatorname {Var} _{\theta }({\hat {\theta }})+\operatorname {Bias} _{\theta }({\hat {\theta }},\theta )^{2}\end{aligned}}} An even shorter proof can be achieved using the well-known formula that for a random variable X {\textstyle X} , E ( X 2 ) = Var ⁡ ( X ) + ( E ( X ) ) 2 {\textstyle \mathbb {E} (X^{2})=\operatorname {Var} (X)+(\mathbb {E} (X))^{2}} . By substituting X {\textstyle X} with, θ ^ − θ {\textstyle {\hat {\theta }}-\theta } , we have MSE ⁡ ( θ ^ ) = E [ ( θ ^ − θ ) 2 ] = Var ⁡ ( θ ^ − θ ) + ( E [ θ ^ − θ ] ) 2 = Var ⁡ ( θ ^ ) + Bias 2 ⁡ ( θ ^ , θ ) {\displaystyle {\begin{aligned}\operatorname {MSE} ({\hat {\theta }})&=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]\\&=\operator

Self-play

Self-play is a technique for improving the performance of reinforcement learning agents. Intuitively, agents learn to improve their performance by playing "against themselves". == Definition and motivation == In multi-agent reinforcement learning experiments, researchers try to optimize the performance of a learning agent on a given task, in cooperation or competition with one or more agents. These agents learn by trial-and-error, and researchers may choose to have the learning algorithm play the role of two or more of the different agents. When successfully executed, this technique has a double advantage: It provides a straightforward way to determine the actions of the other agents, resulting in a meaningful challenge. It increases the amount of experience that can be used to improve the policy, by a factor of two or more, since the viewpoints of each of the different agents can be used for learning. Czarnecki et al argue that most of the games that people play for fun are "Games of Skill", meaning games whose space of all possible strategies looks like a spinning top. In more detail, we can partition the space of strategies into sets L 1 , L 2 , . . . , L n {\displaystyle L_{1},L_{2},...,L_{n}} , such that any i < j , π i ∈ L i , π j ∈ L j {\displaystyle i

Geographical cluster

A geographical cluster is a localized anomaly, usually an excess of something given the distribution or variation of something else. Often it is considered as an incidence rate that is unusual in that there is more of some variable than might be expected. Examples would include: a local excess disease rate, a crime hot spot, areas of high unemployment, accident blackspots, unusually high positive residuals from a model, high concentrations of flora or fauna, physical features or events like earthquake epicenters etc... Identifying these extreme regions may be useful in that there could be implicit geographical associations with other variables that can be identified and would be of interest. Pattern detection via the identification of such geographical clusters is a very simple and generic form of geographical analysis that has many applications in many different contexts. The emphasis is on localized clustering or patterning because this may well contain the most useful information. A geographical cluster is different from a high concentration as it is generally second order, involving the factoring in of the distribution of something else. == Geographical cluster detection == Identifying geographical clusters can be an important stage in a geographical analysis. Mapping the locations of unusual concentrations may help identify causes of these. Some techniques include the Geographical Analysis Machine and Besag and Newell's cluster detection method.

Knowledge graph embedding

In representation learning, knowledge graph embedding (KGE), also called knowledge representation learning (KRL), or multi-relation learning, is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning. Leveraging their embedded representation, knowledge graphs can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction. == Definition == A knowledge graph G = { E , R , F } {\displaystyle {\mathcal {G}}=\{E,R,F\}} is a collection of entities E {\displaystyle E} , relations R {\displaystyle R} , and facts F {\displaystyle F} . A fact is a triple ( h , r , t ) ∈ F {\displaystyle (h,r,t)\in F} that denotes a link r ∈ R {\displaystyle r\in R} between the head h ∈ E {\displaystyle h\in E} and the tail t ∈ E {\displaystyle t\in E} of the triple. Another notation that is often used in the literature to represent a triple (or fact) is ⟨ head , relation , tail ⟩ {\displaystyle \langle {\text{head}},{\text{relation}},{\text{tail}}\rangle } . This notation is called the Resource Description Framework (RDF). A knowledge graph represents the knowledge related to a specific domain; leveraging this structured representation, it is possible to infer a piece of new knowledge from it after some refinement steps. However, nowadays, people have to deal with the sparsity of data and the computational inefficiency to use them in a real-world application. The embedding of a knowledge graph is a function that translates each entity and each relation into a vector of a given dimension d {\displaystyle d} , called embedding dimension. It is even possible to embed the entities and relations with different dimensions. The embedding vectors can then be used for other tasks. A knowledge graph embedding is characterized by four aspects: Representation space: The low-dimensional space in which the entities and relations are represented. Scoring function: A measure of the goodness of a triple-embedded representation. Encoding models: The modality in which the embedded representation of the entities and relations interact with each other. Additional information: Any additional information coming from the knowledge graph that can enrich the embedded representation. Usually, an ad hoc scoring function is integrated into the general scoring function for each additional piece of information. == Embedding procedure == All algorithms for creating a knowledge graph embedding follow the same approach. First, the embedding vectors are initialized to random values. Then, they are iteratively optimized using a training set of triples. In each iteration, a batch of size b {\displaystyle b} triples is sampled from the training set, and a triple from it is sampled and corrupted—i.e., a triple that does not represent a true fact in the knowledge graph. The corruption of a triple involves substituting the head or the tail (or both) of the triple with another entity that makes the fact false. The original triple and the corrupted triple are added in the training batch, and then the embeddings are updated, optimizing a scoring function. Iteration stops when a stop condition is reached. Usually, the stop condition depends on the overfitting of the training set. At the end, the learned embeddings should have extracted semantic meaning from the training triples and should correctly predict unseen true facts in the knowledge graph. === Pseudocode === The following is the pseudocode for the general embedding procedure. algorithm Compute entity and relation embeddings input: The training set S = { ( h , r , t ) } {\displaystyle S=\{(h,r,t)\}} , entity set E {\displaystyle E} , relation set R {\displaystyle R} , embedding dimension k {\displaystyle k} output: Entity and relation embeddings initialization: the entities e {\displaystyle e} and relations r {\displaystyle r} embeddings (vectors) are randomly initialized while stop condition do S b a t c h ← s a m p l e ( S , b ) {\displaystyle S_{batch}\leftarrow sample(S,b)} // Sample a batch from the training set for each ( h , r , t ) {\displaystyle (h,r,t)} in S b a t c h {\displaystyle S_{batch}} do ( h ′ , r , t ′ ) ← s a m p l e ( S ′ ) {\displaystyle (h',r,t')\leftarrow sample(S')} // Sample a corrupted fact T b a t c h ← T b a t c h ∪ { ( ( h , r , t ) , ( h ′ , r , t ′ ) ) } {\displaystyle T_{batch}\leftarrow T_{batch}\cup \{((h,r,t),(h',r,t'))\}} end for Update embeddings by minimizing the loss function end while == Performance indicators == These indexes are often used to measure the embedding quality of a model. The simplicity of the indexes makes them very suitable for evaluating the performance of an embedding algorithm even on a large scale. Given Q {\displaystyle {\ce {Q}}} as the set of all ranked predictions of a model, it is possible to define three different performance indexes: Hits@K, MR, and MRR. === Hits@K === Hits@K or in short, H@K, is a performance index that measures the probability to find the correct prediction in the first top K model predictions. Usually, it is used k = 10 {\displaystyle k=10} . Hits@K reflects the accuracy of an embedding model to predict the relation between two given triples correctly. Hits@K = | { q ∈ Q : q < k } | | Q | ∈ [ 0 , 1 ] {\displaystyle ={\frac {|\{q\in Q:q

Polynomial kernel

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models. Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these. In the context of regression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond to logical conjunctions of input features. == Definition == For degree-d polynomials, the polynomial kernel is defined as K ( x , y ) = ( x T y + c ) d {\displaystyle K(\mathbf {x} ,\mathbf {y} )=(\mathbf {x} ^{\mathsf {T}}\mathbf {y} +c)^{d}} where x and y are vectors of size n in the input space, i.e. vectors of features computed from training or test samples and c ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When c = 0, the kernel is called homogeneous. (A further generalized polykernel divides xTy by a user-specified scalar parameter a.) As a kernel, K corresponds to an inner product in a feature space based on some mapping φ: K ( x , y ) = ⟨ φ ( x ) , φ ( y ) ⟩ {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {y} )\rangle } The nature of φ can be seen from an example. Let d = 2, so we get the special case of the quadratic kernel. After using the multinomial theorem (twice—the outermost application is the binomial theorem) and regrouping, K ( x , y ) = ( ∑ i = 1 n x i y i + c ) 2 = ∑ i = 1 n ( x i 2 ) ( y i 2 ) + ∑ i = 2 n ∑ j = 1 i − 1 ( 2 x i x j ) ( 2 y i y j ) + ∑ i = 1 n ( 2 c x i ) ( 2 c y i ) + c 2 {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\left(\sum _{i=1}^{n}x_{i}y_{i}+c\right)^{2}=\sum _{i=1}^{n}\left(x_{i}^{2}\right)\left(y_{i}^{2}\right)+\sum _{i=2}^{n}\sum _{j=1}^{i-1}\left({\sqrt {2}}x_{i}x_{j}\right)\left({\sqrt {2}}y_{i}y_{j}\right)+\sum _{i=1}^{n}\left({\sqrt {2c}}x_{i}\right)\left({\sqrt {2c}}y_{i}\right)+c^{2}} From this it follows that the feature map is given by: φ ( x ) = ( x n 2 , … , x 1 2 , 2 x n x n − 1 , … , 2 x n x 1 , 2 x n − 1 x n − 2 , … , 2 x n − 1 x 1 , … , 2 x 2 x 1 , 2 c x n , … , 2 c x 1 , c ) {\displaystyle \varphi (x)=\left(x_{n}^{2},\ldots ,x_{1}^{2},{\sqrt {2}}x_{n}x_{n-1},\ldots ,{\sqrt {2}}x_{n}x_{1},{\sqrt {2}}x_{n-1}x_{n-2},\ldots ,{\sqrt {2}}x_{n-1}x_{1},\ldots ,{\sqrt {2}}x_{2}x_{1},{\sqrt {2c}}x_{n},\ldots ,{\sqrt {2c}}x_{1},c\right)} generalizing for ( x T y + c ) d {\displaystyle \left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}} , where x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} , y ∈ R n {\displaystyle \mathbf {y} \in \mathbb {R} ^{n}} and applying the multinomial theorem: ( x T y + c ) d = ∑ j 1 + j 2 + ⋯ + j n + 1 = d d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 d ! j 1 ! ⋯ j n ! j n + 1 ! y 1 j 1 ⋯ y n j n c j n + 1 = φ ( x ) T φ ( y ) {\displaystyle {\begin{alignedat}{2}\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}&=\sum _{j_{1}+j_{2}+\dots +j_{n+1}=d}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}y_{1}^{j_{1}}\cdots y_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\\&=\varphi (\mathbf {x} )^{T}\varphi (\mathbf {y} )\end{alignedat}}} The last summation has l d = ( n + d d ) {\displaystyle l_{d}={\tbinom {n+d}{d}}} elements, so that: φ ( x ) = ( a 1 , … , a l , … , a l d ) {\displaystyle \varphi (\mathbf {x} )=\left(a_{1},\dots ,a_{l},\dots ,a_{l_{d}}\right)} where l = ( j 1 , j 2 , . . . , j n , j n + 1 ) {\displaystyle l=(j_{1},j_{2},...,j_{n},j_{n+1})} and a l = d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 | j 1 + j 2 + ⋯ + j n + j n + 1 = d {\displaystyle a_{l}={\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\quad |\quad j_{1}+j_{2}+\dots +j_{n}+j_{n+1}=d} == Practical use == Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP). The most common degree is d = 2 (quadratic), since larger degrees tend to overfit on NLP problems. Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including: full expansion of the kernel prior to training/testing with a linear SVM, i.e. full computation of the mapping φ as in polynomial regression; basket mining (using a variant of the apriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion; inverted indexing of support vectors. One problem with the polynomial kernel is that it may suffer from numerical instability: when xTy + c < 1, K(x, y) = (xTy + c)d tends to zero with increasing d, whereas when xTy + c > 1, K(x, y) tends to infinity.