A pamphlet war is a protracted argument or discussion through printed media, especially between the time the printing press became common, and when state intervention like copyright laws made such public discourse more difficult. The purpose was to defend or attack a certain perspective or idea. Pamphlet wars have occurred multiple times throughout history, as both social and political platforms. Pamphlet wars became viable platforms for this protracted discussion with the advent and spread of the printing press. Cheap printing presses, and increased literacy made the late 17th century a key stepping stone for the development of pamphlet wars, a period of prolific use of this type of debate. Over 2200 pamphlets were published between 1600–1715 alone. Pamphlet wars are generally credited for powering many key social changes of the era, including the Reformation and the Revolution Controversy, the English philosophical debate set off by the French Revolution. == History of the pamphlet in England == Throughout Europe in the 16th century, printed tracts were used to argue religious doctrine and foment support for religious causes. In England, Henry VIII used print literature to justify his break from the Catholic Church. During the subsequent reigns of Edward and Mary, print polemics escalated into propaganda warfare, as print media gained enormous potential to sway common opinion. By the 1560s, print was widely used to convey news. In 1562, the first pamphlets appeared, which discussed the English forces sent to aid the Protestant French Huguenots. In 1569, pamphlets reported the revolt of the Northern Earls and the subsequent Rebellion of the same year. In the 1580s, pamphlets began to replace broadsheet ballads as the means to convey information to the general public. Over the next century, the pamphlet became the principal means of garnering support for a cause or an idea, and was particularly influential during the English Civil Wars (1642-1651) and the Glorious Revolution of 1688. Through the ensuing decades, the pamphlet lost some popularity due to the emergence of newspapers and journals, but continued to be an important medium of public debate, as illustrated by the Revolution Controversy a full century later in the 1790s. == Pamphlet printing == Coming from a Latin word, "pamphlet" literally means "small book." In the early days of printing, the format of the book or pamphlet depended on the size of the paper used and the number of times it was folded. If a page was only folded once, it was called a folio. If it was folded twice, it was known as a quarto. An octave was a paper folded three times. A pamphlet was usually 1-12 sheets of paper folded in quarto, or 8-96 pages. It was sold for one or two pennies apiece. The printing of a pamphlet involved many people: the author, the printer, suppliers, print-makers, compositor, correctors, pressmen, binders, and distributors. Once the pamphleteer had written the pamphlet, it was sent to the printing house to be corrected, set into type, and printed. The papers were then given to the printer's warehouse-keeper, who bundled the copies and sent them to the bookseller, who was probably the one financing the printing. He was responsible to bind the pamphlets, usually by sewing them, and then sold them wholesale to individual bookselling vendors. The booksellers then sold them from a stall in the marketplace. == Pamphlet subjects == Pamphlets began as the means of conveyance for religious debates, and therefore religious topics were one of the main subjects they dealt with. The definition of a pamphlet came to mean a short work dealing with social, political, or religious issues. Typical topics included the Civil war, Church of England doctrines, Acts of Parliament, the Popish Plot (see below), the Stuart Era, and Cromwell propaganda. In addition, pamphlets were also used for romantic fiction, autobiography, scurrilous personal abuse, and social criticism. They contained much of the propaganda of the 17th century in the midst of the religious and political turmoil. They were also used for debates between the Puritans and the Anglican. During the Glorious Revolution, pamphlets were political weapons. == Authors == There were many authors of pamphlets. However some of the more popular authors include Daniel Defoe, Thomas Hobbes, Jonathan Swift, John Milton, and Samuel Pepys. Also included in the midst are Thomas Nashe, Joseph Addison, Richard Steele, and Matthew Prior. In 1591–1592, Robert Greene released a series of pamphlets which later inspired many other authors including Thomas Middleton and Thomas Dekker. == Critics == Pamphlets, along with their vast popularity, received criticism. There were many in the time period who believed that pamphlets were full of foolishness. They thought the pamphlets were not good enough literature and that they would turn people from "good" writing. They believed that pamphlets would be the end of the great volumes of literature and that great writing would be forgotten. == News reporting == Pamphlets made a great difference in the way news was reported to the general public. With the publication of pamphlets, it was no longer difficult for people to hear of events taking place far away. The closer the occurrence was to London, the easier and faster people heard of it. For example, the Battle of Edgehill took place on 23 October 1642. The first pamphlet reporting the incident was printed on 25 October 24 hours after some of the orders reported had been given. While not entirely accurate, and hurriedly made, the pamphlet nonetheless was able to tell the general public what had happened in the battle. A more accurate, specific, and readable account was available in a pamphlet printed on 26 October, and the "authorized" version was available only five days after the battle took place. == Marprelate pamphlets == In 1588, a series of pamphlets marked a turning point for the Puritans, dividing them from other Protestants in the country. The authors wrote under the pseudonym of Martin Marprelate and his two sons of the same name. The true identities of the authors were never discovered. The pamphlets aimed to provoke authorities to take action against censorship. The series was among the first to ask questions directly of its readers. == Early pamphlet wars == === Elizabethan pamphlet wars === As a means of forming or swaying public opinion, pamphlets like these had a part in influencing society, even as the content was itself influenced by society. During the 16th century and continuing for a short while in the early 17th century in England there was rise in the use of pamphlet wars to discuss a myriad of issues spanning from the civil war, to religious freedoms and the roles of women in society. The Queen herself participated in these discussions, making sure that she was widely read and understood by her people in order to gain favour and establish herself as the monarch despite being a woman. Examples of her use of this medium appear in To the Troops at Tilbury written in 1588, On Mary's Execution written in 1586, and many more. Another famous writer of this period to take advantage of the pamphlet was Emilia Lanier, famous for her arguments about the role of women. A common idea promoted by many literary works and the general attitude towards women, Lanier's work "Eve's Apology in Defence of Women" refuted the belief that Eve is responsible for the fall of man. A very uncommon and unpopular stance to take, Lanier accomplishes her defence through structuring it as an apology, one of the earliest subversive feminist texts. Similarly, Francis Bacon wrote his Essays to promote his idea of morality and other complicated social issues. For example, his work, "Of Love" examines the various understandings of the concept of love, particularly as it was perceived during the Elizabethan era. === Eikon Series === From 1649 until 1651, some five pamphlets were published in a debate about the execution of King Charles I of England (1600-1649). Prior to his execution, King Charles wrote the first pamphlet in the discussion, Eikon Basilike’’ (from the Greek “eikon” for image and “basileus” for king). The subtitle of this work - Portraiture of His Sacred Majesty in His Solitudes and Sufferings - indicates that Charles sought to portray himself as a martyr to the cause of regal prerogative. In the following months, several response pamphlets were published (collectively known as the "Eikon" series), including: Eikon Alethine, Eikon e Pistes, Eikonoklastes, and Eikon Aklastos,” alternately attacking or defending the king, his regicide, and his self-portrait in “Eikon Basilike.” == Popish Plot and Elizabeth Cellier == In the 1680s, after being acquitted of the "Meal-Tub Plot" for which she was accused, Elizabeth Cellier wrote Malice Defeated, which, along with The Matchless Picaro, sparked a pamphlet war surrounding debate of the ascension of a Catholic king to the thro
Wavelet noise
Wavelet noise is an alternative to Perlin noise which reduces the problems of aliasing and detail loss that are encountered when Perlin noise is summed into a fractal. == Algorithm detail == The basic algorithm for 2-dimensional wavelet noise is as follows: Create an image, R {\displaystyle R} , filled with uniform white noise. Downsample R {\displaystyle R} to half-size to create R ↓ {\displaystyle R^{\downarrow }} , then upsample it back up to full size to create R ↓↑ {\displaystyle R^{\downarrow \uparrow }} . Subtract R ↓↑ {\displaystyle R^{\downarrow \uparrow }} from R {\displaystyle R} to create the end result, N {\displaystyle N} . This results in an image that contains all the information that cannot be represented at half-scale. From here, N {\displaystyle N} can be used similarly to Perlin noise to create fractal patterns.
Growth function
The growth function, also called the shatter coefficient or the shattering number, measures the richness of a set family or class of functions. It is especially used in the context of statistical learning theory, where it is used to study properties of statistical learning methods. The term 'growth function' was coined by Vapnik and Chervonenkis in their 1968 paper, where they also proved many of its properties. It is a basic concept in machine learning. == Definitions == === Set-family definition === Let H {\displaystyle H} be a set family (a set of sets) and C {\displaystyle C} a set. Their intersection is defined as the following set-family: H ∩ C := { h ∩ C ∣ h ∈ H } {\displaystyle H\cap C:=\{h\cap C\mid h\in H\}} The intersection-size (also called the index) of H {\displaystyle H} with respect to C {\displaystyle C} is | H ∩ C | {\displaystyle |H\cap C|} . If a set C m {\displaystyle C_{m}} has m {\displaystyle m} elements then the index is at most 2 m {\displaystyle 2^{m}} . If the index is exactly 2m then the set C {\displaystyle C} is said to be shattered by H {\displaystyle H} , because H ∩ C {\displaystyle H\cap C} contains all the subsets of C {\displaystyle C} , i.e.: | H ∩ C | = 2 | C | , {\displaystyle |H\cap C|=2^{|C|},} The growth function measures the size of H ∩ C {\displaystyle H\cap C} as a function of | C | {\displaystyle |C|} . Formally: Growth ( H , m ) := max C : | C | = m | H ∩ C | {\displaystyle \operatorname {Growth} (H,m):=\max _{C:|C|=m}|H\cap C|} === Hypothesis-class definition === Equivalently, let H {\displaystyle H} be a hypothesis-class (a set of binary functions) and C {\displaystyle C} a set with m {\displaystyle m} elements. The restriction of H {\displaystyle H} to C {\displaystyle C} is the set of binary functions on C {\displaystyle C} that can be derived from H {\displaystyle H} : H C := { ( h ( x 1 ) , … , h ( x m ) ) ∣ h ∈ H , x i ∈ C } {\displaystyle H_{C}:=\{(h(x_{1}),\ldots ,h(x_{m}))\mid h\in H,x_{i}\in C\}} The growth function measures the size of H C {\displaystyle H_{C}} as a function of | C | {\displaystyle |C|} : Growth ( H , m ) := max C : | C | = m | H C | {\displaystyle \operatorname {Growth} (H,m):=\max _{C:|C|=m}|H_{C}|} == Examples == 1. The domain is the real line R {\displaystyle \mathbb {R} } . The set-family H {\displaystyle H} contains all the half-lines (rays) from a given number to positive infinity, i.e., all sets of the form { x > x 0 ∣ x ∈ R } {\displaystyle \{x>x_{0}\mid x\in \mathbb {R} \}} for some x 0 ∈ R {\displaystyle x_{0}\in \mathbb {R} } . For any set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains m + 1 {\displaystyle m+1} sets: the empty set, the set containing the largest element of C {\displaystyle C} , the set containing the two largest elements of C {\displaystyle C} , and so on. Therefore: Growth ( H , m ) = m + 1 {\displaystyle \operatorname {Growth} (H,m)=m+1} . The same is true whether H {\displaystyle H} contains open half-lines, closed half-lines, or both. 2. The domain is the segment [ 0 , 1 ] {\displaystyle [0,1]} . The set-family H {\displaystyle H} contains all the open sets. For any finite set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains all possible subsets of C {\displaystyle C} . There are 2 m {\displaystyle 2^{m}} such subsets, so Growth ( H , m ) = 2 m {\displaystyle \operatorname {Growth} (H,m)=2^{m}} . 3. The domain is the Euclidean space R n {\displaystyle \mathbb {R} ^{n}} . The set-family H {\displaystyle H} contains all the half-spaces of the form: x ⋅ ϕ ≥ 1 {\displaystyle x\cdot \phi \geq 1} , where ϕ {\displaystyle \phi } is a fixed vector. Then Growth ( H , m ) = Comp ( n , m ) {\displaystyle \operatorname {Growth} (H,m)=\operatorname {Comp} (n,m)} , where Comp is the number of components in a partitioning of an n-dimensional space by m hyperplanes. 4. The domain is the real line R {\displaystyle \mathbb {R} } . The set-family H {\displaystyle H} contains all the real intervals, i.e., all sets of the form { x ∈ [ x 0 , x 1 ] | x ∈ R } {\displaystyle \{x\in [x_{0},x_{1}]|x\in \mathbb {R} \}} for some x 0 , x 1 ∈ R {\displaystyle x_{0},x_{1}\in \mathbb {R} } . For any set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains all runs of between 0 and m {\displaystyle m} consecutive elements of C {\displaystyle C} . The number of such runs is ( m + 1 2 ) + 1 {\displaystyle {m+1 \choose 2}+1} , so Growth ( H , m ) = ( m + 1 2 ) + 1 {\displaystyle \operatorname {Growth} (H,m)={m+1 \choose 2}+1} . == Polynomial or exponential == The main property that makes the growth function interesting is that it can be either polynomial or exponential - nothing in-between. The following is a property of the intersection-size: If, for some set C m {\displaystyle C_{m}} of size m {\displaystyle m} , and for some number n ≤ m {\displaystyle n\leq m} , | H ∩ C m | ≥ Comp ( n , m ) {\displaystyle |H\cap C_{m}|\geq \operatorname {Comp} (n,m)} - then, there exists a subset C n ⊆ C m {\displaystyle C_{n}\subseteq C_{m}} of size n {\displaystyle n} such that | H ∩ C n | = 2 n {\displaystyle |H\cap C_{n}|=2^{n}} . This implies the following property of the Growth function. For every family H {\displaystyle H} there are two cases: The exponential case: Growth ( H , m ) = 2 m {\displaystyle \operatorname {Growth} (H,m)=2^{m}} identically. The polynomial case: Growth ( H , m ) {\displaystyle \operatorname {Growth} (H,m)} is majorized by Comp ( n , m ) ≤ m n + 1 {\displaystyle \operatorname {Comp} (n,m)\leq m^{n}+1} , where n {\displaystyle n} is the smallest integer for which Growth ( H , n ) < 2 n {\displaystyle \operatorname {Growth} (H,n)<2^{n}} . == Other properties == === Trivial upper bound === For any finite H {\displaystyle H} : Growth ( H , m ) ≤ | H | {\displaystyle \operatorname {Growth} (H,m)\leq |H|} since for every C {\displaystyle C} , the number of elements in H ∩ C {\displaystyle H\cap C} is at most | H | {\displaystyle |H|} . Therefore, the growth function is mainly interesting when H {\displaystyle H} is infinite. === Exponential upper bound === For any nonempty H {\displaystyle H} : Growth ( H , m ) ≤ 2 m {\displaystyle \operatorname {Growth} (H,m)\leq 2^{m}} I.e, the growth function has an exponential upper-bound. We say that a set-family H {\displaystyle H} shatters a set C {\displaystyle C} if their intersection contains all possible subsets of C {\displaystyle C} , i.e. H ∩ C = 2 C {\displaystyle H\cap C=2^{C}} . If H {\displaystyle H} shatters C {\displaystyle C} of size m {\displaystyle m} , then Growth ( H , C ) = 2 m {\displaystyle \operatorname {Growth} (H,C)=2^{m}} , which is the upper bound. === Cartesian intersection === Define the Cartesian intersection of two set-families as: H 1 ⨂ H 2 := { h 1 ∩ h 2 ∣ h 1 ∈ H 1 , h 2 ∈ H 2 } {\displaystyle H_{1}\bigotimes H_{2}:=\{h_{1}\cap h_{2}\mid h_{1}\in H_{1},h_{2}\in H_{2}\}} . Then: Growth ( H 1 ⨂ H 2 , m ) ≤ Growth ( H 1 , m ) ⋅ Growth ( H 2 , m ) {\displaystyle \operatorname {Growth} (H_{1}\bigotimes H_{2},m)\leq \operatorname {Growth} (H_{1},m)\cdot \operatorname {Growth} (H_{2},m)} === Union === For every two set-families: Growth ( H 1 ∪ H 2 , m ) ≤ Growth ( H 1 , m ) + Growth ( H 2 , m ) {\displaystyle \operatorname {Growth} (H_{1}\cup H_{2},m)\leq \operatorname {Growth} (H_{1},m)+\operatorname {Growth} (H_{2},m)} === VC dimension === The VC dimension of H {\displaystyle H} is defined according to these two cases: In the polynomial case, VCDim ( H ) = n − 1 {\displaystyle \operatorname {VCDim} (H)=n-1} = the largest integer d {\displaystyle d} for which Growth ( H , d ) = 2 d {\displaystyle \operatorname {Growth} (H,d)=2^{d}} . In the exponential case VCDim ( H ) = ∞ {\displaystyle \operatorname {VCDim} (H)=\infty } . So VCDim ( H ) ≥ d {\displaystyle \operatorname {VCDim} (H)\geq d} if-and-only-if Growth ( H , d ) = 2 d {\displaystyle \operatorname {Growth} (H,d)=2^{d}} . The growth function can be regarded as a refinement of the concept of VC dimension. The VC dimension only tells us whether Growth ( H , d ) {\displaystyle \operatorname {Growth} (H,d)} is equal to or smaller than 2 d {\displaystyle 2^{d}} , while the growth function tells us exactly how Growth ( H , m ) {\displaystyle \operatorname {Growth} (H,m)} changes as a function of m {\displaystyle m} . Another connection between the growth function and the VC dimension is given by the Sauer–Shelah lemma: If VCDim ( H ) = d {\displaystyle \operatorname {VCDim} (H)=d} , then: for all m {\displaystyle m} : Growth ( H , m ) ≤ ∑ i = 0 d ( m i ) {\displaystyle \operatorname {Growth} (H,m)\leq \sum _{i=0}^{d}{m \choose i}} In particular, for all m > d + 1 {\displaystyle m>d+1} : Growth ( H , m ) ≤ ( e m / d ) d = O ( m d ) {\displaystyle \operatorname {Growth} (H,m)\leq (
Dimensionality reduction
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics. Methods are commonly divided into linear and nonlinear approaches. Linear approaches can be further divided into feature selection and feature extraction. Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses. == Feature selection == The process of feature selection aims to find a suitable subset of the input variables (features, or attributes) for the task at hand. The three strategies are: the filter strategy (e.g., information gain), the wrapper strategy (e.g., accuracy-guided search), and the embedded strategy (features are added or removed while building the model based on prediction errors). Data analysis such as regression or classification can be done in the reduced space more accurately than in the original space. == Feature projection == Feature projection (also called feature extraction) transforms the data from the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning. === Principal component analysis (PCA) === The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance (and sometimes the correlation) matrix of the data is constructed and the eigenvectors on this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system, because they often contribute the vast majority of the system's energy, especially in low-dimensional systems. Still, this must be proved on a case-by-case basis as not all systems exhibit this behavior. The original space (with dimension of the number of points) has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors. === Non-negative matrix factorization (NMF) === NMF decomposes a non-negative matrix to the product of two non-negative ones, which has been a promising tool in fields where only non-negative signals exist, such as astronomy. NMF is well known since the multiplicative update rule by Lee & Seung, which has been continuously developed: the inclusion of uncertainties, the consideration of missing data and parallel computation, sequential construction which leads to the stability and linearity of NMF, as well as other updates including handling missing data in digital image processing. With a stable component basis during construction, and a linear modeling process, sequential NMF is able to preserve the flux in direct imaging of circumstellar structures in astronomy, as one of the methods of detecting exoplanets, especially for the direct imaging of circumstellar discs. In comparison with PCA, NMF does not remove the mean of the matrices, which leads to physical non-negative fluxes; therefore NMF is able to preserve more information than PCA as demonstrated by Ren et al. === Kernel PCA === Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique is capable of constructing nonlinear mappings that maximize the variance in the data. The resulting technique is called kernel PCA. === Graph-based kernel PCA === Other prominent nonlinear techniques include manifold learning techniques such as Isomap, locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and methods based on tangent space analysis. These techniques assume that the high-dimensional input data lies near a low-dimensional manifold embedded in the ambient space, and construct a low-dimensional representation using a cost function that retains local properties of the data; they can be viewed as defining a graph-based kernel for Kernel PCA. More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space) while maximizing the distances between points that are not nearest neighbors. An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include: classical multidimensional scaling, which is identical to PCA; Isomap, which uses geodesic distances in the data space; diffusion maps, which use diffusion distances in the data space; t-distributed stochastic neighbor embedding (t-SNE), which minimizes the divergence between distributions over pairs of points; and curvilinear component analysis. A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feedforward neural networks with a bottleneck hidden layer. The training of deep encoders is typically performed using a greedy layer-wise pre-training (e.g., using a stack of restricted Boltzmann machines) that is followed by a finetuning stage based on backpropagation. === Linear discriminant analysis (LDA) === Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. === Generalized discriminant analysis (GDA) === GDA deals with nonlinear discriminant analysis using kernel function operator. The underlying theory is close to the support-vector machines (SVM) insofar as the GDA method provides a mapping of the input vectors into high-dimensional feature space. Similar to LDA, the objective of GDA is to find a projection for the features into a lower dimensional space by maximizing the ratio of between-class scatter to within-class scatter. === Autoencoder === Autoencoders can be used to learn nonlinear dimension reduction functions and codings together with an inverse function from the coding to the original representation. === t-SNE === T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique useful for the visualization of high-dimensional datasets. It is not recommended for use in analysis such as clustering or outlier detection since it does not necessarily preserve densities or distances well. === UMAP === Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique. Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant. == Dimension reduction == For high-dimensional datasets, dimension reduction is usually performed prior to applying a k-nearest neighbors (k-NN) algorithm in order to mitigate the curse of dimensionality. Feature extraction and dimension reduction can be combined in one step, using principal component analysis (PCA), linear discriminant analysis (LDA), canonical correlation analysis (CCA), or non-negative matrix factorization (NMF) techniques to pre-process the data, followed by clustering via k-NN on feature vectors in a reduced-dimension space. In machine learning, this process is also called low-dimensional embedding. For high-dimensional datasets (e.g., when performing similarity search on live video streams, DNA data, or high-dimensional time series), running a fast approximate k-NN search using locality-sensitive hashing, random projection, "sketches", or other high-dimensional similarity search techniques from the VLDB conference toolbox may be the only fe
Multiclass classification
In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification). For example, deciding on whether an image is showing a banana, peach, orange, or an apple is a multiclass classification problem, with four possible classes (banana, peach, orange, apple), while deciding on whether an image contains an apple or not is a binary classification problem (with the two possible classes being: apple, no apple). While many classification algorithms (e.g., decision trees, k-NN, neural networks and multinomial logistic regression) naturally permit the use of more than two classes, some are by nature binary algorithms (e.g., classical binary support vector machine) and require decomposition strategies such as one-vs-all, one-vs-one, or ECOC to solve multiclass problems. Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance (e.g., predicting that an image contains both an apple and an orange, in the previous example). == Better-than-random multiclass models == From the confusion matrix of a multiclass model, we can determine whether a model does better than chance. Let K ≥ 3 {\displaystyle K\geq 3} be the number of classes, O {\displaystyle {\mathcal {O}}} a set of observations, y ^ : O → { 1 , . . . , K } {\displaystyle {\hat {y}}:{\mathcal {O}}\to \{1,...,K\}} a model of the target variable y : O → { 1 , . . . , K } {\displaystyle y:{\mathcal {O}}\to \{1,...,K\}} and n i , j {\displaystyle n_{i,j}} be the number of observations in the set { y = i } ∩ { y ^ = j } {\displaystyle \{y=i\}\cap \{{\hat {y}}=j\}} . We note n i . = ∑ j n i , j {\displaystyle n_{i.}=\sum _{j}n_{i,j}} , n . j = ∑ i n i , j {\displaystyle n_{.j}=\sum _{i}n_{i,j}} , n = ∑ j n . j = ∑ i n i . {\displaystyle n=\sum _{j}n_{.j}=\sum _{i}n_{i.}} , λ i = n i . n {\displaystyle \lambda _{i}={\frac {n_{i.}}{n}}} and μ j = n . j n {\displaystyle \mu _{j}={\frac {n_{.j}}{n}}} . It is assumed that the confusion matrix ( n i , j ) i , j {\displaystyle (n_{i,j})_{i,j}} contains at least one non-zero entry in each row, that is λ i > 0 {\displaystyle \lambda _{i}>0} for any i {\displaystyle i} . Finally we call "normalized confusion matrix" the matrix of conditional probabilities ( P ( y ^ = j ∣ y = i ) ) i , j = ( n i , j n i . ) i , j {\displaystyle (\mathbb {P} ({\hat {y}}=j\mid y=i))_{i,j}=\left({\frac {n_{i,j}}{n_{i.}}}\right)_{i,j}} . === Intuitive explanation === The lift is a way of measuring the deviation from independence of two events A {\displaystyle A} and B {\displaystyle B} : L i f t ( A , B ) = P ( A ∩ B ) P ( A ) P ( B ) = P ( A ∣ B ) P ( A ) = P ( B ∣ A ) P ( B ) {\displaystyle \mathrm {Lift} (A,B)={\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (A)\mathbb {P} (B)}}={\frac {\mathbb {P} (A\mid B)}{\mathbb {P} (A)}}={\frac {\mathbb {P} (B\mid A)}{\mathbb {P} (B)}}} We have L i f t ( A , B ) > 1 {\displaystyle \mathrm {Lift} (A,B)>1} if and only if events A {\displaystyle A} and B {\displaystyle B} occur simultaneously with a greater probability than if they were independent. In other words, if one of the two events occurs, the probability of observing the other event increases. A first condition to satisfy is to have L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} . And the quality of a model (better or worse than chance) does not change if we over- or undersample the dataset, that is if we multiply each row R i {\displaystyle R_{i}} of the confusion matrix by a constant c i {\displaystyle c_{i}} . Thus the second condition is that the necessary and sufficient conditions for doing better than chance need only depend on the normalized confusion matrix. The condition on lifts can be reformulated with One versus Rest binary models : for any i {\displaystyle i} , we define the binary target variable y i {\displaystyle y_{i}} which is the indicator of event { y = i } {\displaystyle \{y=i\}} , and the binary model y ^ i {\displaystyle {\hat {y}}_{i}} of y i {\displaystyle y_{i}} which is the indicator of event { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} . Each of the y ^ i {\displaystyle {\hat {y}}_{i}} models is a "One versus Rest" model. L i f t ( y = i , y ^ = i ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)} only depends on the events { y = i } {\displaystyle \{y=i\}} and { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} , so merging or not merging the other classes doesn't change its value. We therefore have L i f t ( y = i , y ^ = i ) = L i f t ( y i = 1 , y ^ i = 1 ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)=\mathrm {Lift} (y_{i}=1,{\hat {y}}_{i}=1)} and the first condition is that all binary One versus Rest models are better than chance. ==== Example ==== If K = 2 {\displaystyle K=2} and 2 is the class of interest , the normalized confusion matrix is ( s p e c i f i c i t y 1 − s p e c i f i c i t y 1 − s e n s i t i v i t y s e n s i t i v i t y ) {\displaystyle {\begin{pmatrix}\mathrm {specificity} &1-\mathrm {specificity} \\1-\mathrm {sensitivity} &\mathrm {sensitivity} \end{pmatrix}}} and we have L i f t ( y = 1 , y ^ = 1 ) − 1 = P ( y = y ^ = 1 ) λ 1 μ 1 − 1 = n 1 , 1 n n 1. n .1 − 1 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)-1={\frac {\mathbb {P} (y={\hat {y}}=1)}{\lambda _{1}\mu _{1}}}-1={\frac {n_{1,1}n}{n_{1.}n_{.1}}}-1} = n 1 , 1 ( n 1 , 1 + n 1 , 2 + n 2 , 1 + n 2 , 2 ) − ( n 1 , 1 + n 1 , 2 ) ( n 1 , 1 + n 2 , 1 ) n 1. n .1 = n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 n 1. n .1 {\displaystyle ={\frac {n_{1,1}(n_{1,1}+n_{1,2}+n_{2,1}+n_{2,2})-(n_{1,1}+n_{1,2})(n_{1,1}+n_{2,1})}{n_{1.}n_{.1}}}={\frac {n_{1,1}n_{2,2}-n_{1,2}n_{2,1}}{n_{1.}n_{.1}}}} . Thus L i f t ( y = 1 , y ^ = 1 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Similarly, by swapping the roles of 1 and 2, we find that L i f t ( y = 2 , y ^ = 2 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=2,{\hat {y}}=2)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Dividing by n 1. n 2. {\displaystyle n_{1.}n_{2.}} we find that the necessary and sufficient condition on the normalized confusion matrix is s e n s i t i v i t y s p e c i f i c i t y − ( 1 − s e n s i t i v i t y ) ( 1 − s p e c i f i c i t y ) ≥ 0 ⟺ s e n s i t i v i t y + s p e c i f i c i t y − 1 ≥ 0 ⟺ J ≥ 0 {\displaystyle \mathrm {sensitivity} \ \mathrm {specificity} -(1-\mathrm {sensitivity} )(1-\mathrm {specificity} )\geq 0\iff \mathrm {sensitivity} +\mathrm {specificity} -1\geq 0\iff J\geq 0} . This brings us back to the classical binary condition: Youden's J must be positive (or zero for random models). === Random models === A random model is a model that is independent of the target variable. This property is easily reformulated with the confusion matrix. This proposition shows that the model y ^ {\displaystyle {\hat {y}}} of y {\displaystyle y} is uninformative if and only if there are two families of numbers ( α i ) i {\displaystyle (\alpha _{i})_{i}} and ( β j ) j {\displaystyle (\beta _{j})_{j}} such that P ( { y = i } ∩ { y ^ = j } ) = α i β j {\displaystyle \mathbb {P} (\{y=i\}\cap \{{\hat {y}}=j\})=\alpha _{i}\beta _{j}} for any i {\displaystyle i} and j {\displaystyle j} . === Multiclass likelihood ratios and diagnostic odds ratios === We define generalized likelihood ratios calculated from the normalized confusion matrix: for any i {\displaystyle i} and j ≠ i {\displaystyle j\not =i} , let L R i , j = P ( y ^ = j ∣ y = j ) P ( y ^ = j ∣ y = i ) {\displaystyle \mathrm {LR} _{i,j}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)}}} . When K = 2 {\displaystyle K=2} , if 2 is the class of interest,, we find the classical likelihood ratios L R 1 , 2 = L R + {\displaystyle \mathrm {LR} _{1,2}=\mathrm {LR} _{+}} and L R 2 , 1 = 1 L R − {\displaystyle \mathrm {LR} _{2,1}={\frac {1}{\mathrm {LR} _{-}}}} . Multiclass diagnostic odds ratios can also be defined using the formula D O R i , j = D O R j , i = L R i , j L R j , i = n i , i n j , j n i , j n j , i = P ( y ^ = j ∣ y = j ) / P ( y ^ = i ∣ y = j ) P ( y ^ = j ∣ y = i ) / P ( y ^ = i ∣ y = i ) {\displaystyle \mathrm {DOR} _{i,j}=\mathrm {DOR} _{j,i}=\mathrm {LR} _{i,j}\mathrm {LR} _{j,i}={\frac {n_{i,i}n_{j,j}}{n_{i,j}n_{j,i}}}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)/\mathbb {P} ({\hat {y}}=i\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)/\mathbb {P} ({\hat {y}}=i\mid y=i)}}} We saw above that a better-than-chance model (or a random model) must verify L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} and λ i {\displaystyle \lambda _{i}} . According to the previous corollary, likelihood ratios are thus greater
Simulation noise
Simulation noise is a function that creates a divergence-free vector field. This signal can be used in artistic simulations for the purpose of increasing the perception of extra detail. The function can be calculated in three dimensions by dividing the space into a regular lattice grid. With each edge is associated a random value, indicating a rotational component of material revolving around the edge. By following rotating material into and out of faces, one can quickly sum the flux passing through each face of the lattice. Flux values at lattice faces are then interpolated to create a field value for all positions. Perlin noise is the earliest form of lattice noise, which has become very popular in computer graphics. Perlin Noise is not suited for simulation because it is not divergence-free. Noises based on lattices, such as simulation noise and Perlin noise, are often calculated at different frequencies and summed together to form band-limited fractal signals. Other approaches developed later that use vector calculus identities to produce divergence free fields, such as "Curl-Noise" as suggested by Rook Bridson, and "Divergence-Free Noise" due to Ivan DeWolf. These often require calculation of lattice noise gradients, which sometimes are not readily available. A naive implementation would call a lattice noise function several times to calculate its gradient, resulting in more computation than is strictly necessary. Unlike these noises, simulation noise has a geometric rationale in addition to its mathematical properties. It simulates vortices scattered in space, to produce its pleasing aesthetic. == Curl noise == The vector field is created as follows, for every point (x,y,z) in the space a vector field G is created, every component x, y and z of the vector field (Gx, Gy, Gz) is defined by a 3D perlin or simplex noise function with x, y and z as parameters. The partial derivative of Gx, Gy, and Gz respect to x, y and z is obtained with the gradient of the perlin or simplex noise by finite differences of implicit calculation inside the simplex noise. The partial derivatives are used to calculate F as the curl of G given by F = ( ∂ G z ∂ y − ∂ G y ∂ z , ∂ G x ∂ z − ∂ G z ∂ x , ∂ G y ∂ x − ∂ G x ∂ y ) {\displaystyle F=({\frac {\partial Gz}{\partial y}}-{\frac {\partial Gy}{\partial z}},{\frac {\partial Gx}{\partial z}}-{\frac {\partial Gz}{\partial x}},{\frac {\partial Gy}{\partial x}}-{\frac {\partial Gx}{\partial y}})} == Bitangent noise == This method is based in the fact that the curl of the gradient of scalar field is zero and the identity that expand the divergence of a cross product of two vectors A and B as the difference of the dot products of each vector with the curl of the other: ∇ × ( ∇ φ ) = 0 . {\displaystyle \nabla \times (\nabla \varphi )=\mathbf {0} .} ∇ ⋅ ( A × B ) = ( ∇ × A ) ⋅ B − A ⋅ ( ∇ × B ) {\displaystyle \nabla \cdot (\mathbf {A} \times \mathbf {B} )=\ (\nabla {\times }\mathbf {A} )\cdot \mathbf {B} \,-\,\mathbf {A} \cdot (\nabla {\times }\mathbf {B} )} which means that if the curl of both vector fields is zero then the divergence of the product of two vectors that are the gradients of scalar fields is zero too. This result in a divergence free vector field by construction only calling two noise functions to create the scalar fields. The vector field es created as follows, two scalar fields are calculated ϕ {\displaystyle \phi } and ψ {\displaystyle \psi } using 3D perlin or simplex noise functions, then the gradients A and B of each of this fields is calculated, the cross product of A and B gives a divergence free vector field. == Signed distance noise == The vector field is created based on a closed and differentiable implicit surface S = F(x,y,z) = 0. For every point in the space, frequently outside or near the surface, we get a vector g that is normal to the surface, this is the gradient of S or the partial derivatives respect to x, y and z, this vector is not unitary, but we can get a unitary normal n by dividing each component of the point by the magnitude of the gradient g. Outside of the surface all these normals point away from the surface. g = ∇ F ( x , y , z ) = ( ∂ F ∂ x , ∂ F ∂ y , ∂ F ∂ z ) {\displaystyle g=\nabla F(x,y,z)=\left({\frac {\partial F}{\partial x}},{\frac {\partial F}{\partial y}},{\frac {\partial F}{\partial z}}\right)} n = g ( x , y , z ) ‖ ∇ F ( x , y , z ) ‖ {\displaystyle \mathbf {n} ={\frac {g(x,y,z)}{\|\nabla F(x,y,z)\|}}} ‖ ∇ F ( x , y , z ) ‖ = ( ∂ F ∂ x ) 2 + ( ∂ F ∂ y ) 2 + ( ∂ F ∂ z ) 2 {\displaystyle \|\nabla F(x,y,z)\|={\sqrt {\left({\frac {\partial F}{\partial x}}\right)^{2}+\left({\frac {\partial F}{\partial y}}\right)^{2}+\left({\frac {\partial F}{\partial z}}\right)^{2}}}} Afterwards we calculate a scalar value p for that point in the space using a 3D perlin or simplex noise function. Now we create a vector field V = pn pointing outside of the surface. The curl of this vector field gives the direction in every point in the space where the particles should move. S D N = ( ∂ V z ∂ y − ∂ V y ∂ z , ∂ V x ∂ z − ∂ V z ∂ x , ∂ V y ∂ x − ∂ V x ∂ y ) {\displaystyle SDN=({\frac {\partial Vz}{\partial y}}-{\frac {\partial Vy}{\partial z}},{\frac {\partial Vx}{\partial z}}-{\frac {\partial Vz}{\partial x}},{\frac {\partial Vy}{\partial x}}-{\frac {\partial Vx}{\partial y}})} By construction this vector SDN will point in a tangent direction to an isosurface at the level of the signed distance to the original surface and can be used to confine the movements of the particles to stay in that surface.
Multilinear principal component analysis
Multilinear principal component analysis (MPCA) is a multilinear extension of principal component analysis (PCA) that is used to analyze M-way arrays, also informally referred to as "data tensors". M-way arrays may be modeled by linear tensor models, such as CANDECOMP/Parafac, or by multilinear tensor models, such as multilinear principal component analysis (MPCA) or multilinear (tensor) independent component analysis (MICA). In 2005, Vasilescu and Terzopoulos introduced the Multilinear PCA terminology as a way to better differentiate between multilinear data models that employed 2nd order statistics versus higher order statistics to compute a set of independent components for each mode, such as Multilinear ICA Multilinear PCA may be applied to compute the causal factors of data formation, or as signal processing tool on data tensors whose individual observation have either been vectorized, or whose observations are treated as a collection of column/row observations, an "observation as a matrix", and concatenated into a data tensor. The latter approach is suitable for compression and reducing redundancy in the rows, columns and fibers that are unrelated to the causal factors of data formation. Vasilescu and Terzopoulos in their paper "TensorFaces" introduced the M-mode SVD algorithm which are algorithms misidentified in the literature as the HOSVD or the Tucker which employ the power method or gradient descent, respectively. Vasilescu and Terzopoulos framed the data analysis, recognition and synthesis problems as multilinear tensor problems. Data is viewed as the compositional consequence of several causal factors, that are well suited for multi-modal tensor factor analysis. The power of the tensor framework was showcased by analyzing human motion joint angles, facial images or textures in the following papers: Human Motion Signatures (CVPR 2001, ICPR 2002), face recognition – TensorFaces, (ECCV 2002, CVPR 2003, etc.) and computer graphics – TensorTextures (Siggraph 2004). == The algorithm == The MPCA solution follows the alternating least square (ALS) approach. It is iterative in nature. As in PCA, MPCA works on centered data. Centering is a little more complicated for tensors, and it is problem dependent. == Feature selection == MPCA features: Supervised MPCA is employed in causal factor analysis that facilitates object recognition while a semi-supervised MPCA feature selection is employed in visualization tasks. == Extensions == Various extension of MPCA: Robust MPCA (RMPCA) Multi-Tensor Factorization, that also finds the number of components automatically (MTF)