Best AI Video Creation Tools

Best AI Video Creation Tools — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Brill tagger

    Brill tagger

    The Brill tagger is an inductive method for part-of-speech tagging. It was invented and described by Eric Brill in his 1993 PhD thesis. It can be summarized as an "error-driven transformation-based tagger". It is a form of supervised learning, which aims to minimize error, and a transformation-based process, in the sense that a tag is assigned to each word and then changed using a set of predefined rules. In the transformation process, a known word is first assigned its most frequent tag, while an unknown word is naively assigned the tag "noun". High accuracy is eventually achieved by applying the rules iteratively and changing the incorrect tags. This approach ensures that valuable information, such as the morphosyntactic construction of words, is employed in an automatic tagging process.

    == Algorithm ==

    The algorithm starts with initialization, which is the assignment of tags based on their probability for each word (for example, "dog" is more often a noun than a verb). Then "patches" are determined via rules that correct (probable) tagging errors made in the initialization phase:

    Initialization:
      Known words (in vocabulary): assigning the most frequent tag associated to a form of the word
      Unknown word

    == Rules and processing ==

    The input text is first tokenized, or broken into words. Typically in natural language processing, contractions such as "'s", "n't", and the like are considered separate word tokens, as are punctuation marks. A dictionary and some morphological rules then provide an initial tag for each word token. For example, a simple lookup would reveal that "dog" may be a noun or a verb (the most frequent tag is simply chosen), while an unknown word will be assigned some tag(s) based on capitalization, various prefix or suffix strings, etc. (such morphological analyses, which Brill calls Lexical Rules, may vary between implementations).

    After all word tokens have (provisional) tags, contextual rules apply iteratively to correct the tags by examining small amounts of context. This is where the Brill method differs from other part-of-speech tagging methods such as those using hidden Markov models. Rules are reapplied repeatedly, until a threshold is reached or no more rules can apply. Brill rules are of the general form:

        tag1 → tag2 IF Condition

    where the Condition tests the preceding and/or following word tokens, or their tags (the notation for such rules differs between implementations). For example, in Brill's notation:

        IN NN WDPREVTAG DT while

    would change the tag of a word from IN (preposition) to NN (common noun) if the preceding word's tag is DT (determiner) and the word itself is "while". This covers cases like "all the while" or "in a while", where "while" should be tagged as a noun rather than its more common use as a conjunction (many rules are more general). Rules should only operate if the tag being changed is also known to be permissible, for the word in question or in principle (for example, most adjectives in English can also be used as nouns). Rules of this kind can be implemented by simple finite-state machines. See Part of speech tagging for more general information, including descriptions of the Penn Treebank and other sets of tags. Typical Brill taggers use a few hundred rules, which may be developed by linguistic intuition or by machine learning on a pre-tagged corpus.

    == Code ==

    Brill's code pages at Johns Hopkins University are no longer on the web. An archived version of a mirror of the Brill tagger, at its latest version as it was available at Plymouth Tech, can be found on Archive.org. The software uses the MIT License.
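    The two phases described in this entry (initial tagging, then repeated application of contextual rules) can be illustrated with a minimal Python sketch. This is not Brill's original code: the `ContextRule` class, the toy `LEXICON`, and the `max_passes` cut-off are hypothetical names introduced here for illustration, and only the "previous tag" style of condition is modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ContextRule:
    """tag1 -> tag2 IF the previous tag is prev_tag (and, optionally, the word matches)."""
    from_tag: str
    to_tag: str
    prev_tag: str
    word: Optional[str] = None


# Hypothetical toy lexicon: the most frequent tag for each known word.
LEXICON = {"all": "DT", "the": "DT", "while": "IN", "dog": "NN", "barked": "VBD"}


def initial_tags(tokens: List[str]) -> List[str]:
    # Known words get their most frequent tag; unknown words default to noun.
    return [LEXICON.get(tok.lower(), "NN") for tok in tokens]


def apply_rules(tokens: List[str], tags: List[str],
                rules: List[ContextRule], max_passes: int = 10) -> List[str]:
    tags = list(tags)
    for _ in range(max_passes):          # stop at a threshold ...
        changed = False
        for rule in rules:
            for i in range(1, len(tokens)):
                if (tags[i] == rule.from_tag
                        and tags[i - 1] == rule.prev_tag
                        and (rule.word is None or tokens[i].lower() == rule.word)):
                    tags[i] = rule.to_tag
                    changed = True
        if not changed:                  # ... or when no rule applies any more
            break
    return tags


tokens = ["all", "the", "while", "the", "dog", "barked"]
# The rule "IN NN WDPREVTAG DT while" from the text.
rules = [ContextRule("IN", "NN", "DT", word="while")]
tags = apply_rules(tokens, initial_tags(tokens), rules)
```

    Here "while" starts out as IN and is patched to NN because the preceding tag is DT. A fuller implementation would also check that the new tag is permissible for the word, as the text requires.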

  • International Conference on Computer Vision

    International Conference on Computer Vision

    The International Conference on Computer Vision (ICCV) is a research conference sponsored by the Institute of Electrical and Electronics Engineers (IEEE) and held every other year. It is considered to be one of the top conferences in computer vision, alongside CVPR and ECCV, and it is held in years in which ECCV is not. The conference is usually spread over four to five days. Typically, experts in the focus areas give tutorial talks on the first day, and the technical sessions (with poster sessions in parallel) follow. Recent conferences have also had an increasing number of focused workshops and a commercial exhibition.

    == Awards ==

    === Azriel Rosenfeld Lifetime Achievement Award ===

    The Azriel Rosenfeld Award, or Azriel Rosenfeld Lifetime Achievement Award, recognizes researchers who have made significant contributions to the field of computer vision over their careers. It is named in memory of computer scientist and mathematician Azriel Rosenfeld. The following people have received this award:

    === Helmholtz Prize ===

    The ICCV Helmholtz Prize, known as the Test of Time Award before 2013, is awarded every other year at the ICCV, recognizing ICCV papers from ten or more years earlier that had a significant impact on computer vision research. Winners are selected by the IEEE Computer Society's Technical Committee on Pattern Analysis and Machine Intelligence. The award is named after the 19th century physician and physicist Hermann von Helmholtz. The ICCV's award is not related to the various Helmholtz Prizes in physics or to the Hermann von Helmholtz Prize in neuroscience.

    === Marr Prize ===

    The ICCV best-paper award is the Marr Prize, named after British neuroscientist David Marr.

    === Mark Everingham Prize ===

    The Mark Everingham Prize is an award given yearly by the Technical Committee on Pattern Analysis and Machine Intelligence of the IEEE Computer Society, at the IEEE International Conference on Computer Vision or the European Conference on Computer Vision, to commemorate the late Mark Everingham, "one of the rising stars of computer vision", and to encourage others to follow in his footsteps by acting to further progress in the computer vision community as a whole. The prize is given to a researcher, or a team of researchers, who have made a selfless contribution of significant benefit to other members of the computer vision community. The Mark Everingham Prize for Rigorous Evaluation was an award given in 2012 at the British Machine Vision Conference.

    === PAMI Distinguished Researcher Award ===

    The PAMI Distinguished Researcher Award (called the Significant Researcher Award until 2013) is awarded to candidates whose research projects have significantly contributed to the progress of computer vision. Awards are made based on major research contributions, as well as the role of those contributions in influencing and inspiring other research. Candidates are nominated by the community. The following people have received this award:

    == Conference list ==

    The conference is usually held in the Spring in various international locations.

  • Representer theorem

    Representer theorem

    In computer science, specifically in statistical learning theory, a representer theorem is any of several related results stating that a minimizer $f^{*}$ of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data.

    == Formal statement ==

    The following Representer Theorem and its proof are due to Schölkopf, Herbrich, and Smola:

    Theorem: Consider a positive-definite real-valued kernel $k\colon \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ on a non-empty set $\mathcal{X}$ with a corresponding reproducing kernel Hilbert space $H_{k}$. Let there be given a training sample $(x_{1},y_{1}),\dotsc,(x_{n},y_{n})\in\mathcal{X}\times\mathbb{R}$, a strictly increasing real-valued function $g\colon[0,\infty)\to\mathbb{R}$, and an arbitrary error function $E\colon(\mathcal{X}\times\mathbb{R}^{2})^{n}\to\mathbb{R}\cup\lbrace\infty\rbrace$, which together define the following regularized empirical risk functional on $H_{k}$:

    $$f\mapsto E\left((x_{1},y_{1},f(x_{1})),\ldots,(x_{n},y_{n},f(x_{n}))\right)+g\left(\lVert f\rVert\right).$$

    Then, any minimizer of the empirical risk

    $$f^{*}=\underset{f\in H_{k}}{\operatorname{argmin}}\left\lbrace E\left((x_{1},y_{1},f(x_{1})),\ldots,(x_{n},y_{n},f(x_{n}))\right)+g\left(\lVert f\rVert\right)\right\rbrace,\quad(*)$$

    admits a representation of the form

    $$f^{*}(\cdot)=\sum_{i=1}^{n}\alpha_{i}k(\cdot,x_{i}),$$

    where $\alpha_{i}\in\mathbb{R}$ for all $1\leq i\leq n$.

    Proof: Define a mapping

    $$\varphi\colon\mathcal{X}\to H_{k},\qquad\varphi(x)=k(\cdot,x)$$

    (so that $\varphi(x)=k(\cdot,x)$ is itself a map $\mathcal{X}\to\mathbb{R}$). Since $k$ is a reproducing kernel,

    $$\varphi(x)(x')=k(x',x)=\langle\varphi(x'),\varphi(x)\rangle,$$

    where $\langle\cdot,\cdot\rangle$ is the inner product on $H_{k}$. Given any $x_{1},\ldots,x_{n}$, one can use orthogonal projection to decompose any $f\in H_{k}$ into a sum of two functions, one lying in $\operatorname{span}\left\lbrace\varphi(x_{1}),\ldots,\varphi(x_{n})\right\rbrace$ and the other lying in the orthogonal complement:

    $$f=\sum_{i=1}^{n}\alpha_{i}\varphi(x_{i})+v,$$

    where $\langle v,\varphi(x_{i})\rangle=0$ for all $i$.

    The above orthogonal decomposition and the reproducing property together show that applying $f$ to any training point $x_{j}$ produces

    $$f(x_{j})=\left\langle\sum_{i=1}^{n}\alpha_{i}\varphi(x_{i})+v,\varphi(x_{j})\right\rangle=\sum_{i=1}^{n}\alpha_{i}\langle\varphi(x_{i}),\varphi(x_{j})\rangle,$$

    which we observe is independent of $v$. Consequently, the value of the error function $E$ in $(*)$ is likewise independent of $v$. For the second term (the regularization term), since $v$ is orthogonal to $\sum_{i=1}^{n}\alpha_{i}\varphi(x_{i})$ and $g$ is strictly increasing, we have

    $$\begin{aligned}g\left(\lVert f\rVert\right)&=g\left(\left\lVert\sum_{i=1}^{n}\alpha_{i}\varphi(x_{i})+v\right\rVert\right)\\&=g\left(\sqrt{\left\lVert\sum_{i=1}^{n}\alpha_{i}\varphi(x_{i})\right\rVert^{2}+\lVert v\rVert^{2}}\right)\\&\geq g\left(\left\lVert\sum_{i=1}^{n}\alpha_{i}\varphi(x_{i})\right\rVert\right).\end{aligned}$$

    Therefore, setting $v=0$ does not affect the first term of $(*)$, while it strictly decreases the second term whenever $v\neq 0$. Consequently, any minimizer $f^{*}$ in $(*)$ must have $v=0$, i.e., it must be of the form

    $$f^{*}(\cdot)=\sum_{i=1}^{n}\alpha_{i}\varphi(x_{i})=\sum_{i=1}^{n}\alpha_{i}k(\cdot,x_{i}),$$

    which is the desired result.

    == Generalizations ==

    The Theorem stated above is a particular example of a family of results that are collectively referred to as "representer theorems"; here we describe several such.

    The first statement of a representer theorem was due to Kimeldorf and Wahba for the special case in which

    $$\begin{aligned}E\left((x_{1},y_{1},f(x_{1})),\ldots,(x_{n},y_{n},f(x_{n}))\right)&=\frac{1}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})^{2},\\g(\lVert f\rVert)&=\lambda\lVert f\rVert^{2}\end{aligned}$$

    for $\lambda>0$. Schölkopf, Herbrich, and Smola generalized this result by relaxing the assumption of the squared-loss cost and allowing the regularizer to be any strictly monotonically increasing function $g(\cdot)$ of the Hilbert space norm.

    It is possible to generalize further by augmenting the regularized empirical risk functional through the addition of unpenalized offset terms. For example, Schölkopf, Herbrich, and Smola also consider the minimization

    $$\tilde{f}^{*}=\operatorname{argmin}\left\lbrace E\left((x_{1},y_{1},\tilde{f}(x_{1})),\ldots,(x_{n},y_{n},\tilde{f}(x_{n}))\right)+g\left(\lVert f\rVert\right)\;\middle|\;\tilde{f}=f+h\in H_{k}\oplus\operatorname{span}\lbrace\psi_{p}\mid 1\leq p\leq M\rbrace\right\rbrace,\quad(\dagger)$$

    i.e., we consider functions of the form $\tilde{f}=f+h$, where $f\in H_{k}$ and $h$ is an unpenalized function lying in the span of a finite set of real-valued functions $\lbrace\psi_{p}\colon\mathcal{X}\to\mathbb{R}\mid 1\leq p\leq M\rbrace$. Under the assumption that the $n\times M$ matrix $\left(\psi_{p}(x_{i})\right)_{ip}$ has rank $M$, they show that the minimizer $\tilde{f}^{*}$ in $(\dagger)$ admits a representation of the form

    $$\tilde{f}^{*}(\cdot)=\sum_{i=1}^{n}\alpha_{i}k(\cdot,x_{i})+\sum_{p=1}^{M}\beta_{p}\psi_{p}(\cdot),$$

    where $\alpha_{i},\beta_{p}\in\mathbb{R}$ and the $\beta_{p}$ are all uniquely determined.

    The conditions under which a representer theorem exists were investigated by Argyriou, Micchelli, and Pontil, who proved the following:

    Theorem: Let $\mathcal{X}$ be a nonempty set, $k$ a positive-definite real-valued kernel on $\mathcal{X}\times\mathcal{X}$ with corresponding reproducing kernel Hilbert space $H_{k}$, and let $R\colon H_{k}\to\mathbb{R}$ be a differentiable regularization function. Then given a training sample $(x_{1},y_{1}),\ldots,(x_{n},y_{n})\in\mathcal{X}\times\mathbb{R}$ and an arbitrary error function $E\colon(\mathcal{X}\times\mathbb{R}^{2})^{n}\to\mathbb{R}\cup\lbrace\infty\rbrace$, a minimizer

    $$f^{*}=\underset{f\in H_{k}}{\operatorname{argmin}}\left\lbrace E\left((x_{1},y_{1},f(x_{1})),\ldots,(x_{n},y_{n},f(x_{n}))\right)+R(f)\right\rbrace\quad(\ddagger)$$

    of the regularized empirical risk admits a repr
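    For the special case due to Kimeldorf and Wahba described in this entry (squared loss with regularizer λ‖f‖²), the representer theorem reduces the search over the whole Hilbert space to a search over n coefficients. The sketch below is illustrative only: it assumes a one-dimensional input space and a Gaussian kernel, and the helper names `rbf`, `solve`, `fit`, and `predict` are hypothetical. Substituting the finite expansion into the objective gives the linear system (K + nλI)α = y, where K is the kernel matrix on the training inputs.

```python
import math


def rbf(x, z, gamma=1.0):
    # Gaussian kernel (an assumed choice; any positive-definite kernel would do).
    return math.exp(-gamma * (x - z) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit(xs, ys, lam, kernel=rbf):
    """Coefficients alpha of f*(.) = sum_i alpha_i k(., x_i)."""
    n = len(xs)
    K = [[kernel(a, b) for b in xs] for a in xs]
    # (K + n * lam * I) alpha = y
    A = [[K[i][j] + (n * lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve(A, ys)


def predict(x, xs, alpha, kernel=rbf):
    # Evaluate the finite kernel expansion at a new point x.
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, xs))


xs = [0.0, 1.0, 2.0, 3.0]   # toy training inputs (made up for illustration)
ys = [0.0, 0.8, 0.9, 0.1]   # toy training targets (made up for illustration)
alpha = fit(xs, ys, lam=0.1)
```

    The fitted function is evaluated anywhere with `predict(x, xs, alpha)`; it depends on the training inputs only through the n kernel functions k(·, x_i), which is exactly the form the theorem guarantees.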

  • Stochastic block model

    Stochastic block model

    The stochastic block model is a generative model for random graphs. This model tends to produce graphs containing communities, subsets of nodes characterized by being connected with one another with particular edge densities. For example, edges may be more common within communities than between communities. Its mathematical formulation was first introduced in 1983 in the field of social network analysis by Paul W. Holland et al. The stochastic block model is important in statistics, machine learning, and network science, where it serves as a useful benchmark for the task of recovering community structure in graph data.

    == Definition ==

    The stochastic block model takes the following parameters: the number $n$ of vertices; a partition of the vertex set $\{1,\ldots,n\}$ into disjoint subsets $C_{1},\ldots,C_{r}$, called communities; and a symmetric $r\times r$ matrix $P$ of edge probabilities. The edge set is then sampled at random as follows: any two vertices $u\in C_{i}$ and $v\in C_{j}$ are connected by an edge with probability $P_{ij}$. An example problem is: given a graph with $n$ vertices, where the edges are sampled as described, recover the groups $C_{1},\ldots,C_{r}$.

    == Special cases ==

    If the probability matrix is a constant, in the sense that $P_{ij}=p$ for all $i,j$, then the result is the Erdős–Rényi model $G(n,p)$. This case is degenerate (the partition into communities becomes irrelevant), but it illustrates a close relationship to the Erdős–Rényi model.
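    The sampling procedure in the definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `sample_sbm` and its argument layout (a list of vertex lists for the partition, a nested list for the matrix P) are assumptions made here.

```python
import random


def sample_sbm(communities, P, seed=None):
    """Sample an undirected edge set from a stochastic block model.

    communities: list of lists of vertex ids forming a partition.
    P: symmetric r x r matrix (nested lists) of edge probabilities.
    """
    rng = random.Random(seed)
    # Map every vertex to the index of its community.
    label = {v: i for i, block in enumerate(communities) for v in block}
    vertices = sorted(label)
    edges = set()
    for a, u in enumerate(vertices):
        for v in vertices[a + 1:]:
            # u in C_i and v in C_j are joined with probability P[i][j].
            if rng.random() < P[label[u]][label[v]]:
                edges.add((u, v))
    return edges


communities = [[0, 1, 2], [3, 4]]
P = [[0.9, 0.1],
     [0.1, 0.8]]
edges = sample_sbm(communities, P, seed=42)
```

    Setting every entry of `P` to the same value p makes the community labels irrelevant, which is the Erdős–Rényi special case mentioned above.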
    The planted partition model is the special case in which the values of the probability matrix $P$ are a constant $p$ on the diagonal and another constant $q$ off the diagonal. Thus two vertices within the same community share an edge with probability $p$, while two vertices in different communities share an edge with probability $q$. Sometimes it is this restricted model that is called the stochastic block model. The case where $p>q$ is called an assortative model, while the case $p<q$ is called disassortative.

    Returning to the general stochastic block model, a model is called strongly assortative if $P_{ii}>P_{jk}$ whenever $j\neq k$: all diagonal entries dominate all off-diagonal entries. A model is called weakly assortative if $P_{ii}>P_{ij}$ whenever $i\neq j$: each diagonal entry is only required to dominate the rest of its own row and column. Disassortative forms of this terminology exist, obtained by reversing all inequalities. For some algorithms, recovery might be easier for block models with assortative or disassortative conditions of this form.

    == Typical statistical tasks ==

    Much of the literature on algorithmic community detection addresses three statistical tasks: detection, partial recovery, and exact recovery.

    === Detection ===

    The goal of detection algorithms is simply to determine, given a sampled graph, whether the graph has latent community structure. More precisely, a graph might be generated, with some known prior probability, from a known stochastic block model, and otherwise from a similar Erdős–Rényi model. The algorithmic task is to correctly identify which of these two underlying models generated the graph.

    === Partial recovery ===

    In partial recovery, the goal is to approximately determine the latent partition into communities, in the sense of finding a partition that is correlated with the true partition significantly better than a random guess.
    === Exact recovery ===

    In exact recovery, the goal is to recover the latent partition into communities exactly. The community sizes and probability matrix may be known or unknown.

    == Statistical lower bounds and threshold behavior ==

    Stochastic block models exhibit a sharp threshold effect reminiscent of percolation thresholds. Suppose that we allow the size $n$ of the graph to grow, keeping the community sizes in fixed proportions. If the probability matrix remains fixed, tasks such as partial and exact recovery become feasible for all non-degenerate parameter settings. However, if we scale down the probability matrix at a suitable rate as $n$ increases, we observe a sharp phase transition: for certain settings of the parameters, it will become possible to achieve recovery with probability tending to 1, whereas on the opposite side of the parameter threshold, the probability of recovery tends to 0 no matter what algorithm is used.

    For partial recovery, the appropriate scaling is to take $P_{ij}=\tilde{P}_{ij}/n$ for fixed $\tilde{P}$, resulting in graphs of constant average degree. In the case of two equal-sized communities, in the assortative planted partition model with probability matrix

    $$P=\begin{pmatrix}\tilde{p}/n&\tilde{q}/n\\\tilde{q}/n&\tilde{p}/n\end{pmatrix},$$

    partial recovery is feasible with probability $1-o(1)$ whenever $(\tilde{p}-\tilde{q})^{2}>2(\tilde{p}+\tilde{q})$, whereas any estimator fails partial recovery with probability $1-o(1)$ whenever $(\tilde{p}-\tilde{q})^{2}<2(\tilde{p}+\tilde{q})$.
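    As a small worked check of the partial recovery condition just stated, the following sketch compares the two sides of the inequality for given scaled parameters. The function name `partial_recovery_feasible` is hypothetical, and the boundary case of equality is deliberately left undecided because the statement above does not cover it.

```python
def partial_recovery_feasible(p_tilde, q_tilde):
    """Two equal-sized communities, P = [[p~/n, q~/n], [q~/n, p~/n]].

    Returns True above the threshold, False below it, and None exactly on it.
    """
    signal = (p_tilde - q_tilde) ** 2
    noise = 2 * (p_tilde + q_tilde)
    if signal > noise:
        return True
    if signal < noise:
        return False
    return None  # equality: not addressed by the stated condition


# (5 - 1)^2 = 16 > 2 * (5 + 1) = 12, so this setting is above the threshold.
above = partial_recovery_feasible(5, 1)
# (3 - 1)^2 = 4 < 2 * (3 + 1) = 8, so this setting is below the threshold.
below = partial_recovery_feasible(3, 1)
```

    Note that the comparison involves only the scaled constants, not n itself: the graph size enters through the 1/n scaling of the probabilities.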
    For exact recovery, the appropriate scaling is to take $P_{ij}=\tilde{P}_{ij}\log n/n$, resulting in graphs of logarithmic average degree. Here a similar threshold exists: for the assortative planted partition model with $r$ equal-sized communities, the threshold lies at $\sqrt{\tilde{p}}-\sqrt{\tilde{q}}=\sqrt{r}$. In fact, the exact recovery threshold is known for the fully general stochastic block model.

    == Algorithms ==

    In principle, exact recovery can be solved in its feasible range using maximum likelihood, but this amounts to solving a constrained or regularized cut problem such as minimum bisection, which is typically NP-complete. Hence, no known efficient algorithms will correctly compute the maximum-likelihood estimate in the worst case. However, a wide variety of algorithms perform well in the average case, and many high-probability performance guarantees have been proven for algorithms in both the partial and exact recovery settings. Successful algorithms include spectral clustering of the vertices, semidefinite programming, forms of belief propagation, and community detection, among others.

    == Variants ==

    Several variants of the model exist. One minor tweak allocates vertices to communities randomly, according to a categorical distribution, rather than in a fixed partition. More significant variants include the degree-corrected stochastic block model, the hierarchical stochastic block model, the geometric block model, the censored block model, and the mixed-membership block model.

    == Topic models ==

    Stochastic block models have been recognised as topic models on bipartite networks. In a network of documents and words, a stochastic block model can identify topics: groups of words with a similar meaning.
    == Extensions to signed graphs ==

    Signed graphs allow for both favorable and adverse relationships and serve as a common model choice for various data analysis applications, e.g., correlation clustering. The stochastic block model can be trivially extended to signed graphs by assigning both positive and negative edge weights or, equivalently, by using a difference of adjacency matrices of two stochastic block models.

    == DARPA/MIT/AWS Graph Challenge: streaming stochastic block partition ==

    GraphChallenge encourages community approaches to developing new solutions for analyzing graphs and sparse data derived from social media, sensor feeds, and scientific data, to enable relationships between events to be discovered as they unfold in the field. Streaming stochastic block partition has been one of the challenges since 2017. Spectral clustering has demonstrated outstanding performance compared to the original and even the improved base algorithm, matching its quality of clusters while being multiple orders of magnitude faster.

  • Gallery software

    Gallery software

    Gallery software is software that helps the user publish or share photos, pictures, videos or other digital media. Most galleries are located on Web servers, where users are allowed to register and publish their pictures. Gallery software usually features automatic image resizing, allows digital media to be categorized into sets, and allows comments.

    == Types ==

    Early digital media publishing and sharing was done with imageboards. The boards are organised by topic, and a board is sometimes called a "chan". Each discussion in a "chan" is started with a piece of digital media, and follow-up posts can contain further media. Software that works in this way includes Futallaby and Danbooru.

    Traditionally, galleries are managed. An administrator maintains a set or hierarchy of albums. Users can upload their digital media to one of the existing albums defined by an administrator, or create their own albums. Users with sufficient permission can re-categorise the digital media that others uploaded. Often, the site's administrator can define which albums users are allowed to categorise their media into, or delete other users' content. Examples are the open source galleries Coppermine and Gallery Project.

    There is also decentralised gallery software that does not have an administrator for managing content. Pinterest, Flickr and DeviantArt have been successful with this model. The open source gallery software MediaGoblin works in this way. Each user can create their own "collections" to categorise their own or other users' media. However, users cannot put media into other users' collections. Each user's categories are separate, and there is no centralised theme or hierarchy for the media.

  • Persian Speech Corpus

    Persian Speech Corpus

    The Persian Speech Corpus is a Modern Persian speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of about 2.5 hours of Persian speech aligned with recorded speech on the phoneme level, including annotations of word boundaries. Previous spoken corpora of Persian include FARSDAT, which consists of read-aloud speech from newspaper texts from 100 Persian speakers, and the Telephone FARsi Spoken language DATabase (TFARSDAT), which comprises seven hours of read and spontaneous speech produced by 60 native speakers of Persian from ten regions of Iran.

    The Persian Speech Corpus was built using the same methodologies laid out in the doctoral project on Modern Standard Arabic of Nawar Halabi at the University of Southampton. The work was funded by MicroLinkPC, who own an exclusive license to commercialise the corpus, though the corpus is available for non-commercial use through the corpus' website. It is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    The corpus was built for speech synthesis purposes, and has been used for building HMM-based voices in Persian. It can also be used to automatically align other speech corpora with their phonetic transcript, and could be used as part of a larger corpus for training speech recognition systems.

    == Contents ==

    The corpus is downloadable from its website, and contains the following:

      • 396 .wav files containing spoken utterances
      • 396 .lab files containing text utterances
      • 396 .TextGrid files containing the phoneme labels with time stamps of the boundaries where these occur in the .wav files
      • phonetic-transcript.txt, which has the form "[wav_filename]" "[Phoneme Sequence]" in every line
      • orthographic-transcript.txt, which has the form "[wav_filename]" "[Orthographic Transcript]" in every line
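    The two transcript files share a simple two-field line format, so a small reader is enough to load them. The sketch below is an assumption-laden illustration rather than tooling shipped with the corpus: it assumes each line holds exactly two double-quoted fields with no embedded double quotes, and the sample file names and phoneme strings are invented placeholders, not real corpus entries.

```python
import re

# Two double-quoted fields separated by whitespace: "[wav_filename]" "[text]"
LINE_RE = re.compile(r'^\s*"([^"]*)"\s+"([^"]*)"\s*$')


def parse_transcript(lines):
    """Return a dict mapping wav file name to its transcript string."""
    entries = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {number} is not of the form \"name\" \"text\"")
        entries[match.group(1)] = match.group(2)
    return entries


# Hypothetical sample lines (placeholders, not taken from the corpus).
sample = [
    '"utt001.wav" "s a l aa m"',
    '"utt002.wav" "kh o d aa"',
]
transcripts = parse_transcript(sample)
```

    With a real file, the same function can be applied to an open file handle, for example `parse_transcript(open("phonetic-transcript.txt", encoding="utf-8"))`; the encoding is also an assumption.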

  • Determining the number of clusters in a data set

    Determining the number of clusters in a data set

    Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids and the expectation–maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms such as DBSCAN and OPTICS do not require the specification of this parameter; hierarchical clustering avoids the problem altogether.

    The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.

    == Elbow method ==

    The elbow method looks at the percentage of explained variance as a function of the number of clusters: one should choose a number of clusters so that adding another cluster does not give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". In most datasets, this "elbow" is ambiguous, making the method subjective. Because the scale of the axes is arbitrary, the concept of an angle is not well-defined, and the curve produces an "elbow" even on uniform random data, making the method rather unreliable. Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test. A slight variation of this method plots the curvature of the within-group variance.

    The method can be traced to speculation by Robert L. Thorndike in 1953. While the idea of the elbow method sounds simple and straightforward, other methods (as detailed below) give better results.

    == X-means clustering ==

    In statistics and data mining, X-means clustering is a variation of k-means clustering that refines cluster assignments by repeatedly attempting subdivision, and keeping the best resulting splits, until a criterion such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC) is reached.

    == Information criterion approach ==

    Another set of methods for determining the number of clusters are information criteria, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), or the deviance information criterion (DIC), provided it is possible to make a likelihood function for the clustering model. For example, the k-means model is "almost" a Gaussian mixture model, and one can construct a likelihood for the Gaussian mixture model and thus also determine information criterion values.

    == Information–theoretic approach ==

    Rate distortion theory has been applied to choosing k in what is called the "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by information-theoretic standards.
The strategy of the algorithm is to generate a distortion curve for the input data by running a standard clustering algorithm such as k-means for all values of k between 1 and n, and computing the distortion (described below) of the resulting clustering. The distortion curve is then transformed by a negative power chosen based on the dimensionality of the data. Jumps in the resulting values then signify reasonable choices for k, with the largest jump representing the best choice.

    The distortion of a clustering of some input data is formally defined as follows: Let the data set be modeled as a p-dimensional random variable, X, consisting of a mixture distribution of G components with common covariance, Γ. If we let $c_1 \ldots c_K$ be a set of K cluster centers, with $c_X$ the closest center to a given sample of X, then the minimum average distortion per dimension when fitting the K centers to the data is:

    $$d_K = \frac{1}{p} \min_{c_1 \ldots c_K} E\left[(X - c_X)^T \Gamma^{-1} (X - c_X)\right]$$

    This is also the average squared Mahalanobis distance per dimension between X and the closest cluster center $c_X$. Because the minimization over all possible sets of cluster centers is prohibitively complex, the distortion is computed in practice by generating a set of cluster centers using a standard clustering algorithm and computing the distortion using the result. The pseudo-code for the jump method with an input set of p-dimensional data points X is:

    JumpMethod(X):
        Let Y = (p/2)
        Init a list D, of size n+1
        Let D[0] = 0
        For k = 1 ... n:
            Cluster X with k clusters (e.g., with k-means)
            Let d = Distortion of the resulting clustering
            D[k] = d^(-Y)
        Define J(i) = D[i] - D[i-1]
        Return the k between 1 and n that maximizes J(k)

    The choice of the transform power $Y = p/2$ is motivated by asymptotic reasoning using results from rate distortion theory. Let the data X have a single, arbitrary p-dimensional Gaussian distribution, and let $K = \lfloor \alpha^p \rfloor$ be fixed, for some α greater than zero. Then the distortion of a clustering of K clusters in the limit as p goes to infinity is $\alpha^{-2}$. It can be seen that asymptotically, the distortion of a clustering to the power $(-p/2)$ is proportional to $\alpha^p$, which by definition is approximately the number of clusters K. In other words, for a single Gaussian distribution, increasing K beyond the true number of clusters, which should be one, causes a linear growth in the transformed distortion. This behavior is important in the general case of a mixture of multiple distribution components. Let X be a mixture of G p-dimensional Gaussian distributions with common covariance. Then for any fixed K less than G, the distortion of a clustering as p goes to infinity is infinite. Intuitively, this means that a clustering with fewer than the correct number of clusters is unable to describe asymptotically high-dimensional data, causing the distortion to increase without limit. If, as described above, K is made an increasing function of p, namely $K = \lfloor \alpha^p \rfloor$, the same result as above is achieved, with the value of the distortion in the limit as p goes to infinity being equal to $\alpha^{-2}$. Correspondingly, there is the same proportional relationship between the transformed distortion and the number of clusters, K. 
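
The jump method pseudo-code above can be sketched in plain Python. This is only an illustration under stated assumptions, not a reference implementation: the covariance Γ is taken to be the identity (so the distortion reduces to squared Euclidean distance per dimension), the helper names `sq_dist`, `kmeans`, `distortion` and `jump_method` are hypothetical, the k-means routine uses a simple deterministic farthest-point seeding, and the search is capped at a user-chosen `max_k` rather than n, because the distortion reaches zero (and the transformed value diverges) once every point has its own cluster.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Lloyd's k-means with deterministic farthest-point seeding; returns the centers."""
    centers = [points[0]]
    while len(centers) < k:
        # Seed the next center at the point farthest from all current centers.
        centers.append(max(points, key=lambda pt: min(sq_dist(pt, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for pt in points:
            nearest = min(range(k), key=lambda j: sq_dist(pt, centers[j]))
            groups[nearest].append(pt)
        # Move each center to the mean of its group (keep it if the group is empty).
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def distortion(points, centers):
    """Average squared distance to the nearest center, per dimension (assumes identity covariance)."""
    p = len(points[0])
    return sum(min(sq_dist(pt, c) for c in centers) for pt in points) / (len(points) * p)


def jump_method(points, max_k):
    """Return the k in 1..max_k with the largest jump in transformed distortion."""
    y = len(points[0]) / 2                       # transform power Y = p/2
    transformed = [0.0]                          # D[0] = 0
    for k in range(1, max_k + 1):
        d = distortion(points, kmeans(points, k))
        transformed.append(d ** (-y) if d > 0 else float("inf"))
    jumps = [transformed[k] - transformed[k - 1] for k in range(1, max_k + 1)]
    return jumps.index(max(jumps)) + 1


# Illustrative data: three well-separated 2-D blobs of four points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]
best_k = jump_method(points, max_k=5)
```

On this toy data the largest jump occurs at k = 3. Capping `max_k` well below n is a practical choice made for this sketch; with very few points per cluster the jumps grow again as k approaches n.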
Putting the results above together, it can be seen that for sufficiently high values of p, the transformed distortion $d_K^{-p/2}$ is approximately zero for K < G, then jumps suddenly and begins increasing linearly for K ≥ G. The jump algorithm for choosing K makes use of these behaviors to identify the most likely value for the true number of clusters. Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been empirically verified to work well in a variety of data sets with reasonable dimensionality. In addition to the localized jump method described above, there exists a second algorithm for choosing K using the same transformed distortion values known as the broken line method. The broken line method identifies the jump point in the graph of the transformed distortion by doing a simple least squares error line fit of two line segments, which in theory will fall along the x-axis for K < G, and along the linearly increasing phase of the transformed distortion plot for K ≥ G. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully non-parametric and has been shown to be viable for general mixture distributions. == Silhouette method == The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is match

    Read more →
  • Multidimensional analysis

    Multidimensional analysis

    In statistics, econometrics and related fields, multidimensional analysis (MDA) is a data analysis process that groups data into two categories: data dimensions and measurements. For example, a data set consisting of the number of wins for a single football team at each of several years is a single-dimensional (in this case, longitudinal) data set. A data set consisting of the number of wins for several football teams in a single year is also a single-dimensional (in this case, cross-sectional) data set. A data set consisting of the number of wins for several football teams over several years is a two-dimensional data set. == Higher dimensions == In many disciplines, two-dimensional data sets are also called panel data. While, strictly speaking, two- and higher-dimensional data sets are "multi-dimensional", the term "multidimensional" tends to be applied only to data sets with three or more dimensions. For example, some forecast data sets provide forecasts for multiple target periods, conducted by multiple forecasters, and made at multiple horizons. The three dimensions provide more information than can be gleaned from two-dimensional panel data sets. == Software == Computer software for MDA includes online analytical processing (OLAP) for data in relational databases, pivot tables for data in spreadsheets, and array DBMSs for general multi-dimensional data (such as raster data) in science, engineering, and business.

    Read more →
  • NCover

    NCover

    NCover is a .NET code coverage tool. There are two unrelated NCover products that do .NET code coverage: an open source NCover that can be found on SourceForge, and the product of a company called NCover, LLC. There has been additional development on both products since this 2004 reference. The company NCover, LLC began when the founder, Peter Waldschmidt, decided to commercialize the open source tool he created. The commercial versions were launched in 2007, but the last supported free version, 1.5.8, is still available on the company site.

    Read more →
  • Ilastik

    Ilastik

    ilastik is free open source software for image classification and segmentation. No previous experience in image processing is required to run the software. Since 2018, ilastik has been developed and maintained by Anna Kreshuk's group at the European Molecular Biology Laboratory. == Features == ilastik allows users to annotate an arbitrary number of classes in images with a mouse interface. Using these user annotations and the generic (nonlinear) image features, the user can train a random forest classifier. Trained ilastik classifiers can be applied to new data not included in the training set, either in ilastik via its batch processing functionality or, without using the graphical user interface, in headless mode. ilastik can be integrated into various related tools: Pre-trained workflows can be executed directly from ImageJ/Fiji using the ilastik-ImageJ plugin. Pre-trained ilastik Pixel Classification workflows can be run directly in Python with the ilastik Python package, which is available via conda. ilastik has a CellProfiler module to use ilastik classifiers to process images within a CellProfiler framework. == History == ilastik was first released in 2011 by scientists at the Heidelberg Collaboratory for Image Processing (HCI), University of Heidelberg. == Application == The Interactive Learning and Segmentation Toolkit; Carving; Cell classification and neuron classification; Synapse detection; Cell tracking; Neural Network Classification. == Resources == The ilastik project is hosted on GitHub. It is a collaborative project; any contributions such as comments, bug reports, bug fixes or code contributions are welcome. The ilastik team can be contacted for user support on the image.sc forum.

    Read more →
  • Sammon mapping

    Sammon mapping

    Sammon mapping or Sammon projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality (see multidimensional scaling) by trying to preserve the structure of inter-point distances in high-dimensional space in the lower-dimension projection. It is particularly suited for use in exploratory data analysis. The method was proposed by John W. Sammon in 1969. It is considered a non-linear approach as the mapping cannot be represented as a linear combination of the original variables, as is possible in techniques such as principal component analysis, which also makes it more difficult to use for classification applications. Denote the distance between the ith and jth objects in the original space by $d^{*}_{ij}$, and the distance between their projections by $d_{ij}$. Sammon's mapping aims to minimize the following error function, which is often referred to as Sammon's stress or Sammon's error:

    $$E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{(d^{*}_{ij} - d_{ij})^{2}}{d^{*}_{ij}}$$

    Read more →
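
As a small illustration of Sammon's stress as defined in the excerpt above, the following sketch evaluates E for a given pair of configurations. It only computes the error function and does not perform the minimization; the function name `sammon_stress` and the sample points are hypothetical, and it assumes Python 3.8 or later (for `math.dist`) and that no two original points coincide, so every original distance is nonzero.

```python
import math
from itertools import combinations


def sammon_stress(original, projected):
    """Sammon's stress E between two point configurations listed in the same order."""
    pairs = list(combinations(range(len(original)), 2))            # all index pairs i < j
    d_star = [math.dist(original[i], original[j]) for i, j in pairs]
    d_proj = [math.dist(projected[i], projected[j]) for i, j in pairs]
    # Squared distance errors, each weighted by the inverse original distance.
    weighted_error = sum((ds - dp) ** 2 / ds for ds, dp in zip(d_star, d_proj))
    return weighted_error / sum(d_star)


original = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # pairwise distances 5, 10, 5
exact = [(0.0,), (5.0,), (10.0,)]                 # 1-D layout preserving every distance
distorted = [(0.0,), (4.0,), (10.0,)]             # 1-D layout with distances 4, 10, 6

stress_exact = sammon_stress(original, exact)          # 0.0
stress_distorted = sammon_stress(original, distorted)  # (1/20) * (1/5 + 0 + 1/5) = 0.02
```

Because each squared error is divided by the original distance, errors on short distances are penalized more heavily than the same absolute error on long ones.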

  • Semantic mapping (statistics)

    Semantic mapping (statistics)

    Semantic mapping (SM) is a statistical method for dimensionality reduction (the transformation of data from a high-dimensional space into a low-dimensional space). SM can be applied to a set of multidimensional feature vectors to extract a few new features that preserve the main data characteristics. SM performs dimensionality reduction by clustering the original features into semantic clusters and combining the features mapped to the same cluster to generate an extracted feature. Given a data set, this method constructs a projection matrix that can be used to map a data element from a high-dimensional space into a reduced dimensional space. SM can be applied in the construction of text mining and information retrieval systems, as well as systems managing vectors of high dimensionality. SM is an alternative to random mapping, principal components analysis and latent semantic indexing methods.

    Read more →
  • Google Cloud Dataflow

    Google Cloud Dataflow

    Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem, offering features like autoscaling, dynamic work rebalancing, and a managed execution environment. Dataflow is suitable for large-scale, continuous data processing jobs, and is one of the major components of Google's big data architecture on the Google Cloud Platform. At its core, Dataflow's architecture is designed to abstract away infrastructure management, allowing developers to focus on the logic of their data processing tasks. When a pipeline written using the Apache Beam SDK is submitted, Dataflow translates this high-level definition into an optimized job graph. The service then provisions and manages a fleet of Google Compute Engine workers to execute this graph in a parallelized and fault-tolerant manner. This serverless approach, combined with autoscaling of both the number of workers (horizontal) and the resources per worker (vertical), is intended to match the computational power available to a job to what it needs at any given time, balancing performance and cost. The service's integration with the Google Cloud ecosystem makes it usable for a variety of use cases beyond simple data movement. For real-time analytics, Dataflow can ingest unbounded streams of data from Cloud Pub/Sub, perform complex transformations, and load results into BigQuery for immediate querying. In machine learning workflows, it is commonly used to preprocess and transform massive datasets stored in Cloud Storage, preparing them for training models in Vertex AI. This versatility makes it a common processing engine for ETL (Extract, Transform, Load) operations, streaming analytics, and large-scale data preparation within the cloud. 
== History == Google Cloud Dataflow was announced in June 2014 and released to the general public as an open beta in April 2015. In January 2016 Google donated the underlying SDK, the implementation of a local runner, and a set of IOs (data connectors) to access Google Cloud Platform data services to the Apache Software Foundation. The donated code formed the original basis for Apache Beam. In August 2022, there was an incident where user timers were broken for certain Dataflow streaming pipelines in multiple regions, which was later resolved. Throughout 2023 and 2024, there have been various other updates and incidents affecting Google Cloud Dataflow, as documented in the release notes and service health history. The donation of the Dataflow SDK to the Apache Software Foundation established Apache Beam as a unified, open-source programming model for defining both batch and streaming data pipelines. This move decoupled the pipeline definition from the execution engine. As a result, developers could write portable data processing logic that was not locked into Google's ecosystem. A Beam pipeline can be executed on various runners, including Apache Flink, Apache Spark, and the Google Cloud Dataflow service. == Features == Google Cloud Dataflow supports both batch and streaming data processing pipelines. It automatically handles resource provisioning, data sharding, and scaling according to workload, reducing the manual configuration needed for large-scale data operations. == Use cases == Dataflow is used for ETL (Extract, Transform, Load) data pipelines, real-time analytics, and event stream processing for companies in industries such as finance, advertising, and IoT.

    Read more →
  • Differential evolution

    Differential evolution

    Differential evolution (DE) is an evolutionary algorithm to optimize a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. Such methods are commonly known as metaheuristics as they make few or no assumptions about the optimized problem and can search very large spaces of candidate solutions. However, metaheuristics such as DE do not guarantee an optimal solution is ever found. DE is used for multidimensional real-valued functions but does not use the gradient of the problem being optimized, which means DE does not require the optimization problem to be differentiable, as is required by classic optimization methods such as gradient descent and quasi-Newton methods. DE can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. DE optimizes a problem by maintaining a population of candidate solutions and creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand. In this way, the optimization problem is treated as a black box that merely provides a measure of quality given a candidate solution and the gradient is therefore not needed. == History == Storn and Price introduced Differential Evolution in 1995. Books have been published on theoretical and practical aspects of using DE in parallel computing, multiobjective optimization, and constrained optimization; these books also contain surveys of application areas. Surveys on the multi-faceted research aspects of DE can be found in journal articles. == Algorithm == A basic variant of the DE algorithm works by having a population of candidate solutions (called agents). These agents are moved around in the search-space by using simple mathematical formulae to combine the positions of existing agents from the population. 
If the new position of an agent is an improvement then it is accepted and forms part of the population, otherwise the new position is simply discarded. The process is repeated and by doing so it is hoped, but not guaranteed, that a satisfactory solution will eventually be discovered. Formally, let $f:\mathbb{R}^{n}\to\mathbb{R}$ be the fitness function which must be minimized (note that maximization can be performed by considering the function $h := -f$ instead). The function takes a candidate solution as argument in the form of a vector of real numbers. It produces a real number as output which indicates the fitness of the given candidate solution. The gradient of $f$ is not known. The goal is to find a solution $\mathbf{m}$ for which $f(\mathbf{m}) \leq f(\mathbf{p})$ for all $\mathbf{p}$ in the search-space, which means that $\mathbf{m}$ is the global minimum. Let $\mathbf{x} \in \mathbb{R}^{n}$ designate a candidate solution (agent) in the population. The basic DE algorithm can then be described as follows:

    Choose the parameters $\text{NP} \geq 4$, $\text{CR} \in [0,1]$, and $F \in [0,2]$.
        NP: $\text{NP}$ is the population size, i.e. the number of candidate agents or "parents".
        CR: The parameter $\text{CR} \in [0,1]$ is called the crossover probability.
        F: The parameter $F \in [0,2]$ is called the differential weight.
        Typical settings are $\text{NP} = 10n$, $\text{CR} = 0.9$ and $F = 0.8$. Optimization performance may be greatly impacted by these choices; see below.
    Initialize all agents $\mathbf{x}$ with random positions in the search-space.
    Until a termination criterion is met (e.g. number of iterations performed, or adequate fitness reached), repeat the following:
        For each agent $\mathbf{x}$ in the population do:
            Pick three agents $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ from the population at random, they must be distinct from each other as well as from agent $\mathbf{x}$. ($\mathbf{a}$ is called the "base" vector.)
            Pick a random index $R \in \{1,\ldots,n\}$ where $n$ is the dimensionality of the problem being optimized.
            Compute the agent's potentially new position $\mathbf{y} = [y_{1},\ldots,y_{n}]$ as follows:
                For each $i \in \{1,\ldots,n\}$, pick a uniformly distributed random number $r_{i} \sim U(0,1)$
                If $r_{i} < \text{CR}$

    Read more →
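
The loop described in the excerpt above can be sketched in plain Python. The excerpt breaks off before the update rule is stated, so this sketch assumes the classic DE/rand/1/bin variant: component i of the candidate is a_i + F (b_i - c_i) when r_i < CR or i equals the random index R, and otherwise keeps x_i, with the candidate replacing the agent only if its fitness is no worse. The function name `differential_evolution`, its parameter names, the box-shaped search space `bounds` and the `sphere` test function are all hypothetical choices for the illustration.

```python
import random


def differential_evolution(f, bounds, np_size=20, cr=0.9, f_weight=0.8,
                           generations=300, seed=0):
    """Minimise f over the box `bounds` with a basic DE/rand/1/bin loop (assumed variant)."""
    rng = random.Random(seed)
    n = len(bounds)
    # Initialise all agents with random positions in the search space.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(np_size)]
    scores = [f(x) for x in population]
    for _ in range(generations):
        for idx in range(np_size):
            # Three distinct agents a, b, c, all different from the current agent x.
            others = [j for j in range(np_size) if j != idx]
            a, b, c = (population[j] for j in rng.sample(others, 3))
            forced = rng.randrange(n)             # index R: always takes the mutant value
            x = population[idx]
            y = [
                a[i] + f_weight * (b[i] - c[i])
                if (rng.random() < cr or i == forced) else x[i]
                for i in range(n)
            ]
            score = f(y)
            if score <= scores[idx]:              # accept the candidate only if no worse
                population[idx], scores[idx] = y, score
    best = min(range(np_size), key=scores.__getitem__)
    return population[best], scores[best]


def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_score = differential_evolution(sphere, [(-5.0, 5.0)] * 2)
```

The population size here is the typical NP = 10n for n = 2, with the typical CR and F values quoted in the text; a fixed generation count stands in for the termination criterion.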

  • FastICA

    FastICA

    FastICA is an efficient and popular algorithm for independent component analysis invented by Aapo Hyvärinen at Helsinki University of Technology. Like most ICA algorithms, FastICA seeks an orthogonal rotation of prewhitened data, through a fixed-point iteration scheme, that maximizes a measure of non-Gaussianity of the rotated components. Non-Gaussianity serves as a proxy for statistical independence, which is a very strong condition and requires infinite data to verify. FastICA can also be alternatively derived as an approximative Newton iteration. == Algorithm == === Prewhitening the data === Let $\mathbf{X} := (x_{ij}) \in \mathbb{R}^{N\times M}$ denote the input data matrix, $M$ the number of columns corresponding with the number of samples of mixed signals and $N$ the number of rows corresponding with the number of independent source signals. The input data matrix $\mathbf{X}$ must be prewhitened, or centered and whitened, before applying the FastICA algorithm to it. Centering the data entails demeaning each component of the input data $\mathbf{X}$, that is,

    $$x_{ij} \leftarrow x_{ij} - \frac{1}{M}\sum_{j'} x_{ij'}$$

    for each $i = 1,\ldots,N$ and $j = 1,\ldots,M$. After centering, each row of $\mathbf{X}$ has an expected value of $0$. Whitening the data requires a linear transformation $\mathbf{L}:\mathbb{R}^{N\times M}\to\mathbb{R}^{N\times M}$ of the centered data so that the components of $\mathbf{L}(\mathbf{X})$ are uncorrelated and have variance one. More precisely, if $\mathbf{X}$ is a centered data matrix, the covariance of $\mathbf{L}_{\mathbf{x}} := \mathbf{L}(\mathbf{X})$ is the $(N\times N)$-dimensional identity matrix, that is,

    $$E\left\{\mathbf{L}_{\mathbf{x}}\mathbf{L}_{\mathbf{x}}^{T}\right\} = \mathbf{I}_{N}$$

    A common method for whitening is by performing an eigenvalue decomposition on the covariance matrix of the centered data $\mathbf{X}$, $E\left\{\mathbf{X}\mathbf{X}^{T}\right\} = \mathbf{E}\mathbf{D}\mathbf{E}^{T}$, where $\mathbf{E}$ is the matrix of eigenvectors and $\mathbf{D}$ is the diagonal matrix of eigenvalues. The whitened data matrix is defined thus by

    $$\mathbf{X} \leftarrow \mathbf{D}^{-1/2}\mathbf{E}^{T}\mathbf{X}$$

    === Single component extraction === The iterative algorithm finds the direction for the weight vector $\mathbf{w} \in \mathbb{R}^{N}$ that maximizes a measure of non-Gaussianity of the projection $\mathbf{w}^{T}\mathbf{X}$, with $\mathbf{X} \in \mathbb{R}^{N\times M}$ denoting a prewhitened data matrix as described above. Note that $\mathbf{w}$ is a column vector. To measure non-Gaussianity, FastICA relies on a nonquadratic nonlinear function $f(u)$, its first derivative $g(u)$, and its second derivative $g'(u)$. Hyvärinen states that the functions

    $$f(u) = \log\cosh(u), \quad g(u) = \tanh(u), \quad g'(u) = 1 - \tanh^{2}(u)$$

    are useful for general purposes, while

    $$f(u) = -e^{-u^{2}/2}, \quad g(u) = u\,e^{-u^{2}/2}, \quad g'(u) = (1 - u^{2})\,e^{-u^{2}/2}$$

    may be highly robust. The steps for extracting the weight vector $\mathbf{w}$ for a single component in FastICA are the following:

    1. Randomize the initial weight vector $\mathbf{w}$
    2. Let $\mathbf{w}^{+} \leftarrow E\left\{\mathbf{X} g(\mathbf{w}^{T}\mathbf{X})^{T}\right\} - E\left\{g'(\mathbf{w}^{T}\mathbf{X})\right\}\mathbf{w}$, where $E\left\{\ldots\right\}$ means averaging over all column-vectors of matrix $\mathbf{X}$
    3. Let $\mathbf{w} \leftarrow \mathbf{w}^{+}/\|\mathbf{w}^{+}\|$
    4. If not converged, go back to 2

    === Multiple component extraction === The single unit iterative algorithm estimates only one weight vector which extracts a single component. Estimating additional components that are mutually "independent" requires repeating the algorithm to obtain linearly independent projection vectors; note that the notion of independence here refers to maximizing non-Gaussianity in the estimated components. Hyvärinen provides several ways of extracting multiple components with the simplest being the following. Here, $\mathbf{1_{M}}$ is a column vector of 1's of dimension $M$.

    Algorithm FastICA
        Input: $C$ Number of desired components
        Input: $\mathbf{X} \in \mathbb{R}^{N\times M}$ Prewhitened matrix, where each column represents an $N$-dimensional sample, where $C \leq N$
        Output: $\mathbf{W} \in \mathbb{R}^{N\times C}$ Un-mixing matrix where each column projects $\mathbf{X}$ onto an independent component.
        Output: $\mathbf{S} \in \mathbb{R}^{C\times M}$ Independent components matrix, with $M$ columns representing a sample with $C$ dimensions.

        for p in 1 to C:
            $\mathbf{w_{p}} \leftarrow$ Random vector of length N
            while $\mathbf{w_{p}}$ changes
                $\mathbf{w_{p}} \leftarrow \frac{1}{M}\mathbf{X} g(\mathbf{w_{p}}^{T}\mathbf{X})^{T} - \frac{1}{M} g'(\mathbf{w_{p}}^{T}\mathbf{X})\mathbf{1_{M}}\mathbf{w_{p}}$
                $\mathbf{w_{p}} \leftarrow \mathbf{w_{p}} - \sum_{j=1}^{p-1}(\mathbf{w_{p}}^{T}\mathbf{w_{j}})\mathbf{w_{j}}$
                $\mathbf{w_{p}} \leftarrow \frac{\mathbf{w_{p}}}{\|\mathbf{w_{p}}\|}$
        output $\mathbf{W} \leftarrow \begin{bmatrix}\mathbf{w_{1}},\dots,\mathbf{w_{C}}\end{bmatrix}$
        output $\mathbf{S} \leftarrow \mathbf{W^{T}}\mathbf{X}$
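
The single-component fixed-point iteration can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than a full FastICA implementation: it uses g(u) = tanh(u), expects data that is already prewhitened, stores the data as a list of M column samples instead of an N × M matrix, and extracts one component only. The helper names `normalize` and `fastica_one_unit`, the convergence test on the absolute inner product, and the demo data are all hypothetical. The demo builds two independent, zero-mean, unit-variance grid-valued sources and mixes them with a rotation, which keeps the mixture white.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def fastica_one_unit(samples, max_iter=200, tol=1e-12, seed=0):
    """Estimate one weight vector w from prewhitened samples (a list of N-vectors)."""
    rng = random.Random(seed)
    n, m = len(samples[0]), len(samples)
    w = normalize([rng.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(max_iter):
        # w+ = E{x g(w^T x)} - E{g'(w^T x)} w, with g = tanh and g' = 1 - tanh^2
        xg = [0.0] * n
        g_prime_sum = 0.0
        for x in samples:
            t = math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(n):
                xg[i] += x[i] * t
            g_prime_sum += 1.0 - t * t
        w_new = normalize([xg[i] / m - (g_prime_sum / m) * w[i] for i in range(n)])
        # Converged when the direction stops changing; a sign flip is harmless.
        converged = abs(abs(sum(a * b for a, b in zip(w_new, w))) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w


# Demo data: two independent grid-valued sources, mixed by a rotation.
grid = [-1 + k / 10 for k in range(21)]
scale = math.sqrt(sum(v * v for v in grid) / len(grid))
source = [v / scale for v in grid]                       # zero mean, unit variance
angle = 0.6
cos_a, sin_a = math.cos(angle), math.sin(angle)
samples = [(cos_a * s1 - sin_a * s2, sin_a * s1 + cos_a * s2)
           for s1 in source for s2 in source]

w = fastica_one_unit(samples)
# Up to sign, w should line up with a column of the rotation, i.e. with one source.
alignment = max(abs(w[0] * d[0] + w[1] * d[1])
                for d in [(cos_a, sin_a), (-sin_a, cos_a)])
```

To extract further components, the deflation step from the algorithm above (subtracting the projections onto previously found vectors before normalizing) would be added inside the loop.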

    Read more →