Large margin nearest neighbor (LMNN) classification is a statistical machine learning algorithm for metric learning. It learns a pseudometric designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming, a sub-class of convex optimization. The goal of supervised learning (more specifically classification) is to learn a decision rule that can categorize data instances into pre-defined classes. The k-nearest neighbor rule assumes a training data set of labeled instances (i.e. the classes are known). It classifies a new data instance with the class obtained from the majority vote of the k closest (labeled) training instances. Closeness is measured with a pre-defined metric. Large margin nearest neighbors is an algorithm that learns this global (pseudo-)metric in a supervised fashion to improve the classification accuracy of the k-nearest neighbor rule. == Setup == The main intuition behind LMNN is to learn a pseudometric under which all data instances in the training set are surrounded by at least k instances that share the same class label. If this is achieved, the leave-one-out error (a special case of cross validation) is minimized. Let the training data consist of a data set D = { ( x → 1 , y 1 ) , … , ( x → n , y n ) } ⊂ R d × C {\displaystyle D=\{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}\subset R^{d}\times C} , where the set of possible class categories is C = { 1 , … , c } {\displaystyle C=\{1,\dots ,c\}} . The algorithm learns a pseudometric of the type d ( x → i , x → j ) = ( x → i − x → j ) ⊤ M ( x → i − x → j ) {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})=({\vec {x}}_{i}-{\vec {x}}_{j})^{\top }\mathbf {M} ({\vec {x}}_{i}-{\vec {x}}_{j})} . For d ( ⋅ , ⋅ ) {\displaystyle d(\cdot ,\cdot )} to be well defined, the matrix M {\displaystyle \mathbf {M} } needs to be positive semi-definite. The Euclidean metric is a special case, where M {\displaystyle \mathbf {M} } is the identity matrix. This generalization is often (falsely) referred to as Mahalanobis metric. Figure 1 illustrates the effect of the metric under varying M {\displaystyle \mathbf {M} } . The two circles show the set of points with equal distance to the center x → i {\displaystyle {\vec {x}}_{i}} . In the Euclidean case this set is a circle, whereas under the modified (Mahalanobis) metric it becomes an ellipsoid. The algorithm distinguishes between two types of special data points: target neighbors and impostors. === Target neighbors === Target neighbors are selected before learning. Each instance x → i {\displaystyle {\vec {x}}_{i}} has exactly k {\displaystyle k} different target neighbors within D {\displaystyle D} , which all share the same class label y i {\displaystyle y_{i}} . The target neighbors are the data points that should become nearest neighbors under the learned metric. Let us denote the set of target neighbors for a data point x → i {\displaystyle {\vec {x}}_{i}} as N i {\displaystyle N_{i}} . === Impostors === An impostor of a data point x → i {\displaystyle {\vec {x}}_{i}} is another data point x → j {\displaystyle {\vec {x}}_{j}} with a different class label (i.e. y i ≠ y j {\displaystyle y_{i}\neq y_{j}} ) which is one of the nearest neighbors of x → i {\displaystyle {\vec {x}}_{i}} . During learning the algorithm tries to minimize the number of impostors for all data instances in the training set. == Algorithm == Large margin nearest neighbors optimizes the matrix M {\displaystyle \mathbf {M} } with the help of semidefinite programming. The objective is twofold: For every data point x → i {\displaystyle {\vec {x}}_{i}} , the target neighbors should be close and the impostors should be far away. Figure 1 shows the effect of such an optimization on an illustrative example. The learned metric causes the input vector x → i {\displaystyle {\vec {x}}_{i}} to be surrounded by training instances of the same class. If it was a test point, it would be classified correctly under the k = 3 {\displaystyle k=3} nearest neighbor rule. The first optimization goal is achieved by minimizing the average distance between instances and their target neighbors ∑ i , j ∈ N i d ( x → i , x → j ) {\displaystyle \sum _{i,j\in N_{i}}d({\vec {x}}_{i},{\vec {x}}_{j})} . The second goal is achieved by penalizing distances to impostors x → l {\displaystyle {\vec {x}}_{l}} that are less than one unit further away than target neighbors x → j {\displaystyle {\vec {x}}_{j}} (and therefore pushing them out of the local neighborhood of x → i {\displaystyle {\vec {x}}_{i}} ). The resulting value to be minimized can be stated as: ∑ i , j ∈ N i , l , y l ≠ y i [ d ( x → i , x → j ) + 1 − d ( x → i , x → l ) ] + {\displaystyle \sum _{i,j\in N_{i},l,y_{l}\neq y_{i}}[d({\vec {x}}_{i},{\vec {x}}_{j})+1-d({\vec {x}}_{i},{\vec {x}}_{l})]_{+}} With a hinge loss function [ ⋅ ] + = max ( ⋅ , 0 ) {\textstyle [\cdot ]_{+}=\max(\cdot ,0)} , which ensures that impostor proximity is not penalized when outside the margin. The margin of exactly one unit fixes the scale of the matrix M {\displaystyle M} . Any alternative choice c > 0 {\displaystyle c>0} would result in a rescaling of M {\displaystyle M} by a factor of 1 / c {\displaystyle 1/c} . The final optimization problem becomes: min M ∑ i , j ∈ N i d ( x → i , x → j ) + λ ∑ i , j , l ξ i j l {\displaystyle \min _{\mathbf {M} }\sum _{i,j\in N_{i}}d({\vec {x}}_{i},{\vec {x}}_{j})+\lambda \sum _{i,j,l}\xi _{ijl}} ∀ i , j ∈ N i , l , y l ≠ y i {\displaystyle \forall _{i,j\in N_{i},l,y_{l}\neq y_{i}}} d ( x → i , x → j ) + 1 − d ( x → i , x → l ) ≤ ξ i j l {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})+1-d({\vec {x}}_{i},{\vec {x}}_{l})\leq \xi _{ijl}} ξ i j l ≥ 0 {\displaystyle \xi _{ijl}\geq 0} M ⪰ 0 {\displaystyle \mathbf {M} \succeq 0} The hyperparameter λ > 0 {\textstyle \lambda >0} is some positive constant (typically set through cross-validation). Here the variables ξ i j l {\displaystyle \xi _{ijl}} (together with two types of constraints) replace the term in the cost function. They play a role similar to slack variables to absorb the extent of violations of the impostor constraints. The last constraint ensures that M {\displaystyle \mathbf {M} } is positive semi-definite. The optimization problem is an instance of semidefinite programming (SDP). Although SDPs tend to suffer from high computational complexity, this particular SDP instance can be solved very efficiently due to the underlying geometric properties of the problem. In particular, most impostor constraints are naturally satisfied and do not need to be enforced during runtime (i.e. the set of variables ξ i j l {\displaystyle \xi _{ijl}} is sparse). A particularly well suited solver technique is the working set method, which keeps a small set of constraints that are actively enforced and monitors the remaining (likely satisfied) constraints only occasionally to ensure correctness. == Extensions and efficient solvers == LMNN was extended to multiple local metrics in the 2008 paper. This extension significantly improves the classification error, but involves a more expensive optimization problem. In their 2009 publication in the Journal of Machine Learning Research, Weinberger and Saul derive an efficient solver for the semi-definite program. It can learn a metric for the MNIST handwritten digit data set in several hours, involving billions of pairwise constraints. An open source Matlab implementation is freely available at the authors web page. Kumal et al. extended the algorithm to incorporate local invariances to multivariate polynomial transformations and improved regularization.
Qloo
Qloo (pronounced "clue") is a company that uses artificial intelligence (AI) to understand taste and cultural correlations. It provides companies with an application programming interface (API). It received funding from Leonardo DiCaprio, Elton John, Barry Sternlicht, Pierre Lagrange and others. Qloo establishes consumer preference correlations via machine learning across data spanning cultural domains including music, film, television, dining, nightlife, fashion, books, and travel. The recommender system uses AI to predict correlations for further applications. == History == Qloo was founded in 2012 by chief executive officer Alex Elias and chief operating officer Jay Alger. Qloo initially launched an app designed for consumers, allowing them to understand their own tastes and receive personalized recommendations. The company amassed several million users and built a large catalog of cultural entities and corresponding user sentiment. In 2012, Qloo raised $1.4 million in seed funding from investors including Cedric the Entertainer, and venture capital firm Kindler Capital. Qloo had a public beta release in November 2012 after its initial funding. In 2013, the company raised an additional $1.6 million from Cross Creek Pictures founding partner Tommy Thompson, and Samih Toukan and Hussam Khoury, founders of Maktoob, an Internet services company purchased by Yahoo! for $164 million in 2009. On November 14, 2013, a website and an iPhone app were announced. The company later released an Android app, and tablet versions, in mid-2014. In 2015, Twitter approached Qloo about powering personalized social feeds and targeted eCommerce ads on the platform based on what users were posting. Qloo developed an enterprise-grade API to support Twitter’s needs. Twitter ended up pivoting to enable brands to use the social platform for customer service and support, but Qloo was able to sell access to its cultural intelligence via API to many other enterprise clients, marking the official transition from a B2C company to a B2B company. In 2016, Qloo secured $4.5 million in venture capital investment. The $4.5 million was split between a number of investors, including Barry Sternlicht, Pierre Lagrange, and Leonardo DiCaprio. In July 2017, Qloo raised $6.5 million in funding rounds from AXA Strategic Ventures, and Elton John. Following the investment, the founders stated in an interview with Tech Crunch that they would use the investment to expand Qloo's database. They hoped the move would secure larger contracts with corporate clients. At the time, clients already included Fortune 500 companies such as Twitter, PepsiCo, and BMW. In 2019, the company announced that it had acquired cultural recommendation service TasteDive, with Alex Elias becoming chairman of TasteDive. In September 2019, Qloo was named among the Top 14 Artificial Intelligence APIs by ProgrammableWeb. In 2022, Qloo raised $15M in Series B funding from Eldridge and AXA Venture Partners, enabling the privacy-centric AI leader to expand its team of world-class data scientists, enrich its technology, and build on its sales channels in order to continue to offer premier insights into global consumer taste for Fortune 500 companies across the globe. Qloo was recognized as the "Best Decision Intelligence Company" at the 2023 AI Breakthrough Awards. Also in 2023, the company was awarded a Top Performer Award by SourceForge. As of 2024, Qloo is a three-time Inc. 5000 honoree: No. 360 (2022), No. 344 (2021), No. 187 (2020). Qloo raised $25 million Series C round on February 21, 2024. The round was led by AI Ventures with participation from AXA Venture Partners, Eldridge, and Moderne Ventures, allowing Qloo to address new commercial surface areas for Taste AI, including on-device learning and foundational models leveraging Qloo, as well as introduce self-service platform to make consumer and taste analytics available to small and mid-sized enterprises and individuals. Qloo also announced pursuing opportunistic M&A using its balance sheet along the lines of the TasteDive acquisition completed, which expanded Qloo's first-party data moat and corpus of cultural learning. This latest financing brought the total amount raised since the company's founding in 2012 to over $56 million. == Services and features == Qloo calls itself a cultural AI platform to provide real-time correlation data across domains of culture and entertainment including: film, music, television, dining, nightlife, fashion, books, and travel. Each category contains subcategories. Qloo’s knowledge of a user's taste in one category can be utilized to offer suggestions in other categories. Users then rate the suggestions, providing it with feedback for future suggestions. Qloo has partnerships with companies such as Expedia and iTunes. == Technology == Qloo’s Taste AI technology uses machine learning to decode and predict consumers’ interests, maintaining user anonymity. It is powered by 3.7 billion lifestyle entities (brands, music, film, TV, dining, nightlife, fashion, books, travel, and more) and trillions of anonymized consumer behavioral signals. Through AI, Qloo identifies patterns in these data signals, making predictions about how much interest a person or group has in a concept or thing. Central to Qloo’s technology are algorithms designed to detect and mitigate biases within datasets and models, allowing Qloo to assess the fairness of its AI systems with a focus on attributes such as age, gender, and race, enabling the company to fine-tune its AI models to align with their ethical standards. They also use visualization tools to probe the behavior of their AI models for conducting counterfactual analyses and for comparing the performances of the AI models across diverse demographic segments. Qloo’s Taste AI doesn’t collect or use any Personally Identifiable Information (PII). Instead, it derives recommendations for audience segments based on co-occurrences between lifestyle entities and anonymized behavioral signals. == Applications == Starbucks uses Qloo to create in-store music playlists tailored to specific neighborhoods. Hershey’s uses Qloo to customize the content of assorted candy bags. Michelin uses Qloo to serve recommendations in its Michelin Guide app. Netflix leverages Qloo’s technology to enhance merchandising by identifying actors who resonate with certain demographics. Qloo also works with PepsiCo, Samsung, The New York Mets, BuzzFeed, and Ticketmaster, Universal Music Group, and OOH advertising company JCDecaux.
Induction of regular languages
In computational learning theory, induction of regular languages refers to the task of learning a formal description (e.g. grammar) of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way (see language identification in the limit), approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction. == Definitions == A regular language is defined as a (finite or infinite) set of strings that can be described by one of the mathematical formalisms called "finite automaton", "regular grammar", or "regular expression", all of which have the same expressive power. Since the latter formalism leads to shortest notations, it shall be introduced and used here. Given a set Σ of symbols (a.k.a. alphabet), a regular expression can be any of ∅ (denoting the empty set of strings), ε (denoting the singleton set containing just the empty string), a (where a is any character in Σ; denoting the singleton set just containing the single-character string a), r + s (where r and s are, in turn, simpler regular expressions; denoting their set's union) r ⋅ s (denoting the set of all possible concatenations of strings from r's and s's set), r + (denoting the set of n-fold repetitions of strings from r's set, for any n ≥ 1), or r (similarly denoting the set of n-fold repetitions, but also including the empty string, seen as 0-fold repetition). For example, using Σ = {0,1}, the regular expression (0+1+ε)⋅(0+1) denotes the set of all binary numbers with one or two digits (leading zero allowed), while 1⋅(0+1)⋅0 denotes the (infinite) set of all even binary numbers (no leading zeroes). Given a set of strings (also called "positive examples"), the task of regular language induction is to come up with a regular expression that denotes a set containing all of them. As an example, given {1, 10, 100}, a "natural" description could be the regular expression 1⋅0, corresponding to the informal characterization "a 1 followed by arbitrarily many (maybe even none) 0's". However, (0+1) and 1+(1⋅0)+(1⋅0⋅0) is another regular expression, denoting the largest (assuming Σ = {0,1}) and the smallest set containing the given strings, and called the trivial overgeneralization and undergeneralization, respectively. Some approaches work in an extended setting where also a set of "negative example" strings is given; then, a regular expression is to be found that generates all of the positive, but none of the negative examples. == Lattice of automata == Dupont et al. have shown that the set of all structurally complete finite automata generating a given input set of example strings forms a lattice, with the trivial undergeneralized and the trivial overgeneralized automaton as bottom and top element, respectively. Each member of this lattice can be obtained by factoring the undergeneralized automaton by an appropriate equivalence relation. For the above example string set {1, 10, 100}, the picture shows at its bottom the undergeneralized automaton Aa,b,c,d in grey, consisting of states a, b, c, and d. On the state set {a,b,c,d}, a total of 15 equivalence relations exist, forming a lattice. Mapping each equivalence E to the corresponding quotient automaton language L(Aa,b,c,d / E) obtains the partially ordered set shown in the picture. Each node's language is denoted by a regular expression. The language may be recognized by quotient automata w.r.t. different equivalence relations, all of which are shown below the node. An arrow between two nodes indicates that the lower node's language is a proper subset of the higher node's. If both positive and negative example strings are given, Dupont et al. build the lattice from the positive examples, and then investigate the separation border between automata that generate some negative example and such that do not. Most interesting are those automata immediately below the border. In the picture, separation borders are shown for the negative example strings 11 (green), 1001 (blue), 101 (cyan), and 0 (red). Coste and Nicolas present an own search method within the lattice, which they relate to Mitchell's version space paradigm. To find the separation border, they use a graph coloring algorithm on the state inequality relation induced by the negative examples. Later, they investigate several ordering relations on the set of all possible state fusions. Kudo and Shimbo use the representation by automaton factorizations to give a unique framework for the following approaches (sketched below): k-reversible languages and the "tail clustering" follow-up approach, Successor automata and the predecessor-successor method, and pumping-based approaches (framework-integration challenged by Luzeaux, however). Each of these approaches is shown to correspond to a particular kind of equivalence relations used for factorization. == Approaches == === k-reversible languages === Angluin considers so-called "k-reversible" regular automata, that is, deterministic automata in which each state can be reached from at most one state by following a transition chain of length k. Formally, if Σ, Q, and δ denote the input alphabet, the state set, and the transition function of an automaton A, respectively, then A is called k-reversible if: ∀a0, ..., ak ∈ Σ ∀s1, s2 ∈ Q: δ(s1, a0...ak) = δ(s2, a0...ak) ⇒ s1 = s2, where δ means the homomorphic extension of δ to arbitrary words. Angluin gives a cubic algorithm for learning of the smallest k-reversible language from a given set of input words; for k = 0, the algorithm has even almost linear complexity. The required state uniqueness after k + 1 given symbols forces unifying automaton states, thus leading to a proper generalization different from the trivial undergeneralized automaton. This algorithm has been used to learn simple parts of English syntax; later, an incremental version has been provided. Another approach based on k-reversible automata is the tail clustering method. === Successor automata === From a given set of input strings, Vernadat and Richetin build a so-called successor automaton, consisting of one state for each distinct character and a transition between each two adjacent characters' states. For example, the singleton input set {aabbaabb} leads to an automaton corresponding to the regular expression (a+⋅b+). An extension of this approach is the predecessor-successor method which generalizes each character repetition immediately to a Kleene + and then includes for each character the set of its possible predecessors in its state. Successor automata can learn exactly the class of local languages. Since each regular language is the homomorphic image of a local language, grammars from the former class can be learned by lifting, if an appropriate (depending on the intended application) homomorphism is provided. In particular, there is such a homomorphism for the class of languages learnable by the predecessor-successor method. The learnability of local languages can be reduced to that of k-reversible languages. === Early approaches === Chomsky and Miller (1957) used the pumping lemma: they guess a part v of an input string uvw and try to build a corresponding cycle into the automaton to be learned; using membership queries they ask, for appropriate k, which of the strings uw, uvvw, uvvvw, ..., uvkw also belongs to the language to be learned, thereby refining the structure of their automaton. In 1959, Solomonoff generalized this approach to context-free languages, which also obey a pumping lemma. === Cover automata === Câmpeanu et al. learn a finite automaton as a compact representation of a large finite language. Given such a language F, they search a so-called cover automaton A such that its language L(A) covers F in the following sense: L(A) ∩ Σ≤ l = F, where l is the length of the longest string in F, and Σ≤ l denotes the set of all strings not longer than l. If such a cover automaton exists, F is uniquely determined by A and l. For example, F = {ad, read, reread } has l = 6 and a cover automaton corresponding to the regular expression (r⋅e)⋅a⋅d. For two strings x and y, Câmpeanu et al. define x ~ y if xz ∈ F ⇔ yz ∈ F for all strings z of a length such that both xz and yz are not longer than l. Based on this relation, whose lack of transitivity causes considerable technical problems, they give an O(n4) algorithm to construct from F a cover automaton A of minimal state count. Moreover, for union, intersection, and difference of two finite languages they provide corresponding operations on their cover automata. Păun et al. improve the time complexity to O(n2). === Residual automata === For a set S of strings and a string u, the Brzozowski derivative u−1S is defined as the set of all rest-strings obtainable from a string in S by cutting off its prefix u (if possible), formally: u−1S = {v ∈ Σ: uv ∈ S}, cf. picture. Denis et al. define a
Genotypic and phenotypic repair
Genotypic and phenotypic repair are optional components of an evolutionary algorithm (EA). An EA reproduces essential elements of biological evolution as a computer algorithm in order to solve demanding optimization or planning tasks, at least approximately. A candidate solution is represented by a - usually linear - data structure that plays the role of an individual's chromosome. New solution candidates are generated by mutation and crossover operators following the example of biology. These offspring may be defective, which is corrected or compensated for by genotypic or phenotypic repair. == Description == Genotypic repair, also known as genetic repair, is the removal or correction of impermissible entries in the chromosome that violate restrictions. In phenotypic repair, the corrections are only made in the genotype-phenotype mapping and the chromosome remains unchanged. Michalewicz wrote about the importance of restrictions in real-world applications: "In general, constraints are an integral part of the formulation of any problem". Restriction violations are application-specific and therefore it depends on the current problem whether and which type of repair is useful. They can usually also be treated by a correspondingly extended evaluation and it depends on the problem which measures are possible and which is the most suitable. If a phenotypic repair is feasible, then it is usually the most efficient compared to the other measures. A survey on repair methods used as constraint handling techniques can be found in. Violations of the range limits of genes should be avoided as far as possible by the formulation of the genome. If this is not possible or if restrictions within the search space defined by the genome are involved, their violations are usually handled by the evaluation. This can be done, for example, by penalty functions that lower the fitness. Repair is often also required for combinatorial tasks. The application of a 1- or n-point crossover operator can, for example, lead to genes being missing in one of the child genomes that are present in duplicate in the other. In this case, a suitable genotypic repair measure is to move the surplus genes to the other genome in a positional manner. The use of the aforementioned operators in combinatorial tasks has also proven to be useful in combination with crossover types specially developed for permutations, at least for certain problems. Particularly in combinatorial problems, it has been observed that genotypic repair can promote premature convergence to a suboptimum, but can also significantly accelerate a successful search. Studies on various tasks have shown that this is application-dependent. An effective measure to avoid premature convergence is generally the use of structured populations instead of the usual panmictic ones. Sequence restrictions play a role in many scheduling tasks, for example when it comes to planning workflows. If, for example, it is specified that step A must be carried out before step B and the gene of step B is located before the gene of A in the chromosome, then there is an impermissible gene sequence. This is because the scheduling operation of step B requires the planned end of step A for correct scheduling, but this is not yet scheduled at the time gene B is processed. The problem can be solved in two ways: The scheduling operation of step B is postponed until the gene from step A has been processed. The genome remains unchanged and the repair only influences the genotype-phenotype mapping. Since only the phenotype is changed, this is referred to as phenotypic repair. If, on the other hand, the gene of step B is moved behind the gene of step A, this is a genotypic repair. The same applies to the alternative shift of gene A in front of gene B. In this case, genotypic repair has the disadvantage that it prevents a meaningful restructuring of the gene sequence in the chromosome if this requires several intermediate steps (mutations) that at least partially violate restrictions.
Consensus clustering
Consensus clustering is a method of aggregating (potentially conflicting) results from multiple clustering algorithms. Also called cluster ensembles or aggregation of clustering (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete, even when the number of input clusterings is three. Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning. == Issues with existing clustering techniques == Current clustering techniques do not address all the requirements adequately. Dealing with large number of dimensions and large number of data items can be problematic because of time complexity; Effectiveness of the method depends on the definition of "distance" (for distance-based clustering) If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces. The result of the clustering algorithm (that, in many cases, can be arbitrary itself) can be interpreted in different ways. == Justification for using consensus clustering == There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence about the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive. Iterative descent clustering methods, such as the SOM and k-means clustering circumvent some of the shortcomings of hierarchical clustering by providing for univocally defined clusters and cluster boundaries. Consensus clustering provides a method that represents the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. It can provide data for a visualization tool to inspect cluster number, membership, and boundaries. However, they lack the intuitive and visual appeal of hierarchical clustering dendrograms, and the number of clusters must be chosen a priori. == The Monti consensus clustering algorithm == The Monti consensus clustering algorithm is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, K {\displaystyle K} . Given a dataset of N {\displaystyle N} total number of points to cluster, this algorithm works by resampling and clustering the data, for each K {\displaystyle K} and a N × N {\displaystyle N\times N} consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or not together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal K {\displaystyle K} . More specifically, given a set of points to cluster, D = { e 1 , e 2 , . . . e N } {\displaystyle D=\{e_{1},e_{2},...e_{N}\}} , let D 1 , D 2 , . . . , D H {\displaystyle D^{1},D^{2},...,D^{H}} be the list of H {\displaystyle H} perturbed (resampled) datasets of the original dataset D {\displaystyle D} , and let M h {\displaystyle M^{h}} denote the N × N {\displaystyle N\times N} connectivity matrix resulting from applying a clustering algorithm to the dataset D h {\displaystyle D^{h}} . The entries of M h {\displaystyle M^{h}} are defined as follows: M h ( i , j ) = { 1 , if points i and j belong to the same cluster 0 , otherwise {\displaystyle M^{h}(i,j)={\begin{cases}1,&{\text{if}}{\text{ points i and j belong to the same cluster}}\\0,&{\text{otherwise}}\end{cases}}} Let I h {\displaystyle I^{h}} be the N × N {\displaystyle N\times N} identicator matrix where the ( i , j ) {\displaystyle (i,j)} -th entry is equal to 1 if points i {\displaystyle i} and j {\displaystyle j} are in the same perturbed dataset D h {\displaystyle D^{h}} , and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration for the normalisation step. The consensus matrix C {\displaystyle C} is defined as the normalised sum of all connectivity matrices of all the perturbed datasets and a different one is calculated for every K {\displaystyle K} . C ( i , j ) = ( ∑ h = 1 H M h ( i , j ) ∑ h = 1 H I h ( i , j ) ) {\displaystyle C(i,j)=\left({\frac {\textstyle \sum _{h=1}^{H}M^{h}(i,j)\displaystyle }{\sum _{h=1}^{H}I^{h}(i,j)}}\right)} That is the entry ( i , j ) {\displaystyle (i,j)} in the consensus matrix is the number of times points i {\displaystyle i} and j {\displaystyle j} were clustered together divided by the total number of times they were selected together. The matrix is symmetric and each element is defined within the range [ 0 , 1 ] {\displaystyle [0,1]} . A consensus matrix is calculated for each K {\displaystyle K} to be tested, and the stability of each matrix, that is how far the matrix is towards a matrix of perfect stability (just zeros and ones) is used to determine the optimal K {\displaystyle K} . One way of quantifying the stability of the K {\displaystyle K} th consensus matrix is examining its CDF curve (see below). == Over-interpretation potential of the Monti consensus clustering algorithm == Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution as shown by Şenbabaoğlu et al. It has been shown that the Monti consensus clustering algorithm is able to claim apparent stability of chance partitioning of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study. If clusters are not well separated, consensus clustering could lead one to conclude apparent structure when there is none, or declare cluster stability when it is subtle. Identifying false positive clusters is a common problem throughout cluster research, and has been addressed by methods such as SigClust and the GAP-statistic. However, these methods rely on certain assumptions for the null model that may not always be appropriate. Şenbabaoğlu et al demonstrated the original delta K metric to decide K {\displaystyle K} in the Monti algorithm performed poorly, and proposed a new superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represent those with ambiguous assignments in different clustering runs. The proportion of ambiguous clustering (PAC) score measure quantifies this middle segment; and is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ∈ [0, 1] where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1=0.1 and u2=0.9). A low value of PAC indicates a flat middle segment, and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters by the K {\displaystyle K} value having the lowest PAC. == Related work == Clustering ensemble (Strehl and Ghosh): They considered various formulations for the problem, most of which reduce the problem to a hyper-graph partitioning problem. In one of their formulations they considered the same graph as in the correlation clustering problem. The solution they proposed is to compute the best k-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart. Clustering aggregation (Fern and Brodley): They applied the clustering aggregation idea to a collection of soft clusterings they obtained by random projections. They used an agglomerative algorithm
Logical Machine Corporation
Logical Machine Corporation (LOMAC) was an American computer company active from the mid-1970s to the 1980s and based in the San Francisco Bay Area. It was founded as John Peers and Company by the British entrepreneur John Peers in 1974. LOMAC developed the ADAM, a minicomputer which ran a specialized compiler for the company's natural English programming language. Throughout the late 1970s, the company acquired several technology firms, including Byte, Inc., the owner of the Byte Shop retail chain. Despite its unique approach to computing and earning $5 million in revenue in 1977, LOMAC struggled as the industry began to standardize around the IBM Personal Computer (IBM PC). Following Peers's departure in 1980, the company rebranded as Logical Business Machines, Inc. (LBM, or simply Logical), and attempted to pivot toward IBM PC–compatible hardware. However, financial difficulties led to the company filing for Chapter 11 bankruptcy in 1984. After emerging from bankruptcy in 1985 with new investment, Logical ceased hardware manufacturing to focus exclusively on software development and value-added reselling. == History == John Peers (born 1942) founded Logical Machine Corporation as John Peers and Company in September 1974. The company originally occupied a 4,500-square-foot office in Burlingame, California. The company was Peers' fourth; he had recently sold off Allied Business Systems of London to Trafalgar House in 1974. Peers sought to set up manufacturing in an agricultural zone in Ukiah, California. Following a delay, caused in part by concerned residents, a 30,000-square-foot plant was raised in Burke Hill, three miles south of Ukiah. The Ukiah plant was built to mass manufacture the company's ADAM minicomputer. The ADAM computer ran a specialized compiler for the company's natural English programming language; that is to say, the programming language attempted to closely emulate English syntax. Prototypes of the ADAM were built in May 1974, based on specifications devised in October 1973. Peers had yet to patent the technology as of June 1975. The ADAM's central processing unit was bolted onto an 7-by-6-foot L-shaped desk, on which rested its terminal. Twenty units of the ADAM were installed between April 1975 and February 1976, out of a backlog of orders for 3,500 from 500 clients, manufactured out of the company's Burlingame headquarters. It cost US$40,000. A controversial print advertisement featuring a naked woman seated at an ADAM terminal—as a pastiche of Adam and Eve—was recalled in early 1976 as a result of outcry from the National Organization for Women. The company changed its name to Logical Machine Corporation (LOMAC) in October 1976 and moved its headquarters to a 26,000-square-foot building in Sunnyvale, California, in anticipation of a ramping up of orders for the ADAM. The company originally occupied half of the building; they later purchased the other half from the tenant in July 1977 to double its manufacturing output. For fiscal year 1977, the company earned $5 million in revenue. In December 1977, LOMAC acquired Byte, Inc.—the proprietor of The Byte Shop, the first computer retail chain—from Paul Terrell and Boyd Wilson for an unspecified amount. The Byte Shop had 65 locations in the San Francisco Bay Area in 1978; it catered mainly to hobbyists with low cost microcomputer kits, in contrast to the high cost of LOMAC's ADAM. By July 1978, however, LOMAC were able to reduce the price of the ADAM down to $15,000. The company by that point had shipped their 50th ADAM and expanded to 14 countries. Also in 1978, LOMAC acquired Mass Memory—a high-tech optical storage company based in Phoenix, Arizona, whose products had storage capacities on the order gigabytes and terabytes—and Centigram, makers of the Mike—a computer with speech recognition. Later that year, the company introduced Tina, a low-cost version of the ADAM. LOMAC suffered losses that year and appointed Jerry Brandt to the board of directions, naming him chief operating officer, in August 1978. Brandt had Logical absorb Mass Memory and Centigram into the parent operations, shutting down their respective plants in the process, converted 10 Byte Shops to franchises and opened 25 more franchised Byte locations, and stopped direct sales of LOMAC's business computer products. By the beginning of 1979, LOMAC was profitable once more, and Brandt was let go from LOMAC. Peers left LOMAC in 1980, following a slump in the company's sales. He became an executive director of the United States Robotics Society, a consortium for industrial automation companies, that year. Following Peers' departure, LOMAC changed its name to Logical Business Machines, adopting the name of its European subsidiary. In 1983, the company announced a 16-bit clone of the IBM PC, called the Logical L-XT, which featured a 10-MB hard drive, 320-KB floppy drive and 192 KB of RAM, and a real-time clock, and came shipped with various software (including MS-DOS, a word processor, and a spreadsheet application) and an amber CRT monitor. The following year, the company introduced L-NET, a local area network system based on the L-XT that could link up to 64 computers. L-NET came shipped with a natural programming language, Diplomat—a descendant of the programming language used on the ADAM. In June 1983, Logical sued Coleco Industries over trademark infringement with the latter's to-be-released Adam microcomputer. Logical cited confusion from their existing ADAM customer base caused by the announcement of the Coleco Adam as the basis for the suit. Coleco challenged Logical in the press, writing that Logical's rights to the Adam trademark for use in computers had lapsed earlier in the year. The two settled out of court, with Coleco agreeing to license the Adam name from Logical in exchange for unlimited rights to the Adam trademark. Logical halted development of the L-XT when they filed for Chapter 11 bankruptcy in July 1984. The company had been $4 million in debt. They emerged from bankruptcy in September 1985, after being infused with $2 million from Carat Ltd. The latter immediately received a little less than 50 percent ownership in Logical—this stake set to grow to over 50 percent over the next six months. As part of the terms of exiting bankruptcy, Logical stopped manufacturing hardware and strictly became a software development company and value-added reseller of computer systems.
Natarajan dimension
In the theory of Probably Approximately Correct Machine Learning, the Natarajan dimension characterizes the complexity of learning a set of functions, generalizing from the Vapnik–Chervonenkis dimension for boolean functions to multi-class functions. Originally introduced as the Generalized Dimension by Natarajan, it was subsequently renamed the Natarajan Dimension by Haussler and Long. == Definition == Let H {\displaystyle H} be a set of functions from a set X {\displaystyle X} to a set Y {\displaystyle Y} . H {\displaystyle H} shatters a set C ⊂ X {\displaystyle C\subset X} if there exist two functions f 0 , f 1 ∈ H {\displaystyle f_{0},f_{1}\in H} such that For every x ∈ C , f 0 ( x ) ≠ f 1 ( x ) {\displaystyle x\in C,f_{0}(x)\neq f_{1}(x)} . For every B ⊂ C {\displaystyle B\subset C} , there exists a function h ∈ H {\displaystyle h\in H} such that for all x ∈ B , h ( x ) = f 0 ( x ) {\displaystyle x\in B,h(x)=f_{0}(x)} and for all x ∈ C − B , h ( x ) = f 1 ( x ) {\displaystyle x\in C-B,h(x)=f_{1}(x)} . The Natarajan dimension of H is the maximal cardinality of a set shattered by H {\displaystyle H} . It is easy to see that if | Y | = 2 {\displaystyle |Y|=2} , the Natarajan dimension collapses to the Vapnik–Chervonenkis dimension. Shalev-Shwartz and Ben-David present comprehensive material on multi-class learning and the Natarajan dimension, including uniform convergence and learnability. Recently, Cohen et al showed that the Natarajan dimension is the dominant term governing agnostic multi-class PAC learnability.