Mivar-based approach

The Mivar-based approach is a mathematical tool for designing artificial intelligence (AI) systems. Mivar (Multidimensional Informational Variable Adaptive Reality) was developed by combining productions and Petri nets. The approach was developed for semantic analysis and adequate representation of humanitarian epistemological and axiological principles in the process of developing artificial intelligence. It draws on computer science, informatics, discrete mathematics, databases, expert systems, graph theory, matrices and inference systems. The Mivar-based approach involves two technologies: information accumulation and data processing. Information accumulation is a method of creating global evolutionary data-and-rules bases with variable structure. It works on the basis of an adaptive, discrete, mivar-oriented information space with a unified representation of data and rules, based on three main concepts: “object, property, relation”. Information accumulation is designed to store any information with a possibly evolving structure and without limitations on the amount of information or the forms of its presentation. Data processing is a method of creating a logical inference system, or of automatically constructing algorithms from modules, services or procedures, on the basis of a trained mivar network of rules with linear computational complexity. Mivar data processing includes logical inference, computational procedures and services. Mivar networks make it possible to develop cause-effect dependencies (“if-then”) and to create an automated, trained, logical reasoning system. Representatives of the Russian Association for Artificial Intelligence (RAAI), for example V. I. Gorodecki, doctor of technical science, professor at SPIIRAS, and V. N. Vagin, doctor of technical science, professor at MPEI, declared that the term is incorrect and suggested that the author should use standard terminology. == History == While working in the Russian Ministry of Defense, O. O. 
Varlamov started developing the theory of “rapid logical inference” in 1985. He was analyzing Petri nets and productions to construct algorithms. Generally, mivar-based theory represents an attempt to combine entity-relationship models and their problem instance – semantic networks and Petri networks. The abbreviation MIVAR was introduced as a technical term by O. O. Varlamov, Doctor of Technical Science, professor at Bauman MSTU, in 1993 to designate a “semantic unit” in the process of mathematical modeling. The term has been established and used in all of his further works. The first experimental systems operating according to mivar-based principles were developed in 2000. Applied mivar systems were introduced in 2015. == Mivar == Mivar is the smallest structural element of discrete information space. == Object-property-relation == Object-Property-Relation (VSO) is a graph, the nodes of which are concepts and the arcs of which are connections between concepts. Mivar space represents a set of axes, a set of elements, a set of points of space and a set of values of points. $A=\{a_{n}\},\ n=1,\ldots ,N,$ where $A$ is the set of mivar space axis names and $N$ is the number of mivar space axes. Then $\forall a_{n}\ \exists F_{n}=\{f_{n i_{n}}\},\ n=1,\ldots ,N,\ i_{n}=1,\ldots ,I_{n},$ where $F_{n}$ is the set of elements of axis $a_{n}$; $i_{n}$ is the identifier of an element of the set $F_{n}$; and $I_{n}=|F_{n}|$. The sets $F_{n}$ form a multidimensional space: $M=F_{1}\times F_{2}\times \cdots \times F_{N}.$ A point of this space is $m=(i_{1},i_{2},\ldots ,i_{N}),$ where $m\in M$ and $(i_{1},i_{2},\ldots ,i_{N})$ are the coordinates of the point $m$. There is a set of values of the points of the multidimensional space $M$: $C_{M}=\{c_{i_{1},i_{2},\ldots ,i_{N}}\mid i_{1}=1,\ldots ,I_{1},\ i_{2}=1,\ldots ,I_{2},\ \ldots ,\ i_{N}=1,\ldots ,I_{N}\},$ where $c_{i_{1},i_{2},\ldots ,i_{N}}$ is the value of the point of the multidimensional space $M$ with coordinates $(i_{1},i_{2},\ldots ,i_{N})$. For every point of the space $M$ there is either a single value from the set $C_{M}$ or no value at all. Thus, $C_{M}$ is a set of data model state changes represented in multidimensional space. To implement a transition between the multidimensional space and the set of point values, the relation $\mu$ has been introduced: $C_{x}=\mu (M_{x}),$ where $M_{x}\subseteq M$ and $M_{x}=F_{1x}\times F_{2x}\times \cdots \times F_{Nx}.$ To describe a data model in mivar information space it is necessary to identify three axes: the axis of relations $O$; the axis of attributes (properties) $S$; and the axis of elements (objects) of the subject domain $V$. These sets are independent. 
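The definitions above can be made concrete with a small sketch. It is only an illustration, assuming a toy subject domain: the axis contents, the stored values and the helper name `mu` are hypothetical and are not taken from any mivar implementation. It builds $M$ as a Cartesian product of index sets, stores $C_{M}$ sparsely so that a point may have no value, and applies the relation $\mu$ to a subspace $M_{x}$.

```python
from itertools import product

# Hypothetical axes of a tiny mivar space, ordered as the tuple <V, S, O>.
axes = {
    "V": ["triangle", "square"],        # objects
    "S": ["side", "area"],              # properties
    "O": ["has_value", "computed_by"],  # relations
}

# M = F_1 x F_2 x ... x F_N: each point is a tuple of 1-based element indices.
sizes = [len(elements) for elements in axes.values()]
M = list(product(*(range(1, size + 1) for size in sizes)))

# C_M: sparse mapping from points to values; a missing key means "no value".
C_M = {
    (1, 1, 1): 3.0,          # triangle / side / has_value
    (2, 1, 1): 2.0,          # square / side / has_value
    (2, 2, 2): "side ** 2",  # square / area / computed_by
}


def mu(M_x):
    """Return C_x, the values defined for the points of a subspace M_x of M."""
    return {m: C_M[m] for m in M_x if m in C_M}


# Subspace M_x = {2} x F_2 x F_3, i.e. every point that concerns the square.
M_x = [m for m in M if m[0] == 2]
C_x = mu(M_x)
```

A dictionary is used here because most points of $M$ carry no value; a dense array would work equally well for small spaces.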
The mivar space can be represented by the following tuple: $\langle V,S,O\rangle$. Thus, mivar is described by the $VSO$ formula, in which $V$ denotes an object or a thing, $S$ denotes properties, and $O$ denotes the variety of relations with other objects of a particular subject domain. The category “Relations” can describe dependencies of any complexity level: formulae, logical transitions, text expressions, functions, services, computational procedures and even neural networks. This wide range of capabilities complicates the description of modeling interconnections, but makes it possible to take all factors into consideration. Mivar computations use mathematical logic. In a simplified form they can be represented as an implication in the form of an “if…, then…” formula. The result of mivar modeling can be represented in the form of a bipartite graph binding two sets of objects: source objects and resultant objects. == Mivar network == A mivar network is a method for representing objects of the subject domain and their processing rules in the form of a bipartite directed graph consisting of objects and rules. A mivar network is a bipartite graph that can be described in the form of a two-dimensional matrix that records information about the subject domain of the current task. Generally, mivar networks provide formalization and representation of human knowledge in the form of a connected multidimensional space. That is, a mivar network is a method of representing a piece of mivar space information in the form of a bipartite, directed graph. The mivar space information is formed by objects and connections, which together represent the data model of the subject domain. Connections include rules for processing objects. Thus, a mivar network of a subject domain is the part of the mivar space knowledge that concerns that domain. The graph can consist of objects-variables and rules-procedures. 
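A minimal sketch of such a network follows, assuming a toy right-triangle domain. The variable names, the two rules, the cell symbols (`x` for an input, `y` for an output, `0` for no link) and the `infer` helper are all illustrative assumptions; the loop is a plain forward-chaining pass and is not claimed to be the published mivar inference algorithm or to match its complexity.

```python
# Hypothetical mivar network: variables are objects, rules are procedures
# with input variables and output variables.
variables = ["a", "b", "c", "area"]
rules = {
    "hypotenuse": {
        "inputs": ["a", "b"],
        "outputs": ["c"],
        "run": lambda v: {"c": (v["a"] ** 2 + v["b"] ** 2) ** 0.5},
    },
    "area": {
        "inputs": ["a", "b"],
        "outputs": ["area"],
        "run": lambda v: {"area": v["a"] * v["b"] / 2},
    },
}

# Two-dimensional matrix of the bipartite graph: one row per rule,
# one column per variable.
matrix = [
    ["x" if var in rule["inputs"] else "y" if var in rule["outputs"] else "0"
     for var in variables]
    for rule in rules.values()
]


def infer(known, goal):
    """Fire rules whose inputs are all known until the goal is derived."""
    known = dict(known)
    fired = []
    progress = True
    while goal not in known and progress:
        progress = False
        for name, rule in rules.items():
            if name in fired:
                continue
            if all(var in known for var in rule["inputs"]):
                known.update(rule["run"](known))  # "if inputs, then outputs"
                fired.append(name)
                progress = True
                if goal in known:
                    break
    return known, fired


values, path = infer({"a": 3, "b": 4}, goal="c")
```

The list `path` records which rules fired, which corresponds to the inference path through the rule nodes of the graph.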
First, two lists are made that form two nonintersecting partitions: the list of objects and the list of rules. Objects are denoted by circles. Each rule in a mivar network is an extension of productions, hyper-rules with multi-activators or computational procedures. It is proved that from the perspective of further processing, these formalisms are identical and in fact are nodes of the bipartite graph, denoted by rectangles. === Multi-dimensional binary matrices === Mivar networks can be implemented on single computing systems or service-oriented architectures. Certain constraints restrict their application, in particular the dimension of the matrix used by the linear matrix method for determining the logical inference path on adaptive rule networks. The matrix dimension constraint arises because the implementation requires sending a general matrix to multiple processors. Since every matrix value is initially represented in symbol form, the amount of sent data becomes critical when handling, for example, 10000 rules/variables. The classical mivar-based method requires storing one of three values in each matrix cell: 0 – no value; x – input variable for the rule; y – output variable for the rule. The analysis of whether a rule can fire is separated from determining the output variables at the stages after the rule fires. Consequently, it is possible to use different matrices for “search for fired rules” and “setting values for output variables”. This allows the use of multidimensional binary m

Ciscogate

Ciscogate, also known as the Black Hat Bug, is the name given to a legal incident that occurred at the Black Hat Briefings security conference in Las Vegas, Nevada, on July 27, 2005. On the morning of the first day of the conference, July 26, 2005, some attendees noticed that 30 pages of text had been physically ripped out of the extensive conference presentation booklet the night before at the request of Cisco Systems and the CD-ROM with presentation slides was not included. It was determined the pages covered a talk to be given by Michael Lynn, a security researcher with Atlanta-based IBM Internet Security Systems (ISS). Instead of the pages with the details, attendees found a photographed copy of a notice from Black Hat saying "Due to some last minute changes beyond Black Hat's control, and at the request of the presenter, the included materials aren't up to the standards Black Hat tries to meet. Black Hat will be the first to apologize. We hope the vendors involved will follow suit." According to Lynn's lawyer, his employer had approved of the talk leading up to the conference but changed its mind two days before the scheduled talk, forbidding him from presenting. Lynn's original presentation was to cover a vulnerability in Cisco routers. The presentation was one of four scheduled to follow Jeff Moss' keynote address on the first day of the conference, titled "Cisco IOS Security Architecture". After being told by his employer that he could not present on the topic, Lynn chose an alternate topic. Cisco and ISS had offered to give a new joint presentation, but this was turned down by Black Hat because the original speaking slot was given to Lynn, not Cisco. Lynn's presentation began by covering security issues in services that allow users to make Voice over IP telephone calls. 
Shortly after beginning the presentation, Lynn changed back to his original topic and began disclosing some technical details of the vulnerability he found in Cisco routers, stating that he would rather resign from his job at ISS than keep the details private. == Lawsuit == Shortly after Lynn concluded his talk he met Jennifer Granick, who would soon become his lawyer. During their initial meeting Lynn told Granick that he expected to be sued. Later in the evening Lynn heard that Cisco and ISS had filed a lawsuit and requested a temporary restraining order against Black Hat but not himself. A public relations representative from Black Hat told Granick that the lawsuit was against both Black Hat and Lynn and that the companies had scheduled an ex parte hearing in San Francisco the next morning to request the restraining order. That night, Andrew Valentine, an attorney for ISS and Cisco, called Lynn, who directed him to Granick. During the conversation Valentine explained the claims and accusations against Lynn, which included three things: 1) ISS claimed copyright over the presentation that Lynn gave, 2) Cisco claimed copyright over the decompiled machine code obtained from the router which was included in the presentation, and 3) Cisco claimed the presentation contained trade secrets. These claims were outlined in a civil complaint filed in the U.S. Northern District of California against both Lynn and Black Hat. According to Granick, she and Valentine were able to agree to an injunction to settle the case without court proceedings. This deal was almost called off due to an inadvertent mistake by Black Hat, which had restored Lynn's presentation on its web server. Black Hat, Granick, and the plaintiffs' lawyers were able to resolve this problem and the deal stood. 
One condition of the settlement required Lynn to provide an image of all computer data he used in his research to a third party for forensic analysis before erasing his research and any Cisco data from his systems. The settlement also stipulated that Lynn was prohibited from talking about the vulnerability in the future. == FBI Investigation == Shortly after lawyers for Lynn and ISS / Cisco filed settlement papers, FBI agents from the Las Vegas office arrived at the conference to begin asking questions. According to Granick, they were there at the request of the Atlanta FBI office and Lynn was not of interest. Granick asserted the Fifth and Sixth Amendment rights on behalf of her client, Lynn. Granick asserted his rights for the Atlanta office and asked if an arrest warrant had been issued for Lynn. Over the next 24 hours Granick was not able to ascertain the status of a warrant but ultimately determined no warrant was issued. When the FBI was asked about the case by a journalist, spokesman Paul Bresson declined to discuss the case saying "Our policy is to not make any comment on anything that is ongoing. That's not to confirm that something is, because I really don't know". Granick would only confirm to journalists that the "investigation has to do with the presentation". == Response == === Attendees === Attendees of Black Hat Briefings, as well as many who also attended DEF CON, were not happy with vendors threatening legal action over vulnerability disclosure. The term "Ciscogate" was coined quickly by an unknown person, and some attendees were quick to create shirts to commemorate the incident. === Cisco === Mojgan Khalili, a senior manager for corporate PR at Cisco, issued a statement to the press saying "It is important to note that the information Mr. Lynn presented was not a disclosure of a new vulnerability or a flaw with Cisco IOS software. Mr. 
Lynn's research explores possible ways to expand exploitations of existing security vulnerabilities impacting routers." === ISS === Kim Duffy, managing director of ISS Australia, was asked about ISS's response to the incident. Duffy responded that it was "business as usual" as the company handled the incident "strictly by the book". He gave a brief statement to ZDNet UK saying "ISS has published rules for disclosure and that is what we stick to. We didn't care to publish [the disclosure] because we were not ready. We had not completed the research to our satisfaction so it was not ready to be disclosed". ISS spokesperson Roger Fortier confirmed that Lynn was no longer employed with the company and that ISS was still working with Cisco on the matter. He gave a statement to the Washington Post saying "ISS and Cisco have been working on this in the background and didn't feel at this time that the material was ready for publication. The decision was made on Monday to pull the presentation because we wanted to make sure the research was fully baked."

Modes of variation

In statistics, modes of variation are a continuously indexed set of vectors or functions that are centered at a mean and are used to depict the variation in a population or sample. Typically, variation patterns in the data can be decomposed in descending order of eigenvalues with the directions represented by the corresponding eigenvectors or eigenfunctions. Modes of variation provide a visualization of this decomposition and an efficient description of variation around the mean. Both in principal component analysis (PCA) and in functional principal component analysis (FPCA), modes of variation play an important role in visualizing and describing the variation in the data contributed by each eigencomponent. In real-world applications, the eigencomponents and associated modes of variation help to interpret complex data, especially in exploratory data analysis (EDA). == Formulation == Modes of variation are a natural extension of PCA and FPCA. === Modes of variation in PCA === If a random vector $\mathbf{X}=(X_{1},X_{2},\cdots ,X_{p})^{T}$ has the mean vector $\boldsymbol{\mu}_{p}$ and the covariance matrix $\mathbf{\Sigma}_{p\times p}$ with eigenvalues $\lambda_{1}\geq \lambda_{2}\geq \cdots \geq \lambda_{p}\geq 0$ and corresponding orthonormal eigenvectors $\mathbf{e}_{1},\mathbf{e}_{2},\cdots ,\mathbf{e}_{p}$, then by eigendecomposition of a real symmetric matrix, the covariance matrix $\mathbf{\Sigma}$ can be decomposed as $\mathbf{\Sigma}=\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{T},$ where $\mathbf{Q}$ is an orthogonal matrix whose columns are the eigenvectors of $\mathbf{\Sigma}$, and $\mathbf{\Lambda}$ is a diagonal matrix whose entries are the eigenvalues of $\mathbf{\Sigma}$. By the Karhunen–Loève expansion for random vectors, one can express the centered random vector in the eigenbasis $\mathbf{X}-\boldsymbol{\mu}=\sum_{k=1}^{p}\xi_{k}\mathbf{e}_{k},$ where $\xi_{k}=\mathbf{e}_{k}^{T}(\mathbf{X}-\boldsymbol{\mu})$ is the principal component associated with the $k$-th eigenvector $\mathbf{e}_{k}$, with the properties $\operatorname{E}(\xi_{k})=0$, $\operatorname{Var}(\xi_{k})=\lambda_{k}$, and $\operatorname{E}(\xi_{k}\xi_{l})=0$ for $l\neq k$. Then the $k$-th mode of variation of $\mathbf{X}$ is the set of vectors, indexed by $\alpha$, $\mathbf{m}_{k,\alpha}=\boldsymbol{\mu}\pm \alpha \sqrt{\lambda_{k}}\,\mathbf{e}_{k},\ \alpha \in [-A,A],$ where $A$ is typically selected as 2 or 3. 
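The PCA formulation can be sketched numerically as follows. This is a minimal illustration, assuming the mean vector and covariance matrix are already known; the function name `modes_of_variation` and the example values are hypothetical. Because $\alpha$ already ranges over $[-A,A]$, the code evaluates $\boldsymbol{\mu}+\alpha \sqrt{\lambda_{k}}\,\mathbf{e}_{k}$ for signed $\alpha$, which covers both signs of the formula.

```python
import numpy as np


def modes_of_variation(mu, sigma, k, alphas):
    """Return mu + alpha * sqrt(lambda_k) * e_k for each alpha (k is 1-based)."""
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    lam = max(eigvals[order][k - 1], 0.0)      # guard against tiny negative values
    e_k = eigvecs[:, order][:, k - 1]
    return np.array([mu + a * np.sqrt(lam) * e_k for a in alphas])


# Illustrative population parameters (assumed, not taken from any data set).
mu = np.array([1.0, 2.0])
sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])

A = 2
alphas = np.linspace(-A, A, 5)
first_mode = modes_of_variation(mu, sigma, k=1, alphas=alphas)
```

The sign of an eigenvector is arbitrary, so the set of vectors is well defined only as a whole; individual rows may swap ends of the range between implementations.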
=== Modes of variation in FPCA === For a square-integrable random function $X(t),\ t\in \mathcal{T}\subset \mathbb{R}^{p}$, where typically $p=1$ and $\mathcal{T}$ is an interval, denote the mean function by $\mu (t)=\operatorname{E}(X(t))$, and the covariance function by $G(s,t)=\operatorname{Cov}(X(s),X(t))=\sum_{k=1}^{\infty}\lambda_{k}\varphi_{k}(s)\varphi_{k}(t),$ where $\lambda_{1}\geq \lambda_{2}\geq \cdots \geq 0$ are the eigenvalues and $\{\varphi_{1},\varphi_{2},\cdots \}$ are the orthonormal eigenfunctions of the linear Hilbert–Schmidt operator $G:L^{2}(\mathcal{T})\rightarrow L^{2}(\mathcal{T}),\ G(f)=\int_{\mathcal{T}}G(s,t)f(s)\,ds.$ By the Karhunen–Loève theorem, one can express the centered function in the eigenbasis, $X(t)-\mu (t)=\sum_{k=1}^{\infty}\xi_{k}\varphi_{k}(t),$ where $\xi_{k}=\int_{\mathcal{T}}(X(t)-\mu (t))\varphi_{k}(t)\,dt$ is the $k$-th principal component with the properties $\operatorname{E}(\xi_{k})=0$, $\operatorname{Var}(\xi_{k})=\lambda_{k}$, and $\operatorname{E}(\xi_{k}\xi_{l})=0$ for $l\neq k$. Then the $k$-th mode of variation of $X(t)$ is the set of functions, indexed by $\alpha$, $m_{k,\alpha}(t)=\mu (t)\pm \alpha \sqrt{\lambda_{k}}\,\varphi_{k}(t),\ t\in \mathcal{T},\ \alpha \in [-A,A],$ that are viewed simultaneously over the range of $\alpha$, usually for $A=2$ or $3$. == Estimation == The formulation above is derived from properties of the population. Estimation is needed in real-world applications. The key idea is to estimate the mean and the covariance. === Modes of variation in PCA === Suppose the data $\mathbf{x}_{1},\mathbf{x}_{2},\cdots ,\mathbf{x}_{n}$ represent $n$ independent draws from some $p$-dimensional population $\mathbf{X}$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\mathbf{\Sigma}$. These data yield the sample mean vector $\overline{\mathbf{x}}$ and the sample covariance matrix $\mathbf{S}$ with eigenvalue-eigenvector pairs $(\hat{\lambda}_{1},\hat{\mathbf{e}}_{1}),(\hat{\lambda}_{2},\hat{\mathbf{e}}_{2}),\cdots ,(\hat{\lambda}_{p},\hat{\mathbf{e}}_{p})$. Then the $k$-th mode of variation of $\mathbf{X}$ can be estimated by $\hat{\mathbf{m}}_{k,\alpha}=\overline{\mathbf{x}}\pm \alpha \sqrt{\hat{\lambda}_{k}}\,\hat{\mathbf{e}}_{k},\ \alpha \in [-A,A].$ === Modes of variation in FPCA === Consider $n$ realizations $X_{1}(t),X_{2}(t),\cdots ,X_{n}(t)$ of a square-integrable random function $X(t),\ t\in \mathcal{T}$, with the mean function $\mu (t)=\operatorname{E}(X(t))$ and the covariance function $G(s,t)=\operatorname{Cov}(X(s),X(t))$. Functional principal component analysis provides detailed methods for the estimation of $\mu (t)$ and $G(s,t)$, often involving pointwise estimation and interpolation. Substituting estimates for the unknown quantities, the $k$-th mode of variation of $X(t)$ can be estimated by $\hat{m}_{k,\alpha}(t)=\hat{\mu}(t)\pm \alpha \sqrt{\hat{\lambda}_{k}}\,\hat{\varphi}_{k}(t),\ t\in \mathcal{T},\ \alpha \in [-A,A].$ == Applications == Modes of variation are useful to visualize and describe the variation patterns in the data sorted by the eigenvalues. In real-world applications, modes of variation associated with eigencomponents allow one to interpret complex data, such as the evolution of function traits and other infinite-dimensional data. To illustrate how modes of variation work in practice, two examples are shown in the graphs to the right, which display the first two modes of variation. The solid curve represents the sample mean function. The dashed, dot-dashed, and dotted curves correspond to modes of variation with $\alpha =\pm 1,\pm 2,$ and $\pm 3$, respectively. 
The first graph displays the first two modes of variation of female mortality data from 41 countries in 2003. The object of interest is the log hazard function between ages 0 and 100 years. The first mode of variation suggests that the variation of female mortality is smaller for ages around 0 or 100, and larger for ages around 25. An appropriate and intuitive interpretation is that mortality around 25 is driven by accidental death, while around 0 or 100, mortality is related to congenital disease or natural death. Compared to female mortality

Boosting (machine learning)

In machine learning (ML), boosting is an ensemble learning method that combines a set of less accurate models (called "weak learners") to create a single, highly accurate model (a "strong learner"). Unlike other ensemble methods that build models in parallel (such as bagging), boosting algorithms build models sequentially. Each new model in the sequence is trained to correct the errors made by its predecessors. This iterative process allows the overall model to improve its accuracy, particularly by reducing bias. Boosting is a popular and effective technique used in supervised learning for both classification and regression tasks. The theoretical foundation for boosting came from a question posed by Kearns and Valiant (1988, 1989): "Can a set of weak learners create a single strong learner?" A weak learner is defined as a classifier that performs only slightly better than random guessing, whereas a strong learner is a classifier that is highly correlated with the true classification. Robert Schapire's affirmative answer to this question in a 1990 paper led to the development of practical boosting algorithms. The first such algorithm was developed by Schapire, with Freund and Schapire later developing AdaBoost, which remains a foundational example of boosting. == Algorithms == While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are weighted in a way that is related to the weak learners' accuracy. After a weak learner is added, the data weights are readjusted, known as "re-weighting". Misclassified input data gain a higher weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified. There are many boosting algorithms. 
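The re-weighting loop described above can be sketched with AdaBoost and one-feature threshold classifiers ("decision stumps"). This is a minimal illustration under stated assumptions: the helper names, the exhaustive stump search and the toy data are hypothetical choices, labels are assumed to be in {-1, +1}, and the weight formula is the standard AdaBoost one rather than something specified in the text above.

```python
import numpy as np


def stump_predict(X, feature, threshold, polarity):
    """Predict +1 where polarity * (x - threshold) >= 0, else -1."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)


def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost(X, y, rounds=10):
    """Return a list of (alpha, feature, threshold, polarity) weak learners."""
    w = np.full(len(y), 1.0 / len(y))          # start with uniform data weights
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, polarity = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # more accurate learners weigh more
        pred = stump_predict(X, feature, threshold, polarity)
        w = w * np.exp(-alpha * y * pred)      # misclassified examples gain weight
        w /= w.sum()
        ensemble.append((alpha, feature, threshold, polarity))
    return ensemble


def predict(ensemble, X):
    """Sign of the weighted vote of all weak learners."""
    score = sum(a * stump_predict(X, f, t, p) for a, f, t, p in ensemble)
    return np.where(score >= 0, 1, -1)


# Toy data that no single stump can separate: +1 / -1 / +1 along one axis.
X = np.arange(9, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1])
ensemble = adaboost(X, y, rounds=40)
train_accuracy = (predict(ensemble, X) == y).mean()
```

Each stump alone misclassifies at least a third of this data, so any gain in accuracy comes from the weighted combination.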
The original ones, proposed by Robert Schapire (a recursive majority gate formulation), and Yoav Freund (boost by majority), were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize. Only algorithms that are provable boosting algorithms in the probably approximately correct learning formulation can accurately be called boosting algorithms. Other algorithms that are similar in spirit to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms. The main variation between many boosting algorithms is their method of weighting training data points and hypotheses. AdaBoost is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners. It is often the basis of introductory coverage of boosting in university machine learning courses. There are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, xgboost, MadaBoost, LogitBoost, CatBoost and others. Many boosting algorithms fit into the AnyBoost framework, which shows that boosting performs gradient descent in a function space using a convex cost function. == Object categorization in computer vision == Given images containing various known objects in the world, a classifier can be learned from them to automatically classify the objects in future images. Simple classifiers built based on some image feature of the object tend to be weak in categorization performance. Using boosting methods for object categorization is a way to unify the weak classifiers in a special way to boost the overall ability of categorization. === Problem of object categorization === Object categorization is a typical task of computer vision that involves determining whether or not an image contains some specific category of object. 
The idea is closely related to recognition, identification, and detection. Appearance-based object categorization typically consists of feature extraction, learning a classifier, and applying the classifier to new examples. There are many ways to represent a category of objects, e.g. from shape analysis, bag of words models, or local descriptors such as SIFT. Examples of supervised classifiers are Naive Bayes classifiers, support vector machines, mixtures of Gaussians, and neural networks. However, research has shown that object categories and their locations in images can be discovered in an unsupervised manner as well. === Status quo for object categorization === The recognition of object categories in images is a challenging problem in computer vision, especially when the number of categories is large. This is due to high intra-class variability and the need for generalization across variations of objects within the same category. Objects within one category may look quite different. Even the same object may look different under different viewpoints, scales, and illumination. Background clutter and partial occlusion add difficulties to recognition as well. Humans are able to recognize thousands of object types, whereas most of the existing object recognition systems are trained to recognize only a few, e.g. human faces, cars, or simple objects. Research has been very active on dealing with more categories and enabling incremental additions of new categories, and although the general problem remains unsolved, several multi-category object detectors (for up to hundreds or thousands of categories) have been developed. One means is by feature sharing and boosting. === Boosting for binary categorization === AdaBoost can be used for face detection as an example of binary categorization. The two categories are faces versus background. 
The general algorithm is as follows:
1. Form a large set of simple features.
2. Initialize weights for training images.
3. For T rounds:
   a. Normalize the weights.
   b. For available features from the set, train a classifier using a single feature and evaluate the training error.
   c. Choose the classifier with the lowest error.
   d. Update the weights of the training images: increase if classified wrongly by this classifier, decrease if correctly.
4. Form the final strong classifier as the linear combination of the T classifiers (coefficient larger if training error is small).
After boosting, a classifier constructed from 200 features could yield a 95% detection rate under a $10^{-5}$ false positive rate. Another application of boosting for binary categorization is a system that detects pedestrians using patterns of motion and appearance. This work is the first to combine both motion information and appearance information as features to detect a walking person. It takes a similar approach to the Viola-Jones object detection framework. === Boosting for multi-class categorization === Compared with binary categorization, multi-class categorization looks for common features that can be shared across the categories at the same time. These turn out to be more generic, edge-like features. During learning, the detectors for each category can be trained jointly. Compared with training separately, joint training generalizes better, needs less training data, and requires fewer features to achieve the same performance. The main flow of the algorithm is similar to the binary case. The difference is that a measure of the joint training error must be defined in advance. During each iteration the algorithm chooses a classifier of a single feature (features that can be shared by more categories are encouraged). 
This can be done by converting multi-class classification into a binary one (a set of categories versus the rest), or by introducing a penalty error from the categories that do not have the feature of the classifier. In the paper "Sharing visual features for multiclass and multiview object detection", A. Torralba et al. used GentleBoost for boosting and showed that when training data is limited, learning with shared features performs much better than learning without sharing, given the same number of boosting rounds. Also, for a given performance level, the total number of features required (and therefore the run-time cost of the classifier) for the feature-sharing detectors is observed to scale approximately logarithmically with the number of classes, i.e., slower than the linear growth of the non-sharing case. Similar results are shown in the paper "Incremental learning of object detectors using a visual shape alphabet", although its authors used AdaBoost for boosting.

== Convex vs. non-convex boosting algorithms ==

Boosting algorithms can be based on convex or non-convex optimization algorithms. Convex algorithms, such as AdaBoost and LogitBoost, can be "defeated" by random noise such that they can't learn basic and learnable combinations of weak hypotheses. This limitation was pointed out by Long & Servedio in 2008. However, by 2009, multiple authors demonstrated that boosting algorithms based on non-convex optimization, such as BrownBoost, can learn from nois

Non-negative matrix factorization

Non-negative matrix factorization (NMF or NNMF), also called non-negative matrix approximation, is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. NMF finds applications in such fields as astronomy, computer vision, document clustering, missing data imputation, chemometrics, audio signal processing, recommender systems, and bioinformatics.

== History ==

In chemometrics non-negative matrix factorization has a long history under the name "self modeling curve resolution". In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Early work on non-negative matrix factorizations was also performed by a Finnish group of researchers in the 1990s under the name positive matrix factorization. It became more widely known as non-negative matrix factorization after Lee and Seung investigated the properties of the algorithm and published some simple and useful algorithms for two types of factorizations.

== Background ==

Let matrix V be the product of the matrices W and H,

$$\mathbf{V} = \mathbf{W}\mathbf{H}.$$

Matrix multiplication can be implemented as computing the column vectors of V as linear combinations of the column vectors in W using coefficients supplied by columns of H. That is, each column of V can be computed as follows:

$$\mathbf{v}_i = \mathbf{W}\mathbf{h}_i,$$

where $\mathbf{v}_i$ is the i-th column vector of the product matrix V and $\mathbf{h}_i$ is the i-th column vector of the matrix H.
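The column-wise view of the product can be checked on a small example. The sketch below uses plain Python lists to stay dependency-free; the helper names and the particular 3 × 2 and 2 × 4 matrices are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def matmul(W, H):
    """Product of W (m x p) and H (p x n), both given as lists of rows."""
    return [[sum(W[i][k] * H[k][j] for k in range(len(H)))
             for j in range(len(H[0]))] for i in range(len(W))]

def matvec(W, h):
    """Product of matrix W with a column vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def column(M, i):
    """The i-th column of M as a flat list."""
    return [row[i] for row in M]

W = [[1, 0], [2, 1], [0, 3]]        # m x p = 3 x 2
H = [[1, 2, 0, 1], [0, 1, 4, 2]]    # p x n = 2 x 4
V = matmul(W, H)                    # m x n = 3 x 4

# Each column of V is W applied to the matching column of H.
columns_match = all(column(V, i) == matvec(W, column(H, i)) for i in range(4))
```

Here the inner dimension p = 2 is smaller than both m = 3 and n = 4, so W and H together hold 14 numbers while V holds 12; the saving only becomes significant when m and n are much larger than p.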
When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix and it is this property that forms the basis of NMF. NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix then p can be significantly less than both m and n. Here is an example based on a text-mining application: Let the input matrix (the matrix to be factored) be V with 10000 rows and 500 columns where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words. It follows that a column vector v in V represents a document. Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns. The product of W and H is a matrix with 10000 rows and 500 columns, the same shape as the input matrix V and, if the factorization worked, it is a reasonable approximation to the input matrix V. From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H. This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. NMF generates these features. It is useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words where each word's cell value defines the word's rank in the feature: The higher a word's cell value the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document with a cell value defining the document's rank for a feature. 
We can now reconstruct a document (column vector) from our input matrix by a linear combination of our features (column vectors in W) where each feature is weighted by the feature's cell value from the document's column in H.

== Clustering property ==

NMF has an inherent clustering property, i.e., it automatically clusters the columns of input data $\mathbf{V} = (v_1, \dots, v_n)$. More specifically, the approximation of $\mathbf{V}$ by $\mathbf{V} \simeq \mathbf{W}\mathbf{H}$ is achieved by finding $\mathbf{W}$ and $\mathbf{H}$ that minimize the error function (using the Frobenius norm)

$$\left\| \mathbf{V} - \mathbf{W}\mathbf{H} \right\|_F, \quad \text{subject to } \mathbf{W} \geq 0,\ \mathbf{H} \geq 0.$$

If we furthermore impose an orthogonality constraint on $\mathbf{H}$, i.e. $\mathbf{H}\mathbf{H}^T = I$, then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed $\mathbf{H}$ gives the cluster membership, i.e., if $\mathbf{H}_{kj} > \mathbf{H}_{ij}$ for all $i \neq k$, this suggests that the input data $v_j$ belongs to the $k$-th cluster. The computed $\mathbf{W}$ gives the cluster centroids, i.e., the $k$-th column gives the cluster centroid of the $k$-th cluster. This centroid's representation can be significantly enhanced by convex NMF. When the orthogonality constraint $\mathbf{H}\mathbf{H}^T = I$ is not explicitly imposed, the orthogonality holds to a large extent, and the clustering property holds too. When the error function to be used is Kullback–Leibler divergence, NMF is identical to probabilistic latent semantic analysis (PLSA), a popular document clustering method.
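The factorization and the cluster reading of H can be sketched in a few lines. The text above does not fix a particular algorithm, so this sketch assumes the multiplicative update rules commonly associated with Lee and Seung for the squared-error (Frobenius) objective. The function names, the random initialization, the iteration count, the small `eps` guard, and the toy 4 × 4 matrix are all illustrative assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
import random

def matmul(A, B):
    """Product of A (m x p) and B (p x n), both given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, p, iterations=200, seed=0, eps=1e-9):
    """Approximate non-negative V (m x n) by W (m x p) times H (p x n)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(p)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(p)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(p)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(p)]
             for i in range(m)]
    return W, H

def frobenius_error(V, W, H):
    """Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - a) ** 2 for rv, ra in zip(V, WH) for v, a in zip(rv, ra)) ** 0.5

def cluster_of(H, j):
    """Index k of the largest entry in column j of H (the cluster reading)."""
    return max(range(len(H)), key=lambda k: H[k][j])

# Toy word-by-document matrix: 4 words (rows), 4 documents (columns).
V = [[2, 2, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 1, 1]]
W, H = nmf(V, p=2)
clusters = [cluster_of(H, j) for j in range(len(V[0]))]
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout. The cluster indices in `clusters` depend on the random initialization, so only the grouping of columns, not the label values, should be interpreted.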
== Types ==

=== Approximate non-negative matrix factorization ===

Usually the number of columns of W and the number of rows of H in NMF are selected so the product WH will become an approximation to V. The full decomposition of V then amounts to the two non-negative matrices W and H as well as a residual U, such that V = WH + U. The elements of the residual matrix can be either negative or positive. When W and H are smaller than V they become easier to store and manipulate. Another reason for factorizing V into smaller matrices W and H is that if one's goal is to approximately represent the elements of V by significantly less data, then one has to infer some latent structure in the data.

=== Convex non-negative matrix factorization ===

In standard NMF, the matrix factor $\mathbf{W} \in \mathbb{R}_{+}^{m \times k}$, i.e., W can be anything in that space. Convex NMF restricts the columns of W to convex combinations of the input data vectors $(v_1, \dots, v_n)$. This greatly improves the quality of data representation of W. Furthermore, the resulting matrix factor H becomes more sparse and orthogonal.

=== Nonnegative rank factorization ===

In case the nonnegative rank of V is equal to its actual rank, V = WH is called a nonnegative rank factorization (NRF). The problem of finding the NRF of V, if it exists, is known to be NP-hard.

=== Different cost functions and regularizations ===

There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH and possibly by regularization of the W and/or H matrices. Two simple divergence functions studied by Lee and Seung are the squared error (or Frobenius norm) and an extension of the Kullback–Leibler divergence to positive matrices (the original Kullback–Leibler divergence is defined on probability distributions).
Each divergence leads to a different NMF algorithm, usually minimizing the divergence using iterative update rules. The factorization problem in the squared error version of NMF may be stated as: Given a matrix $\mathbf{V}$, find nonnegative matrices W and H that minimize the function

$$F(\mathbf{W}, \mathbf{H}) = \left\| \mathbf{V} - \mathbf{W}\mathbf{H} \right\|_F^2$$

Another type of NMF for images is based on the total variation norm. When L1 regularization (akin to Lasso) is added to NMF with the mean squared error cost function, the resulting problem may be called non-negative sparse coding due to the similarity to the sparse coding problem, although it may also still be referred to as NMF.

=== Online NMF ===

Many standard NMF algorithms analyze all the data together; i.e., the whole matrix is available from the start. This may be unsatisfactory in applications where there are too many data to fit into memory or where the data are provided in streaming fashion. One such use is for collaborative filtering in recommendation systems, where there may be many users and many items to recommend, and it would be inefficient to recalculate everything when one user or one item is added to the system. The cost function for o

Evntlive

Evntlive was an interactive digital concert venue that allowed music fans worldwide to stream concerts to their computer, tablet, or phone. Based in Redwood City, CA, EVNTLIVE Beta launched on April 15, 2013. EVNTLIVE provided users with the ability to switch camera angles, view All Access interviews and clips from artists, buy music, and chat with other online concert-goers through an in-app feature. Users could watch live and on-demand concerts, with both free and pay-per-view concerts offered. In its first two months, EVNTLIVE streamed live performances by popular artists ranging from Bon Jovi to Wale, as well as music festivals such as Taste of Country and Mountain Jam, including performances by The Lumineers, Gary Clark Jr., Phil Lesh & Friends, Primus, and more. On December 6, 2013, Evntlive was acquired and absorbed by Yahoo!. The site ceased operations and redirected viewers to Yahoo! Music and Yahoo! Screen shortly afterwards.

== About the Platform ==

EvntLive was an HTML5, web-based platform available on laptops, iPads, and mobile devices. Users had to register for a free account on Evntlive's website in order to reserve tickets and access live and on-demand content. Once they had reserved tickets, they could view All Access features from their favorite artists or bands, purchase music, and interact with other online audience members using Buzz. Users could also switch between alternate camera angles as though they were on the concert floor, sharing the experience with their friends online in real time.
EvntLive was acquired by Yahoo in December 2013.

== Artists ==

Bon Jovi, Wale, Escape the Fate, The Parlotones

=== Taste of Country Music Festival ===

Trace Adkins, Willie Nelson, Justin Moore, Montgomery Gentry, Craig Campbell, Blackberry Smoke, Gloriana, Dustin Lynch, LoCash Cowboys, Rachel Farley, Parmalee, Joe Nichols

=== Mountain Jam Music Festival ===

The Lumineers, Primus, Widespread Panic, Gov't Mule, Phil Lesh, The Avett Brothers, Dispatch, Rubblebucket, Michael Franti, Jackie Greene, Deer Tick, Gary Clark Jr., ALO, The London Souls, Nicki Bluhm, Amy Helm, The Lone Bellow, The Revivalists, Swear and Shake, Roadkill Ghost Choir, Michael Bernard Fitzgerald, Michele Clark's Sunset Sessions, Semi Precious Weapons, Dale Earnhardt Jr. Jr., DigiTour Media, Pentatonix, Allstar Weekend, Tyler Ward

=== Launch Music Festival ===

Vowpal Wabbit

Vowpal Wabbit (VW) is an open-source, fast, online, interactive machine learning system library and program developed originally at Yahoo! Research, and currently at Microsoft Research. It was started and is led by John Langford. Vowpal Wabbit's interactive learning support is particularly notable, including Contextual Bandits, Active Learning, and forms of guided Reinforcement Learning. Vowpal Wabbit provides an efficient, scalable, out-of-core implementation with support for a number of machine learning reductions, importance weighting, and a selection of different loss functions and optimization algorithms.

== Notable features ==

The VW program supports:

* Multiple supervised (and semi-supervised) learning problems:
** Classification (both binary and multi-class)
** Regression
** Active learning (partially labeled data) for both regression and classification
* Multiple learning algorithms (model-types / representations):
** OLS regression
** Matrix factorization (sparse matrix SVD)
** Single layer neural net (with user specified hidden layer node count)
** Searn (Search and Learn)
** Latent Dirichlet Allocation (LDA)
** Stagewise polynomial approximation
** Recommend top-K out of N
** One-against-all (OAA) and cost-sensitive OAA reduction for multi-class
** Weighted all pairs
** Contextual-bandit (with multiple exploration/exploitation strategies)
* Multiple loss functions:
** squared error
** quantile
** hinge
** logistic
** poisson
* Multiple optimization algorithms:
** Stochastic gradient descent (SGD)
** BFGS
** Conjugate gradient
* Regularization (L1 norm, L2 norm, & elastic net regularization)
* Flexible input - input features may be:
** Binary
** Numerical
** Categorical (via flexible feature-naming and the hash trick)
** Can deal with missing values/sparse-features
* Other features:
** On the fly generation of feature interactions (quadratic and cubic)
** On the fly generation of N-grams with optional skips (useful for word/language data-sets)
** Automatic test-set holdout and early termination on multiple passes
** bootstrapping
** User settable online learning progress report + auditing of the model
** Hyperparameter optimization

== Scalability ==

Vowpal Wabbit has been used to learn a tera-feature ($10^{12}$) data-set on 1000 nodes in one hour. Its scalability is aided by several factors:

* Out-of-core online learning: no need to load all data into memory
* The hashing trick: feature identities are converted to a weight index via a hash (uses 32-bit MurmurHash3)
* Exploiting multi-core CPUs: parsing of input and learning are done in separate threads
* Compiled C++ code
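The first two scalability factors can be illustrated together. The sketch below is not Vowpal Wabbit code and does not use its API: it is a hypothetical, minimal online learner that hashes feature names to weight indices and performs one squared-loss SGD step per example. `zlib.crc32` stands in for the 32-bit MurmurHash3 mentioned above because it is available in the Python standard library; the table size, learning rate, and all function names are assumptions.

```python
import zlib

NUM_BITS = 18                       # assumed weight-table size: 2**18 slots
MASK = (1 << NUM_BITS) - 1

def feature_index(name):
    """Map a feature name to a weight index via a 32-bit hash.

    crc32 is only a stand-in here; it is not the hash VW uses.
    """
    return zlib.crc32(name.encode("utf-8")) & MASK

def sgd_update(weights, features, label, learning_rate=0.1):
    """One online squared-loss SGD step on a sparse example.

    features: dict mapping feature name -> numeric value.
    Returns the prediction made before the update.
    """
    indexed = [(feature_index(n), v) for n, v in features.items()]
    prediction = sum(weights.get(i, 0.0) * v for i, v in indexed)
    error = prediction - label
    for i, v in indexed:
        weights[i] = weights.get(i, 0.0) - learning_rate * error * v
    return prediction

def train(stream):
    """Consume examples one at a time; the data set is never held in memory."""
    weights = {}
    for features, label in stream:
        sgd_update(weights, features, label)
    return weights

# Hypothetical stream: the same one-feature example seen 50 times.
stream = (({"price": 1.0}, 2.0) for _ in range(50))
weights = train(stream)
learned = weights[feature_index("price")]
```

Because names are hashed, no dictionary of feature identities has to be built or stored, and distinct features may occasionally share a weight slot; a larger `NUM_BITS` makes such collisions rarer at the cost of a larger table.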