AI Assistant Roblox

AI Assistant Roblox — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Wumpus world

    Wumpus world

    Wumpus world is a simple world use in artificial intelligence for which to represent knowledge and to reason. Wumpus world was introduced by Michael Genesereth, and is discussed in the Russell-Norvig Artificial Intelligence book Artificial Intelligence: A Modern Approach. Wumpus World is loosely inspired by the 1972 video game Hunt the Wumpus. == Problem description == In Artificial Intelligence: A Modern Approach, the wumpus world features a 4x4 grid, containing a monster called a wumpus, multiple bottomless pits and hidden gold. The agent starts at (1,1) and has to find the gold and return to the starting position. The agent loses 1 point for every move and gains 1000 points for bringing the gold to the starting position. The agent can sense pits by a breeze, stench indicates a wumpus, and sparkle indicates gold. The wumpus can be killed by an arrow but costs 10 points.

    Read more →
  • Occam learning

    Occam learning

    In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set. Occam learnability implies PAC learning, and for a wide variety of concept classes, the converse is also true: PAC learnability implies Occam learnability. == Introduction == Occam Learning is named after Occam's razor, which is a principle stating that, given all other things being equal, a shorter explanation for observed data should be favored over a lengthier explanation. The theory of Occam learning is a formal and mathematical justification for this principle. It was first shown by Blumer, et al. that Occam learning implies PAC learning, which is the standard model of learning in computational learning theory. In other words, parsimony (of the output hypothesis) implies predictive power. == Definition of Occam learning == The succinctness of a concept c {\displaystyle c} in concept class C {\displaystyle {\mathcal {C}}} can be expressed by the length s i z e ( c ) {\displaystyle size(c)} of the shortest bit string that can represent c {\displaystyle c} in C {\displaystyle {\mathcal {C}}} . Occam learning connects the succinctness of a learning algorithm's output to its predictive power on unseen data. Let C {\displaystyle {\mathcal {C}}} and H {\displaystyle {\mathcal {H}}} be concept classes containing target concepts and hypotheses respectively. Then, for constants α ≥ 0 {\displaystyle \alpha \geq 0} and 0 ≤ β < 1 {\displaystyle 0\leq \beta <1} , a learning algorithm L {\displaystyle L} is an ( α , β ) {\displaystyle (\alpha ,\beta )} -Occam algorithm for C {\displaystyle {\mathcal {C}}} using H {\displaystyle {\mathcal {H}}} iff, given a set S = { x 1 , … , x m } {\displaystyle S=\{x_{1},\dots ,x_{m}\}} of m {\displaystyle m} samples labeled according to a concept c ∈ C {\displaystyle c\in {\mathcal {C}}} , L {\displaystyle L} outputs a hypothesis h ∈ H {\displaystyle h\in {\mathcal {H}}} such that h {\displaystyle h} is consistent with c {\displaystyle c} on S {\displaystyle S} (that is, h ( x ) = c ( x ) , ∀ x ∈ S {\displaystyle h(x)=c(x),\forall x\in S} ), and s i z e ( h ) ≤ ( n ⋅ s i z e ( c ) ) α m β {\displaystyle size(h)\leq (n\cdot size(c))^{\alpha }m^{\beta }} where n {\displaystyle n} is the maximum length of any sample x ∈ S {\displaystyle x\in S} . An Occam algorithm is called efficient if it runs in time polynomial in n {\displaystyle n} , m {\displaystyle m} , and s i z e ( c ) . {\displaystyle size(c).} We say a concept class C {\displaystyle {\mathcal {C}}} is Occam learnable with respect to a hypothesis class H {\displaystyle {\mathcal {H}}} if there exists an efficient Occam algorithm for C {\displaystyle {\mathcal {C}}} using H . {\displaystyle {\mathcal {H}}.} == The relation between Occam and PAC learning == Occam learnability implies PAC learnability, as the following theorem of Blumer, et al. shows: === Theorem (Occam learning implies PAC learning) === Let L {\displaystyle L} be an efficient ( α , β ) {\displaystyle (\alpha ,\beta )} -Occam algorithm for C {\displaystyle {\mathcal {C}}} using H {\displaystyle {\mathcal {H}}} . Then there exists a constant a > 0 {\displaystyle a>0} such that for any 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} , for any distribution D {\displaystyle {\mathcal {D}}} , given m ≥ a ( 1 ϵ log ⁡ 1 δ + ( ( n ⋅ s i z e ( c ) ) α ϵ ) 1 1 − β ) {\displaystyle m\geq a\left({\frac {1}{\epsilon }}\log {\frac {1}{\delta }}+\left({\frac {(n\cdot size(c))^{\alpha }}{\epsilon }}\right)^{\frac {1}{1-\beta }}\right)} samples drawn from D {\displaystyle {\mathcal {D}}} and labelled according to a concept c ∈ C {\displaystyle c\in {\mathcal {C}}} of length n {\displaystyle n} bits each, the algorithm L {\displaystyle L} will output a hypothesis h ∈ H {\displaystyle h\in {\mathcal {H}}} such that e r r o r ( h ) ≤ ϵ {\displaystyle error(h)\leq \epsilon } with probability at least 1 − δ {\displaystyle 1-\delta } .Here, e r r o r ( h ) {\displaystyle error(h)} is with respect to the concept c {\displaystyle c} and distribution D {\displaystyle {\mathcal {D}}} . This implies that the algorithm L {\displaystyle L} is also a PAC learner for the concept class C {\displaystyle {\mathcal {C}}} using hypothesis class H {\displaystyle {\mathcal {H}}} . A slightly more general formulation is as follows: === Theorem (Occam learning implies PAC learning, cardinality version) === Let 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} . Let L {\displaystyle L} be an algorithm such that, given m {\displaystyle m} samples drawn from a fixed but unknown distribution D {\displaystyle {\mathcal {D}}} and labeled according to a concept c ∈ C {\displaystyle c\in {\mathcal {C}}} of length n {\displaystyle n} bits each, outputs a hypothesis h ∈ H n , m {\displaystyle h\in {\mathcal {H}}_{n,m}} that is consistent with the labeled samples. Then, there exists a constant b {\displaystyle b} such that if log ⁡ | H n , m | ≤ b ϵ m − log ⁡ 1 δ {\displaystyle \log |{\mathcal {H}}_{n,m}|\leq b\epsilon m-\log {\frac {1}{\delta }}} , then L {\displaystyle L} is guaranteed to output a hypothesis h ∈ H n , m {\displaystyle h\in {\mathcal {H}}_{n,m}} such that e r r o r ( h ) ≤ ϵ {\displaystyle error(h)\leq \epsilon } with probability at least 1 − δ {\displaystyle 1-\delta } . While the above theorems show that Occam learning is sufficient for PAC learning, it doesn't say anything about necessity. Board and Pitt show that, for a wide variety of concept classes, Occam learning is in fact necessary for PAC learning. They proved that for any concept class that is polynomially closed under exception lists, PAC learnability implies the existence of an Occam algorithm for that concept class. Concept classes that are polynomially closed under exception lists include Boolean formulas, circuits, deterministic finite automata, decision-lists, decision-trees, and other geometrically defined concept classes. A concept class C {\displaystyle {\mathcal {C}}} is polynomially closed under exception lists if there exists a polynomial-time algorithm A {\displaystyle A} such that, when given the representation of a concept c ∈ C {\displaystyle c\in {\mathcal {C}}} and a finite list E {\displaystyle E} of exceptions, outputs a representation of a concept c ′ ∈ C {\displaystyle c'\in {\mathcal {C}}} such that the concepts c {\displaystyle c} and c ′ {\displaystyle c'} agree except on the set E {\displaystyle E} . == Proof that Occam learning implies PAC learning == We first prove the Cardinality version. Call a hypothesis h ∈ H {\displaystyle h\in {\mathcal {H}}} bad if e r r o r ( h ) ≥ ϵ {\displaystyle error(h)\geq \epsilon } , where again e r r o r ( h ) {\displaystyle error(h)} is with respect to the true concept c {\displaystyle c} and the underlying distribution D {\displaystyle {\mathcal {D}}} . The probability that a set of samples S {\displaystyle S} is consistent with h {\displaystyle h} is at most ( 1 − ϵ ) m {\displaystyle (1-\epsilon )^{m}} , by the independence of the samples. By the union bound, the probability that there exists a bad hypothesis in H n , m {\displaystyle {\mathcal {H}}_{n,m}} is at most | H n , m | ( 1 − ϵ ) m {\displaystyle |{\mathcal {H}}_{n,m}|(1-\epsilon )^{m}} , which is less than δ {\displaystyle \delta } if log ⁡ | H n , m | ≤ O ( ϵ m ) − log ⁡ 1 δ {\displaystyle \log |{\mathcal {H}}_{n,m}|\leq O(\epsilon m)-\log {\frac {1}{\delta }}} . This concludes the proof of the second theorem above. Using the second theorem, we can prove the first theorem. Since we have a ( α , β ) {\displaystyle (\alpha ,\beta )} -Occam algorithm, this means that any hypothesis output by L {\displaystyle L} can be represented by at most ( n ⋅ s i z e ( c ) ) α m β {\displaystyle (n\cdot size(c))^{\alpha }m^{\beta }} bits, and thus log ⁡ | H n , m | ≤ ( n ⋅ s i z e ( c ) ) α m β {\displaystyle \log |{\mathcal {H}}_{n,m}|\leq (n\cdot size(c))^{\alpha }m^{\beta }} . This is less than O ( ϵ m ) − log ⁡ 1 δ {\displaystyle O(\epsilon m)-\log {\frac {1}{\delta }}} if we set m ≥ a ( 1 ϵ log ⁡ 1 δ + ( ( n ⋅ s i z e ( c ) ) α ) ϵ ) 1 1 − β ) {\displaystyle m\geq a\left({\frac {1}{\epsilon }}\log {\frac {1}{\delta }}+\left({\frac {(n\cdot size(c))^{\alpha })}{\epsilon }}\right)^{\frac {1}{1-\beta }}\right)} for some constant a > 0 {\displaystyle a>0} . Thus, by the Cardinality version Theorem, L {\displaystyle L} will output a consistent hypothesis h {\displaystyle h} with probability at least 1 − δ {\displaystyle 1-\delta } . This concludes the proof of the first theorem above. == Improving sample complexity for common problems == Though Occam and PAC learnability are equivalent, the Occam framework can be used to produce tighter bounds on the sample complexity of classical problems including conjunctions, co

    Read more →
  • Transkribus

    Transkribus

    Transkribus is a platform for the text recognition, image analysis and structure recognition of historical documents. The platform was created in the context of the two EU projects "tranScriptorium" (2013–2015) and "READ" (Recognition and Enrichment of Archival Documents – 2016–2019). It was developed by the University of Innsbruck. Since July 1, 2019 the platform has been directed and further developed by the READ-COOP, a non-profit cooperative. The platform integrates tools developed by research groups throughout Europe, including the Pattern Recognition and Human Language Technology (PRHLT) group of the Technical University of Valencia and the Computational Intelligence Technology Lab (CITlab) group of University of Rostock. Comparable programs that offer similar functions are eScriptorium and OCR4All.

    Read more →
  • Conference on Computer Vision and Pattern Recognition

    Conference on Computer Vision and Pattern Recognition

    The Conference on Computer Vision and Pattern Recognition is an annual conference on computer vision and pattern recognition. == Affiliations == The conference was first held in 1983 in Washington, DC, organized by Takeo Kanade and Dana H. Ballard. From 1985 to 2010 it was sponsored by the IEEE Computer Society. In 2011 it was also co-sponsored by University of Colorado Colorado Springs. Since 2012 it has been co-sponsored by the IEEE Computer Society and the Computer Vision Foundation, which provides open access to the conference papers. == Scope == The conference considers a wide range of topics related to computer vision and pattern recognition—basically any topic that is extracting structures or answers from images or video or applying mathematical methods to data to extract or recognize patterns. Common topics include object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning. The conference generally has less than 30% acceptance rates for all papers and less than 5% for oral presentations. It is managed by a rotating group of volunteers who are chosen in a public election at the Pattern Analysis and Machine Intelligence-Technical Community (PAMI-TC) meeting four years before the meeting. The conference uses a multi-tier double-blind peer review process. The program chairs, who cannot submit papers, select area chairs who manage the reviewers for their subset of submissions. == Location and time == The conference is usually held in June in North America. == Awards == === Best Paper Award === These awards are picked by committees delegated by the program chairs of the conference. === Longuet-Higgins Prize === The Longuet-Higgins Prize recognizes papers from ten years ago that have made a significant impact on computer vision research. === PAMI Young Researcher Award === The Pattern Analysis and Machine Intelligence Young Researcher Award is an award given by the Technical Committee on Pattern Analysis and Machine Intelligence of the IEEE Computer Society to a researcher within 7 years of completing their Ph.D. for outstanding early career research contributions. Candidates are nominated by the computer vision community, with winners selected by a committee of senior researchers in the field. This award was originally instituted in 2012 by the journal Image and Vision Computing, also presented at the conference, and the journal continues to sponsor the award. === PAMI Thomas S. Huang Memorial Prize === The Thomas Huang Memorial Prize was established at the 2020 conference and is awarded annually starting from 2021 to honor researchers who are recognized as examples in research, teaching/mentoring, and service to the computer vision community.

    Read more →
  • Case-based reasoning

    Case-based reasoning

    Case-based reasoning (CBR), broadly construed, is the process of solving new problems based on the solutions of similar past problems. In everyday life, an auto mechanic who fixes an engine by recalling another car that exhibited similar symptoms is using case-based reasoning. A lawyer who advocates a particular outcome in a trial based on legal precedents or a judge who creates case law is using case-based reasoning. So, too, an engineer copying working elements of nature (practicing biomimicry) is treating nature as a database of solutions to problems. Case-based reasoning is a prominent type of analogy solution making. It has been argued that case-based reasoning is not only a powerful method for computer reasoning, but also a pervasive behavior in everyday human problem solving; or, more radically, that all reasoning is based on past cases personally experienced. This view is related to prototype theory, which is most deeply explored in cognitive science. == Process == Case-based reasoning has been formalized for purposes of computer reasoning as a four-step process: Retrieve: Given a target problem, retrieve cases relevant to solving it from memory. A case consists of a problem, its solution, and, typically, annotations about how the solution was derived. For example, suppose Fred wants to prepare blueberry pancakes. Being a novice cook, the most relevant experience he can recall is one in which he successfully made plain pancakes. The procedure he followed for making the plain pancakes, together with justifications for decisions made along the way, constitutes Fred's retrieved case. Reuse: Map the solution from the previous case to the target problem. This may involve adapting the solution as needed to fit the new situation. In the pancake example, Fred must adapt his retrieved solution to include the addition of blueberries. Revise: Having mapped the previous solution to the target situation, test the new solution in the real world (or a simulation) and, if necessary, revise. Suppose Fred adapted his pancake solution by adding blueberries to the batter. After mixing, he discovers that the batter has turned blue – an undesired effect. This suggests the following revision: delay the addition of blueberries until after the batter has been ladled into the pan. Retain: After the solution has been successfully adapted to the target problem, store the resulting experience as a new case in memory. Fred, accordingly, records his new-found procedure for making blueberry pancakes, thereby enriching his set of stored experiences, and better preparing him for future pancake-making demands. == Comparison to other methods == At first glance, CBR may seem similar to the rule induction algorithms of machine learning. Like a rule-induction algorithm, CBR starts with a set of cases or training examples; it forms generalizations of these examples, albeit implicit ones, by identifying commonalities between a retrieved case and the target problem. If for instance a procedure for plain pancakes is mapped to blueberry pancakes, a decision is made to use the same basic batter and frying method, thus implicitly generalizing the set of situations under which the batter and frying method can be used. The key difference, however, between the implicit generalization in CBR and the generalization in rule induction lies in when the generalization is made. A rule-induction algorithm draws its generalizations from a set of training examples before the target problem is even known; that is, it performs eager generalization. For instance, if a rule-induction algorithm were given recipes for plain pancakes, Dutch apple pancakes, and banana pancakes as its training examples, it would have to derive, at training time, a set of general rules for making all types of pancakes. It would not be until testing time that it would be given, say, the task of cooking blueberry pancakes. The difficulty for the rule-induction algorithm is in anticipating the different directions in which it should attempt to generalize its training examples. This is in contrast to CBR, which delays (implicit) generalization of its cases until testing time – a strategy of lazy generalization. In the pancake example, CBR has already been given the target problem of cooking blueberry pancakes; thus it can generalize its cases exactly as needed to cover this situation. CBR therefore tends to be a good approach for rich, complex domains in which there are myriad ways to generalize a case. In law, there is often explicit delegation of CBR to courts, recognizing the limits of rule based reasons: limiting delay, limited knowledge of future context, limit of negotiated agreement, etc. While CBR in law and cognitively inspired CBR have long been associated, the former is more clearly an interpolation of rule based reasoning, and judgment, while the latter is more closely tied to recall and process adaptation. The difference is clear in their attitude toward error and appellate review. Another name for case-based reasoning in problem solving is symptomatic strategies. It does require à priori domain knowledge that is gleaned from past experience which established connections between symptoms and causes. This knowledge is referred to as shallow, compiled, evidential, history-based as well as case-based knowledge. This is the strategy most associated with diagnosis by experts. Diagnosis of a problem transpires as a rapid recognition process in which symptoms evoke appropriate situation categories. An expert knows the cause by virtue of having previously encountered similar cases. Case-based reasoning is the most powerful strategy, and that used most commonly. However, the strategy won't work independently with truly novel problems, or where deeper understanding of whatever is taking place is sought. An alternative approach to problem solving is the topographic strategy which falls into the category of deep reasoning. With deep reasoning, in-depth knowledge of a system is used. Topography in this context means a description or an analysis of a structured entity, showing the relations among its elements. Also known as reasoning from first principles, deep reasoning is applied to novel faults when experience-based approaches aren't viable. The topographic strategy is therefore linked to à priori domain knowledge that is developed from a more a fundamental understanding of a system, possibly using first-principles knowledge. Such knowledge is referred to as deep, causal or model-based knowledge. Hoc and Carlier noted that symptomatic approaches may need to be supported by topographic approaches because symptoms can be defined in diverse terms. The converse is also true – shallow reasoning can be used abductively to generate causal hypotheses, and deductively to evaluate those hypotheses, in a topographical search. == Criticism == Critics of CBR argue that it is an approach that accepts anecdotal evidence as its main operating principle. Without statistically relevant data for backing and implicit generalization, there is no guarantee that the generalization is correct. However, all inductive reasoning where data is too scarce for statistical relevance is inherently based on anecdotal evidence. == History == CBR traces its roots to the work of Roger Schank and his students at Yale University in the early 1980s. Schank's model of dynamic memory was the basis for the earliest CBR systems: Janet Kolodner's CYRUS and Michael Lebowitz's IPP. Other schools of CBR and closely allied fields emerged in the 1980s, which directed at topics such as legal reasoning, memory-based reasoning (a way of reasoning from examples on massively parallel machines), and combinations of CBR with other reasoning methods. In the 1990s, interest in CBR grew internationally, as evidenced by the establishment of an International Conference on Case-Based Reasoning in 1995, as well as European, German, British, Italian, and other CBR workshops. CBR technology has resulted in the deployment of a number of successful systems, the earliest being Lockheed's CLAVIER, a system for laying out composite parts to be baked in an industrial convection oven. CBR has been used extensively in applications such as the Compaq SMART system and has found a major application area in the health sciences, as well as in structural safety management. There is recent work that develops CBR within a statistical framework and formalizes case-based inference as a specific type of probabilistic inference. Thus, it becomes possible to produce case-based predictions equipped with a certain level of confidence. One description of the difference between CBR and induction from instances is that statistical inference aims to find what tends to make cases similar while CBR aims to encode what suffices to claim similarly.

    Read more →
  • Sparse PCA

    Sparse PCA

    Sparse principal component analysis (SPCA or sparse PCA) is a technique used in statistical analysis and, in particular, in the analysis of multivariate data sets. It extends the classic method of principal component analysis (PCA) for the reduction of dimensionality of data by introducing sparsity structures to the input variables. A particular disadvantage of ordinary PCA is that the principal components are usually linear combinations of all input variables. SPCA overcomes this disadvantage by finding components that are linear combinations of just a few input variables (SPCs). This means that some of the coefficients of the linear combinations defining the SPCs, called loadings, are equal to zero. The number of nonzero loadings is called the cardinality of the SPC. == Mathematical formulation == Consider a data matrix, X {\displaystyle X} , where each of the p {\displaystyle p} columns represent an input variable, and each of the n {\displaystyle n} rows represents an independent sample from data population. One assumes each column of X {\displaystyle X} has mean zero, otherwise one can subtract column-wise mean from each element of X {\displaystyle X} . Let Σ = 1 n − 1 X ⊤ X {\displaystyle \Sigma ={\frac {1}{n-1}}X^{\top }X} be the empirical covariance matrix of X {\displaystyle X} , which has dimension p × p {\displaystyle p\times p} . Given an integer k {\displaystyle k} with 1 ≤ k ≤ p {\displaystyle 1\leq k\leq p} , the sparse PCA problem can be formulated as maximizing the variance along a direction represented by vector v ∈ R p {\displaystyle v\in \mathbb {R} ^{p}} while constraining its cardinality: max v T Σ v subject to ‖ v ‖ 2 = 1 ‖ v ‖ 0 ≤ k . {\displaystyle {\begin{aligned}\max \quad &v^{T}\Sigma v\\{\text{subject to}}\quad &\left\Vert v\right\Vert _{2}=1\\&\left\Vert v\right\Vert _{0}\leq k.\end{aligned}}} Eq. 1 The first constraint specifies that v is a unit vector. In the second constraint, ‖ v ‖ 0 {\displaystyle \left\Vert v\right\Vert _{0}} represents the ℓ 0 {\displaystyle \ell _{0}} pseudo-norm of v, which is defined as the number of its non-zero components. So the second constraint specifies that the number of non-zero components in v is less than or equal to k, which is typically an integer that is much smaller than dimension p. The optimal value of Eq. 1 is known as the k-sparse largest eigenvalue. If one takes k=p, the problem reduces to the ordinary PCA, and the optimal value becomes the largest eigenvalue of covariance matrix Σ. After finding the optimal solution v, one deflates Σ to obtain a new matrix Σ 1 = Σ − ( v T Σ v ) v v T , {\displaystyle \Sigma _{1}=\Sigma -(v^{T}\Sigma v)vv^{T},} and iterate this process to obtain further principal components. However, unlike PCA, sparse PCA cannot guarantee that different principal components are orthogonal. In order to achieve orthogonality, additional constraints must be enforced. The following equivalent definition is in matrix form. Let V {\displaystyle V} be a p×p symmetric matrix, one can rewrite the sparse PCA problem as max T r ( Σ V ) subject to T r ( V ) = 1 ‖ V ‖ 0 ≤ k 2 R a n k ( V ) = 1 , V ⪰ 0. {\displaystyle {\begin{aligned}\max \quad &Tr(\Sigma V)\\{\text{subject to}}\quad &Tr(V)=1\\&\Vert V\Vert _{0}\leq k^{2}\\&Rank(V)=1,V\succeq 0.\end{aligned}}} Eq. 2 Tr is the matrix trace, and ‖ V ‖ 0 {\displaystyle \Vert V\Vert _{0}} represents the non-zero elements in matrix V. The last line specifies that V has matrix rank one and is positive semidefinite. The last line means that one has V = v v T {\displaystyle V=vv^{T}} , so Eq. 2 is equivalent to Eq. 1. Moreover, the rank constraint in this formulation is actually redundant, and therefore sparse PCA can be cast as the following mixed-integer semidefinite program max T r ( Σ V ) subject to T r ( V ) = 1 | V i , i | ≤ z i , ∀ i ∈ { 1 , . . . , p } , | V i , j | ≤ 1 2 z i , ∀ i , j ∈ { 1 , . . . , p } : i ≠ j , V ⪰ 0 , z ∈ { 0 , 1 } p , ∑ i z i ≤ k {\displaystyle {\begin{aligned}\max \quad &Tr(\Sigma V)\\{\text{subject to}}\quad &Tr(V)=1\\&\vert V_{i,i}\vert \leq z_{i},\forall i\in \{1,...,p\},\vert V_{i,j}\vert \leq {\frac {1}{2}}z_{i},\forall i,j\in \{1,...,p\}:i\neq j,\\&V\succeq 0,z\in \{0,1\}^{p},\sum _{i}z_{i}\leq k\end{aligned}}} Eq. 3 Because of the cardinality constraint, the maximization problem is hard to solve exactly, especially when dimension p is high. In fact, the sparse PCA problem in Eq. 1 is NP-hard in the strong sense. == Computational considerations == As most sparse problems, variable selection in SPCA is a computationally intractable non-convex NP-hard problem, therefore greedy sub-optimal algorithms are often employed to find solutions. Note also that SPCA introduces hyperparameters quantifying in what capacity large parameter values are penalized. These might need tuning to achieve satisfactory performance, thereby adding to the total computational cost. == Algorithms for SPCA == Several alternative approaches (of Eq. 1) have been proposed, including a regression framework, a penalized matrix decomposition framework, a convex relaxation/semidefinite programming framework, a generalized power method framework an alternating maximization framework forward-backward greedy search and exact methods using branch-and-bound techniques, a certifiably optimal branch-and-bound approach Bayesian formulation framework. A certifiably optimal mixed-integer semidefinite branch-and-cut approach The methodological and theoretical developments of Sparse PCA as well as its applications in scientific studies are recently reviewed in a survey paper. === Notes on Semidefinite Programming Relaxation === It has been proposed that sparse PCA can be approximated by semidefinite programming (SDP). If one drops the rank constraint and relaxes the cardinality constraint by a 1-norm convex constraint, one gets a semidefinite programming relaxation, which can be solved efficiently in polynomial time: max T r ( Σ V ) subject to T r ( V ) = 1 1 T | V | 1 ≤ k V ⪰ 0. {\displaystyle {\begin{aligned}\max \quad &Tr(\Sigma V)\\{\text{subject to}}\quad &Tr(V)=1\\&\mathbf {1} ^{T}|V|\mathbf {1} \leq k\\&V\succeq 0.\end{aligned}}} Eq. 3 In the second constraint, 1 {\displaystyle \mathbf {1} } is a p×1 vector of ones, and |V| is the matrix whose elements are the absolute values of the elements of V. The optimal solution V {\displaystyle V} to the relaxed problem Eq. 3 is not guaranteed to have rank one. In that case, V {\displaystyle V} can be truncated to retain only the dominant eigenvector. While the semidefinite program does not scale beyond n=300 covariates, it has been shown that a second-order cone relaxation of the semidefinite relaxation is almost as tight and successfully solves problems with n=1000s of covariates == Applications == === Financial Data Analysis === Suppose ordinary PCA is applied to a dataset where each input variable represents a different asset, it may generate principal components that are weighted combination of all the assets. In contrast, sparse PCA would produce principal components that are weighted combination of only a few input assets, so one can easily interpret its meaning. Furthermore, if one uses a trading strategy based on these principal components, fewer assets imply less transaction costs. === Biology === Consider a dataset where each input variable corresponds to a specific gene. Sparse PCA can produce a principal component that involves only a few genes, so researchers can focus on these specific genes for further analysis. === High-dimensional Hypothesis Testing === Contemporary datasets often have the number of input variables ( p {\displaystyle p} ) comparable with or even much larger than the number of samples ( n {\displaystyle n} ). It has been shown that if p / n {\displaystyle p/n} does not converge to zero, the classical PCA is not consistent. In other words, if we let k = p {\displaystyle k=p} in Eq. 1, then the optimal value does not converge to the largest eigenvalue of data population when the sample size n → ∞ {\displaystyle n\rightarrow \infty } , and the optimal solution does not converge to the direction of maximum variance. But sparse PCA can retain consistency even if p ≫ n . {\displaystyle p\gg n.} The k-sparse largest eigenvalue (the optimal value of Eq. 1) can be used to discriminate an isometric model, where every direction has the same variance, from a spiked covariance model in high-dimensional setting. Consider a hypothesis test where the null hypothesis specifies that data X {\displaystyle X} are generated from a multivariate normal distribution with mean 0 and covariance equal to an identity matrix, and the alternative hypothesis specifies that data X {\displaystyle X} is generated from a spiked model with signal strength θ {\displaystyle \theta } : H 0 : X ∼ N ( 0 , I p ) , H 1 : X ∼ N ( 0 , I p + θ v v T ) , {\displaystyle H_{0}:X\sim N(0,I_{p}),\quad H_{1}:X\sim N(0,I_{p}+\theta vv^{T}),} where v ∈ R p {\displaystyle v\in \mathbb {R} ^{p}

    Read more →
  • Nonlinear dimensionality reduction

    Nonlinear dimensionality reduction

    Nonlinear dimensionality reduction (NLDR), also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear decomposition methods, onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping (either from the high-dimensional space to the low-dimensional embedding or vice versa) itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis. == Applications of NLDR == High dimensional data can be hard for machines to work with, requiring significant time and space for analysis. It also presents a challenge for humans, since it's hard to visualize or understand data in more than three dimensions. Reducing the dimensionality of a data set, while keeping its essential features relatively intact, can make algorithms more efficient and allow analysts to visualize trends and patterns. The reduced-dimensional representations of data are often referred to as "intrinsic variables". This description implies that these are the values from which the data was produced. For example, consider a dataset that contains images of a letter 'A', which has been scaled and rotated by varying amounts. Each image has 32×32 pixels. Each image can be represented as a vector of 1024 pixel values. Each row is a sample on a two-dimensional manifold in 1024-dimensional space (a Hamming space). The intrinsic dimensionality is two, because two variables (rotation and scale) were varied in order to produce the data. Information about the shape or look of a letter 'A' is not part of the intrinsic variables because it is the same in every instance. Nonlinear dimensionality reduction will discard the correlated information (the letter 'A') and recover only the varying information (rotation and scale). By comparison, if principal component analysis, which is a linear dimensionality reduction algorithm, is used to reduce this same dataset into two dimensions, the resulting values are not so well organized. This demonstrates that the high-dimensional vectors (each representing a letter 'A') that sample this manifold vary in a non-linear manner. It should be apparent, therefore, that NLDR has several applications in the field of computer-vision. For example, consider a robot that uses a camera to navigate in a closed static environment. The images obtained by that camera can be considered to be samples on a manifold in high-dimensional space, and the intrinsic variables of that manifold will represent the robot's position and orientation. Invariant manifolds are of general interest for model order reduction in dynamical systems. In particular, if there is an attracting invariant manifold in the phase space, nearby trajectories will converge onto it and stay on it indefinitely, rendering it a candidate for dimensionality reduction of the dynamical system. While such manifolds are not guaranteed to exist in general, the theory of spectral submanifolds (SSM) gives conditions for the existence of unique attracting invariant objects in a broad class of dynamical systems. Active research in NLDR seeks to unfold the observation manifolds associated with dynamical systems to develop modeling techniques. Some of the more prominent nonlinear dimensionality reduction techniques are listed below. == Important concepts == === Sammon's mapping === Sammon's mapping is one of the first and most popular NLDR techniques. === Self-organizing map === The self-organizing map (SOM, also called Kohonen map) and its probabilistic variant generative topographic mapping (GTM) use a point representation in the embedded space to form a latent variable model based on a non-linear mapping from the embedded space to the high-dimensional space. These techniques are related to work on density networks, which also are based around the same probabilistic model. === Kernel principal component analysis === Perhaps the most widely used algorithm for dimensional reduction is kernel PCA. PCA begins by computing the covariance matrix of the m × n {\displaystyle m\times n} matrix X {\displaystyle \mathbf {X} } C = 1 m ∑ i = 1 m x i x i T . {\displaystyle C={\frac {1}{m}}\sum _{i=1}^{m}{\mathbf {x} _{i}\mathbf {x} _{i}^{\mathsf {T}}}.} It then projects the data onto the first k eigenvectors of that matrix. By comparison, KPCA begins by computing the covariance matrix of the data after being transformed into a higher-dimensional space, C = 1 m ∑ i = 1 m Φ ( x i ) Φ ( x i ) T . {\displaystyle C={\frac {1}{m}}\sum _{i=1}^{m}{\Phi (\mathbf {x} _{i})\Phi (\mathbf {x} _{i})^{\mathsf {T}}}.} It then projects the transformed data onto the first k eigenvectors of that matrix, just like PCA. It uses the kernel trick to factor away much of the computation, such that the entire process can be performed without actually computing Φ ( x ) {\displaystyle \Phi (\mathbf {x} )} . Of course Φ {\displaystyle \Phi } must be chosen such that it has a known corresponding kernel. Unfortunately, it is not trivial to find a good kernel for a given problem, so KPCA does not yield good results with some problems when using standard kernels. For example, it is known to perform poorly with these kernels on the Swiss roll manifold. However, one can view certain other methods that perform well in such settings (e.g., Laplacian Eigenmaps, LLE) as special cases of kernel PCA by constructing a data-dependent kernel matrix. KPCA has an internal model, so it can be used to map points onto its embedding that were not available at training time. === Principal curves and manifolds === Principal curves and manifolds give the natural geometric framework for nonlinear dimensionality reduction and extend the geometric interpretation of PCA by explicitly constructing an embedded manifold, and by encoding using standard geometric projection onto the manifold. This approach was originally proposed by Trevor Hastie in his 1984 thesis, which he formally introduced in 1989. This idea has been explored further by many authors. How to define the "simplicity" of the manifold is problem-dependent, however, it is commonly measured by the intrinsic dimensionality and/or the smoothness of the manifold. Usually, the principal manifold is defined as a solution to an optimization problem. The objective function includes a quality of data approximation and some penalty terms for the bending of the manifold. The popular initial approximations are generated by linear PCA and Kohonen's SOM. === Laplacian eigenmaps === Laplacian eigenmaps uses spectral techniques to perform dimensionality reduction. This technique relies on the basic assumption that the data lies in a low-dimensional manifold in a high-dimensional space. This algorithm cannot embed out-of-sample points, but techniques based on Reproducing kernel Hilbert space regularization exist for adding this capability. Such techniques can be applied to other nonlinear dimensionality reduction algorithms as well. Traditional techniques like principal component analysis do not consider the intrinsic geometry of the data. Laplacian eigenmaps builds a graph from neighborhood information of the data set. Each data point serves as a node on the graph and connectivity between nodes is governed by the proximity of neighboring points (using e.g. the k-nearest neighbor algorithm). The graph thus generated can be considered as a discrete approximation of the low-dimensional manifold in the high-dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances. The eigenfunctions of the Laplace–Beltrami operator on the manifold serve as the embedding dimensions, since under mild conditions this operator has a countable spectrum that is a basis for square integrable functions on the manifold (compare to Fourier series on the unit circle manifold). Attempts to place Laplacian eigenmaps on solid theoretical ground have met with some success, as under certain nonrestrictive assumptions, the graph Laplacian matrix has been shown to converge to the Laplace–Beltrami operator as the number of points goes to infinity. === Isomap === Isomap is a combination of the Floyd–Warshall algorithm with classic Multidimensional Scaling (MDS). Classic MDS takes a matrix of pair-wise distances between all points and computes a position for each point. Isomap assumes that the pair-wise distances are only known between neighboring points, and uses the Floyd–Warshall algorithm to compute the pair-wise distances between all other points. This effectively estimates the full matrix of pair-wise geodesic distances between all of the points. Isomap th

    Read more →
  • Generalized multidimensional scaling

    Generalized multidimensional scaling

    Generalized multidimensional scaling (GMDS) is an extension of metric multidimensional scaling, in which the target space is non-Euclidean. When the dissimilarities are distances on a surface and the target space is another surface, GMDS allows finding the minimum-distortion embedding of one surface into another. GMDS is an emerging research direction. Currently, main applications are recognition of deformable objects (e.g. for three-dimensional face recognition) and texture mapping.

    Read more →
  • Alias Eclipse

    Alias Eclipse

    Eclipse was a professional 2D image editing program available on Silicon Graphics and Windows workstations. Designed to manipulate high-resolution images like digitized movie frames and photographs for print, it offered color correction tools, image processing effects, rudimentary paint features, and spline-based drawing and masking. == History == Eclipse was originally developed in the late 1980s by Full Color Computing, an early provider of photo retouch and color prepress software for Silicon Graphics workstations. Alias Research (later Alias Systems Corporation), a developer of professional 3D graphics applications for the SGI platform, purchased the rights to Eclipse in fall 1990. Alias developed Eclipse through the early to mid-1990s, releasing version 2.5 in 1995 with improvements to the speed of color correction, effects, and rendering. Xyvision's Contex Prepress division purchased exclusive rights to Eclipse from Alias in 1996, and released version 3.0 the following year. Eclipse was subsequently sold to German developer Form & Vision GmbH, which continued development and ported it to the Windows platform. In 1999, Form & Vision released a demo of Eclipse 3.1.3 on the SGI platform which was limited to 1600 x 1600 pixel images, then ceased development of Eclipse on the SGI platform. Eclipse was thereafter developed exclusively for the Windows platform, culminating with version 3.1.4 in 2001. In the same year the firm went bankrupt. == Features == Eclipse was designed to work with very large images that could not be manipulated in real time on contemporary computer systems due to memory limitations, and thus allowed the user to make modifications to a lower-resolution copy of the original image in "proxy mode." Brush strokes, color corrections, and other edits were saved in proxy mode, then applied to the full-size image in post processing. This method also allowed for batch processing of a high-resolution image sequence using the edits applied to the original proxy image. Other features included color correction and separation, warping, special effects, text, and shape masking. Wavelet image compression created by LuraTech was added to Eclipse 3.1.4

    Read more →
  • Training, validation, and test data sets

    Training, validation, and test data sets

    In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. These input data used to build the model are usually divided into multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation, and testing sets. The model is initially fit on a training data set, which is a set of examples used to fit the parameters (e.g. weights of connections between neurons in artificial neural networks) of the model. The model (e.g. a naive Bayes classifier) is trained on the training data set using a supervised learning method, for example using optimization methods such as gradient descent or stochastic gradient descent. In practice, the training data set often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), where the answer key is commonly denoted as the target (or label). The current model is run with the training data set and produces a result, which is then compared with the target, for each input vector in the training data set. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted. The model fitting can include both variable selection and parameter estimation. Successively, the fitted model is used to predict the responses for the observations in a second data set called the validation data set. The validation data set provides an unbiased evaluation of a model fit on the training data set while tuning the model's hyperparameters (e.g. the number of hidden units—layers and layer widths—in a neural network). Validation data sets can be used for regularization by early stopping (stopping training when the error on the validation data set increases, as this is a sign of over-fitting to the training data set). This simple procedure is complicated in practice by the fact that the validation data set's error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when over-fitting has truly begun. Finally, the test data set is a data set used to provide an unbiased evaluation of a model fit on the training data set. When the data in the test data set has never been used (for example in cross-validation), the test data set is called a holdout data set. The term "validation set" is sometimes used instead of "test set" in some literature (e.g., if the original data set was partitioned into only two subsets, the test set might be referred to as the validation set). Deciding the sizes and strategies for data set division in training, test and validation sets is very dependent on the problem and data available. == Training data set == A training data set is a data set of examples used during the learning process and is used to fit the parameters (e.g., weights) of, for example, a classifier. For classification tasks, a supervised learning algorithm looks at the training data set to determine, or learn, the optimal combinations of variables that will generate a good predictive model. The goal is to produce a trained (fitted) model that generalizes well to new, unknown data. The fitted model is evaluated using “new” examples from the held-out data sets (validation and test data sets) to estimate the model’s accuracy in classifying new data. To reduce the risk of issues such as over-fitting, the examples in the validation and test data sets should not be used to train the model. Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify and exploit apparent relationships in the training data that do not hold in general. When a training set is continuously expanded with new data, then this is incremental learning. == Validation data set == A validation data set is a data set of examples used to tune the hyperparameters (i.e. the architecture) of a model. It is sometimes also called the development set or the "dev set". An example of a hyperparameter for artificial neural networks includes the number of hidden units in each layer. It, as well as the testing set (as mentioned below), should follow the same probability distribution as the training data set. In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation data set in addition to the training and test data sets. For example, if the most suitable classifier for the problem is sought, the training data set is used to train the different candidate classifiers, the validation data set is used to compare their performances and decide which one to take and, finally, the test data set is used to obtain the performance characteristics such as accuracy, sensitivity, specificity, F-measure, and so on. The validation data set functions as a hybrid: it is training data used for testing, but neither as part of the low-level training nor as part of the final testing. The basic process of using a validation data set for model selection (as part of training data set, validation data set, and test data set) is: Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set. An application of this process is in early stopping, where the candidate models are successive iterations of the same network, and training stops when the error on the validation set grows, choosing the previous model (the one with minimum error). == Test data set == A test data set is a data set that is independent of the training data set, but that follows the same probability distribution as the training data set. A test set is therefore a set of examples used only to assess the performance (i.e. generalization) of a specified classifier on unseen data. To do this, the model is used to predict classifications of examples in the test set. Those predictions are compared to the examples' true classifications to assess the model's accuracy. If a model fit to the training and validation data set also fits the test data set well, minimal overfitting has taken place (see figure below). A better fitting of the training or validation data sets as opposed to the test data set usually points to overfitting. In the scenario where a data set has a low number of samples, it is usually partitioned into a training set and a validation data set, where the model is trained on the training set and refined using the validation set to improve accuracy, but this approach will lead to overfitting. The holdout method can also be employed, where the test set is used at the end, after training on the training set. Other techniques, such as cross-validation and bootstrapping, are used on small data sets. The bootstrap method generates numerous simulated data sets of the same size by randomly sampling with replacement from the original data, allowing the random data points to serve as test sets for evaluating model performance. Cross-validation splits the data set into multiple folds, with a single sub-fold used as test data; the model is trained on the remaining folds, and all folds are cross-validated (with results averaged and models consolidated) to estimate final model performance. Note that some sources advise against using a single split, as it can lead to overfitting as well as biased model performance estimates. For this reason, data sets are split into three partitions: training, validation and test data sets. The standard machine learning practice is to train on the training set and tune hyperparameters using the validation set, where the validation process selects the model with the lowest validation loss, which is then tested on the test data set (normally held out) to assess the final model. The holdout method for the test set reduces computation by avoiding using the test set after each epoch. The test data set should never be used for validating the training model or fine-tuning hyperparameters, as it provides an accurate and honest evaluation of the model's final performance on unseen dat

    Read more →
  • Promoter based genetic algorithm

    Promoter based genetic algorithm

    The promoter based genetic algorithm (PBGA) is a genetic algorithm for neuroevolution developed by F. Bellas and R.J. Duro in the Integrated Group for Engineering Research (GII) at the University of Coruña, in Spain. It evolves variable size feedforward artificial neural networks (ANN) that are encoded into sequences of genes for constructing a basic ANN unit. Each of these blocks is preceded by a gene promoter acting as an on/off switch that determines if that particular unit will be expressed or not. == PBGA basics == The basic unit in the PBGA is a neuron with all of its inbound connections as represented in the following figure: The genotype of a basic unit is a set of real valued weights followed by the parameters of the neuron and proceeded by an integer valued field that determines the promoter gene value and, consequently, the expression of the unit. By concatenating units of this type we can construct the whole network. With this encoding it is imposed that the information that is not expressed is still carried by the genotype in evolution but it is shielded from direct selective pressure, maintaining this way the diversity in the population, which has been a design premise for this algorithm. Therefore, a clear difference is established between the search space and the solution space, permitting information learned and encoded into the genotypic representation to be preserved by disabling promoter genes. == Results == The PBGA was originally presented within the field of autonomous robotics, in particular in the real time learning of environment models of the robot. It has been used inside the Multilevel Darwinist Brain (MDB) cognitive mechanism developed in the GII for real robots on-line learning. In another paper it is shown how the application of the PBGA together with an external memory that stores the successful obtained world models, is an optimal strategy for adaptation in dynamic environments. Recently, the PBGA has provided results that outperform other neuroevolutionary algorithms in non-stationary problems, where the fitness function varies in time.

    Read more →
  • Information Harvesting

    Information Harvesting

    Information Harvesting (IH) was an early data mining product from the 1990s. It was invented by Ralphe Wiggins and produced by the Ryan Corp, later Information Harvesting Inc., of Cambridge, Massachusetts. Wiggins had a background in genetic algorithms and fuzzy logic. IH sought to infer rules from sets of data. It did this first by classifying various input variables into one of a number of bins, thereby putting some structure on the continuous variables in the input. IH then proceeds to generate rules, trading off generalization against memorization, that will infer the value of the prediction variable, possibly creating many levels of rules in the process. It included strategies for checking if overfitting took place and, if so, correcting for it. Because of its strategies for correcting for overfitting by considering more data, and refining the rules based on that data, IH might also be considered to be a form of machine learning. The advantage of IH, as compared with other data mining products of its time and even later, was that it provided a mechanism for finding multiple rules that would classify the data and determining, according to set criteria, the best rules to use.

    Read more →
  • Data Science and Predictive Analytics

    Data Science and Predictive Analytics

    The first edition of the textbook Data Science and Predictive Analytics: Biomedical and Health Applications using R, authored by Ivo D. Dinov, was published in August 2018 by Springer. The second edition of the book was printed in 2023. This textbook covers some of the core mathematical foundations, computational techniques, and artificial intelligence approaches used in data science research and applications. By using the statistical computing platform R and a broad range of biomedical case-studies, the 23 chapters of the book first edition provide explicit examples of importing, exporting, processing, modeling, visualizing, and interpreting large, multivariate, incomplete, heterogeneous, longitudinal, and incomplete datasets (big data). == Structure == === First edition table of contents === The first edition of the Data Science and Predictive Analytics (DSPA) textbook is divided into the following 23 chapters, each progressively building on the previous content. === Second edition table of contents === The significantly reorganized revised edition of the book (2023) expands and modernizes the presented mathematical principles, computational methods, data science techniques, model-based machine learning and model-free artificial intelligence algorithms. The 14 chapters of the new edition start with an introduction and progressively build foundational skills to naturally reach biomedical applications of deep learning. Introduction Basic Visualization and Exploratory Data Analytics Linear Algebra, Matrix Computing, and Regression Modeling Linear and Nonlinear Dimensionality Reduction Supervised Classification Black Box Machine Learning Methods Qualitative Learning Methods—Text Mining, Natural Language Processing, and Apriori Association Rules Learning Unsupervised Clustering Model Performance Assessment, Validation, and Improvement Specialized Machine Learning Topics Variable Importance and Feature Selection Big Longitudinal Data Analysis Function Optimization Deep Learning, Neural Networks == Reception == The materials in the Data Science and Predictive Analytics (DSPA) textbook have been peer-reviewed in the Journal of the American Statistical Association, International Statistical Institute’s ISI Review Journal, and the Journal of the American Library Association. Many scholarly publications reference the DSPA textbook. As of January 17, 2021, the electronic version of the book first edition (ISBN 978-3-319-72347-1) is freely available on SpringerLink and has been downloaded over 6 million times. The textbook is globally available in print (hardcover and softcover) and electronic formats (PDF and EPub) in many college and university libraries and has been used for data science, computational statistics, and analytics classes at various institutions.

    Read more →
  • Operational taxonomic unit

    Operational taxonomic unit

    An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "operational taxonomic unit" is simply the group of organisms currently being studied. In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy. Nowadays, however, the term is commonly used in a different context and refers to clusters of (uncultivated or unknown) organisms, grouped by DNA sequence similarity of a specific taxonomic marker gene (originally coined as mOTU; molecular OTU). In other words, OTUs are pragmatic proxies for "species" at different taxonomic levels, in the absence of traditional systems of biological classification as are available for macroscopic organisms. For several years, OTUs have been the most commonly used units of diversity, especially when analysing small subunit 16S (for prokaryotes) or 18S rRNA (for eukaryotes) marker gene sequence datasets. == Molecular OTU by clustering of marker gene sequences == In the approach represented by DNA barcoding, a particular locus is chosen to be used as the marker gene for classification. This locus should be universally present in the scope selected, variable enough to be different among close-related species, and be flanked by conservative sequences that allow for easy amplification and detection. There are databases containing sequences for such marker genes from many different species, allowing for comparison. (Sometimes only using one locus does not provide sufficient resolution, so multiple marker genes are used. This is the case for plants, where rbcL+matK is common.) Sequences obtained this way can be clustered according to their similarity to one another, and operational taxonomic units are defined based on the similarity threshold set by the researcher. The exact threshold depends on the taxa in question and the mutational rates of the selected locus in the taxon. 97–99% are commonly used, but "it is now recognized to be somewhat arbitrary as sequence variation within and among species varies across taxa". 100% similarity (fully identical) is also common, also known as single variants. It remains debatable how well this commonly used method recapitulates true microbial species phylogeny or ecology. Although OTUs can be calculated differently when using different algorithms or thresholds, research by Schmidt et al. (2014) demonstrated that 16S-derived microbial OTUs were generally ecologically consistent across habitats and several clustering approaches. The number of OTUs defined may be inflated due to errors in DNA sequencing. === OTU clustering approaches === There are three main approaches to clustering OTUs: De novo, for which the clustering is based on similarities between sequencing reads. Closed-reference, for which the clustering is performed against a reference database of sequences. Open-reference, where clustering is first performed against a reference database of sequences, then any remaining sequences that could not be mapped to the reference are clustered de novo. Using a reference provides taxonomic context for the OTUs found. Alternatively, taxonomic context can be found after the construction of clusters by comparing representative sequences from clusters against a reference database. There are also specialized classifiers for this purpose which are much faster than naive comparison using BLAST. === OTU clustering algorithms === Hierarchical clustering algorithms (HCA): uclust & cd-hit & ESPRIT Bayesian clustering: CROP == Molecular OTU by other methods == In addition to similarity-based grouping, marker gene sequences can be sorted into OTUs using molecular phylogeny, k-mer composition, or hybrid methods combining these methods with similarity. There are also Bayesian tree-less methods and machine learning approaches. Using phylogeny often involves manually assigning terminal clades or single nodes to an OTU, so this is usually only done for refinement. Genome skimming can be used to obtain high-copy DNA without the need to choose marker genes or to design PCR primers for the chosen genes. It can provide fairly good coverage of organelle DNA and repetitive elements such as ribosomal DNA, both of which can be used like marker genes in OTU analysis. Whole-genome sequencing is more expensive and involves the production and processing of more data. By considering the entire genome, many (sometimes over 100) marker genes can be used at the same time, producing highly resolved phylogenies that correctly identify problematic taxa. It is also possible to use entire genomes for OTU assignment. For example, genomes from different bacterial species almost always have an average nucleotide identity lower than 95%, a fact that can be used to define new OTUs (and likely new species).

    Read more →
  • Optical character recognition

    Optical character recognition

    Optical character recognition (OCR) or optical character reader is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast). Widely used as a form of data entry from printed paper data records – whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printed data, or any suitable documentation – it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, and with support for a variety of image file format inputs. Some systems are capable of reproducing formatted output that closely approximates the original page including images, columns, and other non-textual components. == History == Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind. In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that when moved across a printed page, produced tones that corresponded to specific letters or characters. In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted US Patent number 1,838,389 for the invention. The patent was acquired by IBM. === Visually impaired users === In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s.) Kurzweil used the technology to create a reading machine for blind people to have a computer read text to them out loud. The device included a CCD-type flatbed scanner and a text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which eventually spun it off as Scansoft, which merged with Nuance Communications. In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone. With the advent of smartphones and smartglasses, OCR can be used in internet connected mobile device applications that extract text captured using the device's camera. These devices that do not have built-in OCR functionality will typically use an OCR API to extract the text from the image file captured by the device. The OCR API returns the extracted text, along with information about the location of the detected text in the original image back to the device app for further processing (such as text-to-speech) or display. Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters. == Applications == OCR engines have been developed into software applications specializing in various subjects such as receipts, invoices, checks, and legal billing documents. The software can be used for: Entering data for business documents, e.g. checks, passports, invoices, bank statements and receipts Automatic number-plate recognition Passport recognition and information extraction in airports Automatically extracting key information from insurance documents Traffic-sign recognition Extracting business card information into a contact list Creating textual versions of printed documents, e.g. book scanning for Project Gutenberg Making electronic images of printed documents searchable, e.g. Google Books Converting handwriting in real-time to control a computer (pen computing) Defeating or testing the robustness of CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR. Assistive technology for blind and visually impaired users Writing instructions for vehicles by identifying CAD images in a database that are appropriate to the vehicle design as it changes in real time Making scanned documents searchable by converting them to PDFs == Types == Optical character recognition (OCR) – targets typewritten text, one glyph or character at a time. Optical word recognition – targets typewritten text, one word at a time (for languages that use a space as a word divider). Usually just called "OCR". Intelligent character recognition (ICR) – also targets handwritten printscript or cursive text one glyph or character at a time, usually involving machine learning. Intelligent word recognition (IWR) – also targets handwritten printscript or cursive text, one word at a time. This is especially useful for languages where glyphs are not separated in cursive script. OCR is generally an offline process, which analyses a static document. There are cloud based services which provide an online OCR API service. Handwriting movement analysis can be used as input to handwriting recognition. Instead of merely using the shapes of glyphs and words, this technique is able to capture motion, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the process more accurate. This technology is also known as "online character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition". == Techniques == === Pre-processing === OCR software often pre-processes images to improve the chances of successful recognition. Techniques include: De-skewing – if the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical. Despeckling – removal of positive and negative spots, smoothing edges Binarization – conversion of an image from color or greyscale to black-and-white (called a binary image because there are two colors). The task is performed as a simple way of separating the text (or any other desired image component) from the background. The task of binarization is necessary since most commercial recognition algorithms work only on binary images, as it is simpler to do so. In addition, the effectiveness of binarization influences to a significant extent the quality of character recognition, and careful decisions are made in the choice of the binarization employed for a given input image type; since the quality of the method used to obtain the binary result depends on the type of image (scanned document, scene text image, degraded historical document, etc.). Line removal – Cleaning up non-glyph boxes and lines Layout analysis or zoning – Identification of columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables. Line and word detection – Establishment of a baseline for word and character shapes, separating words as necessary. Script recognition – In multilingual documents, the script may change at the level of the words and hence, identification of the script is necessary, before the right OCR can be invoked to handle the specific script. Character isolation or segmentation – For per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected. Normalization of aspect ratio and scale Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace bet

    Read more →