AI Chat Free No Limit

AI Chat Free No Limit — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • LRE Map

    LRE Map

    The LRE Map (Language Resources and Evaluation) is a freely accessible large database on resources dedicated to Natural language processing. The original feature of LRE Map is that the records are collected during the submission of different major Natural language processing conferences. The records are then cleaned and gathered into a global database called "LRE Map". The LRE Map is intended to be an instrument for collecting information about language resources and to become, at the same time, a community for users, a place to share and discover resources, discuss opinions, provide feedback, discover new trends, etc. It is an instrument for discovering, searching and documenting language resources, here intended in a broad sense, as both data and tools. The large amount of information contained in the Map can be analyzed in many different ways. For instance, the LRE Map can provide information about the most frequent type of resource, the most represented language, the applications for which resources are used or are being developed, the proportion of new resources vs. already existing ones, or the way in which resources are distributed to the community. == Context == Several institutions worldwide maintain catalogues of language resources (ELRA, LDC, NICT Universal Catalogue, ACL Data and Code Repository, OLAC, LT World, etc.) However, it has been estimated that only 10% of existing resources are known, either through distribution catalogues or via direct publicity by providers (web sites and the like). The rest remains hidden, the only occasions where it briefly emerges being when a resource is presented in the context of a research paper or report at some conference. Even in this case, nevertheless, it might be that a resource remains in the background simply because the focus of the research is not on the resource per se. == History == The LRE Map originated under the name "LREC Map" during the preparation of LREC 2010 conference. More specifically, the idea was discussed within the FlaReNet project, and in collaboration with ELRA and the Institute of Computational Linguistics of CNR in Pisa, the Map was put in place at LREC 2010. The LREC organizers asked the authors to provide some basic information about all the resources (in a broad sense, i.e. including tools, standards and evaluation packages), either used or created, described in their papers. All these descriptors were then gathered in a global matrix called the LREC Map. The same methodology and requirements from the authors has been then applied and extended to other conferences, namely COLING-2010, EMNLP-2010, RANLP-2011, LREC 2012, LREC 2014 and LREC 2016. After this generalization to other conferences, the LREC Map has been renamed as the LRE Map. == Size and content == The size of the database increases over time. The data collected amount to 4776 entries. Each resource is described according to the following attributes: Resource type, e.g. lexicon, annotation tool, tagger/parser. Resource production status, e.g. newly created finished, existing-updated. Resource availability, e.g. freely available, from data center. Resource modality, e.g. speech, written, sign language. Resource use, e.g. named entity recognition, language identification, machine translation. Resource language, e.g. English, 23 European Union languages, official languages of India. == Uses == The LRE map is a very important tool to chart the NLP field. Compared to other studied based on subjective scorings, the LRE map is made of real facts. The map has a great potential for many uses, in addition to being an information gathering tool: It is a great instrument for monitoring the evolution of the field (useful for funders), if applied in different contexts and times. It can be seen as a huge joint effort, the beginning of an even larger cooperative action not just among few leaders but among all the researchers. It is also an "educational" means towards the broad acknowledgment of the need of meta-research activities with the active involvement of many. It is also instrumental in introducing the new notion of "citation of resources" that could provide an award and a means of scholarly recognition for researchers engaged in resource creation. It is used to help the organization of the conferences of the field like LREC. == Derived matrices == The data were then cleaned and sorted by Joseph Mariani (CNRS-LIMSI IMMI) and Gil Francopoulo (CNRS-LIMSI IMMI + Tagmatica) in order to compute the various matrices of the final FLaReNet reports. One of them, the matrix for written data at LREC 2010 is as follows: English is the most studied language. Secondly, come French and German languages and then Italian and Spanish. == Future == The LRE Map has been extended to Language Resources and Evaluation Journal and other conferences.

    Read more →
  • Yooreeka

    Yooreeka

    Yooreeka is a library for data mining, machine learning, soft computing, and mathematical analysis. The project started with the code of the book "Algorithms of the Intelligent Web". Although the term "Web" prevailed in the title, in essence, the algorithms are valuable in any software application. It covers all major algorithms and provides many examples. Yooreeka 2.x is licensed under the Apache License rather than the somewhat more restrictive LGPL (which was the license of v1.x). The library is written 100% in the Java language. == Algorithms == The following algorithms are covered: Clustering Hierarchical—Agglomerative (e.g. MST single link; ROCK) and Divisive Partitional (e.g. k-means) Classification Bayesian Decision trees Neural Networks Rule based (via Drools) Recommendations Collaborative filtering Content based Search PageRank DocRank Personalization

    Read more →
  • Influence diagram

    Influence diagram

    An influence diagram (ID) (also called a relevance diagram, decision diagram or a decision network) is a compact graphical and mathematical representation of a decision situation. It is a generalization of a Bayesian network, in which not only probabilistic inference problems but also decision making problems (following the maximum expected utility criterion) can be modeled and solved. ID was first developed in the mid-1970s by decision analysts with an intuitive semantic that is easy to understand. It is now adopted widely and becoming an alternative to the decision tree which typically suffers from exponential growth in number of branches with each variable modeled. ID is directly applicable in team decision analysis, since it allows incomplete sharing of information among team members to be modeled and solved explicitly. Extensions of ID also find their use in game theory as an alternative representation of the game tree. == Semantics == An ID is a directed acyclic graph with three types (plus one subtype) of node and three types of arc (or arrow) between nodes. Nodes: Decision node (corresponding to each decision to be made) is drawn as a rectangle. Uncertainty node (corresponding to each uncertainty to be modeled) is drawn as an oval. Deterministic node (corresponding to special kind of uncertainty that its outcome is deterministically known whenever the outcome of some other uncertainties are also known) is drawn as a double oval. Value node (corresponding to each component of additively separable Von Neumann-Morgenstern utility function) is drawn as an octagon (or diamond). Arcs: Functional arcs (ending in value node) indicate that one of the components of additively separable utility function is a function of all the nodes at their tails. Conditional arcs (ending in uncertainty node) indicate that the uncertainty at their heads is probabilistically conditioned on all the nodes at their tails. Conditional arcs (ending in deterministic node) indicate that the uncertainty at their heads is deterministically conditioned on all the nodes at their tails. Informational arcs (ending in decision node) indicate that the decision at their heads is made with the outcome of all the nodes at their tails known beforehand. Given a properly structured ID: Decision nodes and incoming information arcs collectively state the alternatives (what can be done when the outcome of certain decisions and/or uncertainties are known beforehand) Uncertainty/deterministic nodes and incoming conditional arcs collectively model the information (what are known and their probabilistic/deterministic relationships) Value nodes and incoming functional arcs collectively quantify the preference (how things are preferred over one another). Alternative, information, and preference are termed decision basis in decision analysis, they represent three required components of any valid decision situation. Formally, the semantic of influence diagram is based on sequential construction of nodes and arcs, which implies a specification of all conditional independencies in the diagram. The specification is defined by the d {\displaystyle d} -separation criterion of Bayesian network. According to this semantic, every node is probabilistically independent on its non-successor nodes given the outcome of its immediate predecessor nodes. Likewise, a missing arc between non-value node X {\displaystyle X} and non-value node Y {\displaystyle Y} implies that there exists a set of non-value nodes Z {\displaystyle Z} , e.g., the parents of Y {\displaystyle Y} , that renders Y {\displaystyle Y} independent of X {\displaystyle X} given the outcome of the nodes in Z {\displaystyle Z} . == Example == Consider the simple influence diagram representing a situation where a decision-maker is planning their vacation. There is 1 decision node (Vacation Activity), 2 uncertainty nodes (Weather Condition, Weather Forecast), and 1 value node (Satisfaction). There are 2 functional arcs (ending in Satisfaction), 1 conditional arc (ending in Weather Forecast), and 1 informational arc (ending in Vacation Activity). Functional arcs ending in Satisfaction indicate that Satisfaction is a utility function of Weather Condition and Vacation Activity. In other words, their satisfaction can be quantified if they know what the weather is like and what their choice of activity is. (Note that they do not value Weather Forecast directly) Conditional arc ending in Weather Forecast indicates their belief that Weather Forecast and Weather Condition can be dependent. Informational arc ending in Vacation Activity indicates that they will only know Weather Forecast, not Weather Condition, when making their choice. In other words, actual weather will be known after they make their choice, and only forecast is what they can count on at this stage. It also follows semantically, for example, that Vacation Activity is independent on (irrelevant to) Weather Condition given Weather Forecast is known. == Applicability to value of information == The above example highlights the power of the influence diagram in representing an extremely important concept in decision analysis known as the value of information. Consider the following three scenarios; Scenario 1: The decision-maker could make their Vacation Activity decision while knowing what Weather Condition will be like. This corresponds to adding extra informational arc from Weather Condition to Vacation Activity in the above influence diagram. Scenario 2: The original influence diagram as shown above. Scenario 3: The decision-maker makes their decision without even knowing the Weather Forecast. This corresponds to removing informational arc from Weather Forecast to Vacation Activity in the above influence diagram. Scenario 1 is the best possible scenario for this decision situation since there is no longer any uncertainty on what they care about (Weather Condition) when making their decision. Scenario 3, however, is the worst possible scenario for this decision situation since they need to make their decision without any hint (Weather Forecast) on what they care about (Weather Condition) will turn out to be. The decision-maker is usually better off (definitely no worse off, on average) to move from scenario 3 to scenario 2 through the acquisition of new information. The most they should be willing to pay for such move is called the value of information on Weather Forecast, which is essentially the value of imperfect information on Weather Condition. The applicability of this simple ID and the value of information concept is tremendous, especially in medical decision making when most decisions have to be made with imperfect information about their patients, diseases, etc. == Related concepts == Influence diagrams are hierarchical and can be defined either in terms of their structure or in greater detail in terms of the functional and numerical relation between diagram elements. An ID that is consistently defined at all levels—structure, function, and number—is a well-defined mathematical representation and is referred to as a well-formed influence diagram (WFID). WFIDs can be evaluated using reversal and removal operations to yield answers to a large class of probabilistic, inferential, and decision questions. More recent techniques have been developed by artificial intelligence researchers concerning Bayesian network inference (belief propagation). An influence diagram having only uncertainty nodes (i.e., a Bayesian network) is also called a relevance diagram. An arc connecting node A to B implies not only that "A is relevant to B", but also that "B is relevant to A" (i.e., relevance is a symmetric relationship).

    Read more →
  • Radial basis function network

    Radial basis function network

    In the field of mathematical modeling, a radial basis function network is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters. Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control. They were first formulated in a 1988 paper by Broomhead and Lowe, both researchers at the Royal Signals and Radar Establishment. == Network architecture == Radial basis function (RBF) networks typically have three layers: an input layer, a hidden layer with a non-linear RBF activation function and a linear output layer. The input can be modeled as a vector of real numbers x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} . The output of the network is then a scalar function of the input vector, φ : R n → R {\displaystyle \varphi :\mathbb {R} ^{n}\to \mathbb {R} } , and is given by φ ( x ) = ∑ i = 1 N a i ρ ( | | x − c i | | ) {\displaystyle \varphi (\mathbf {x} )=\sum _{i=1}^{N}a_{i}\rho (||\mathbf {x} -\mathbf {c} _{i}||)} where N {\displaystyle N} is the number of neurons in the hidden layer, c i {\displaystyle \mathbf {c} _{i}} is the center vector for neuron i {\displaystyle i} , and a i {\displaystyle a_{i}} is the weight of neuron i {\displaystyle i} in the linear output neuron. Functions that depend only on the distance from a center vector are radially symmetric about that vector, hence the name radial basis function. In the basic form, all inputs are connected to each hidden neuron. The norm is typically taken to be the Euclidean distance (although the Mahalanobis distance appears to perform better with pattern recognition) and the radial basis function is commonly taken to be Gaussian ρ ( ‖ x − c i ‖ ) = exp ⁡ [ − β i ‖ x − c i ‖ 2 ] {\displaystyle \rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}=\exp \left[-\beta _{i}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert ^{2}\right]} . The Gaussian basis functions are local to the center vector in the sense that lim | | x | | → ∞ ρ ( ‖ x − c i ‖ ) = 0 {\displaystyle \lim _{||x||\to \infty }\rho (\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert )=0} i.e. changing parameters of one neuron has only a small effect for input values that are far away from the center of that neuron. Given certain mild conditions on the shape of the activation function, RBF networks are universal approximators on a compact subset of R n {\displaystyle \mathbb {R} ^{n}} . This means that an RBF network with enough hidden neurons can approximate any continuous function on a closed, bounded set with arbitrary precision. The parameters a i {\displaystyle a_{i}} , c i {\displaystyle \mathbf {c} _{i}} , and β i {\displaystyle \beta _{i}} are determined in a manner that optimizes the fit between φ {\displaystyle \varphi } and the data. === Normalization === ==== Normalized architecture ==== In addition to the above unnormalized architecture, RBF networks can be normalized. In this case the mapping is φ ( x ) = d e f ∑ i = 1 N a i ρ ( ‖ x − c i ‖ ) ∑ i = 1 N ρ ( ‖ x − c i ‖ ) = ∑ i = 1 N a i u ( ‖ x − c i ‖ ) {\displaystyle \varphi (\mathbf {x} )\ {\stackrel {\mathrm {def} }{=}}\ {\frac {\sum _{i=1}^{N}a_{i}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{i=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}}=\sum _{i=1}^{N}a_{i}u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} where u ( ‖ x − c i ‖ ) = d e f ρ ( ‖ x − c i ‖ ) ∑ j = 1 N ρ ( ‖ x − c j ‖ ) {\displaystyle u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\ {\stackrel {\mathrm {def} }{=}}\ {\frac {\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{j=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{j}\right\Vert {\big )}}}} is known as a normalized radial basis function. ==== Theoretical motivation for normalization ==== There is theoretical justification for this architecture in the case of stochastic data flow. Assume a stochastic kernel approximation for the joint probability density P ( x ∧ y ) = 1 N ∑ i = 1 N ρ ( ‖ x − c i ‖ ) σ ( | y − e i | ) {\displaystyle P\left(\mathbf {x} \land y\right)={1 \over N}\sum _{i=1}^{N}\,\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\,\sigma {\big (}\left\vert y-e_{i}\right\vert {\big )}} where the weights c i {\displaystyle \mathbf {c} _{i}} and e i {\displaystyle e_{i}} are exemplars from the data and we require the kernels to be normalized ∫ ρ ( ‖ x − c i ‖ ) d n x = 1 {\displaystyle \int \rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\,d^{n}\mathbf {x} =1} and ∫ σ ( | y − e i | ) d y = 1 {\displaystyle \int \sigma {\big (}\left\vert y-e_{i}\right\vert {\big )}\,dy=1} . The probability densities in the input and output spaces are P ( x ) = ∫ P ( x ∧ y ) d y = 1 N ∑ i = 1 N ρ ( ‖ x − c i ‖ ) {\displaystyle P\left(\mathbf {x} \right)=\int P\left(\mathbf {x} \land y\right)\,dy={1 \over N}\sum _{i=1}^{N}\,\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} and The expectation of y given an input x {\displaystyle \mathbf {x} } is φ ( x ) = d e f E ( y ∣ x ) = ∫ y P ( y ∣ x ) d y {\displaystyle \varphi \left(\mathbf {x} \right)\ {\stackrel {\mathrm {def} }{=}}\ E\left(y\mid \mathbf {x} \right)=\int y\,P\left(y\mid \mathbf {x} \right)dy} where P ( y ∣ x ) {\displaystyle P\left(y\mid \mathbf {x} \right)} is the conditional probability of y given x {\displaystyle \mathbf {x} } . The conditional probability is related to the joint probability through Bayes' theorem P ( y ∣ x ) = P ( x ∧ y ) P ( x ) {\displaystyle P\left(y\mid \mathbf {x} \right)={\frac {P\left(\mathbf {x} \land y\right)}{P\left(\mathbf {x} \right)}}} which yields φ ( x ) = ∫ y P ( x ∧ y ) P ( x ) d y {\displaystyle \varphi \left(\mathbf {x} \right)=\int y\,{\frac {P\left(\mathbf {x} \land y\right)}{P\left(\mathbf {x} \right)}}\,dy} . This becomes φ ( x ) = ∑ i = 1 N e i ρ ( ‖ x − c i ‖ ) ∑ i = 1 N ρ ( ‖ x − c i ‖ ) = ∑ i = 1 N e i u ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)={\frac {\sum _{i=1}^{N}e_{i}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{i=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}}=\sum _{i=1}^{N}e_{i}u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} when the integrations are performed. === Local linear models === It is sometimes convenient to expand the architecture to include local linear models. In that case the architectures become, to first order, φ ( x ) = ∑ i = 1 N ( a i + b i ⋅ ( x − c i ) ) ρ ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{N}\left(a_{i}+\mathbf {b} _{i}\cdot \left(\mathbf {x} -\mathbf {c} _{i}\right)\right)\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} and φ ( x ) = ∑ i = 1 N ( a i + b i ⋅ ( x − c i ) ) u ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{N}\left(a_{i}+\mathbf {b} _{i}\cdot \left(\mathbf {x} -\mathbf {c} _{i}\right)\right)u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} in the unnormalized and normalized cases, respectively. Here b i {\displaystyle \mathbf {b} _{i}} are weights to be determined. Higher order linear terms are also possible. This result can be written φ ( x ) = ∑ i = 1 2 N ∑ j = 1 n e i j v i j ( x − c i ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{2N}\sum _{j=1}^{n}e_{ij}v_{ij}{\big (}\mathbf {x} -\mathbf {c} _{i}{\big )}} where e i j = { a i , if i ∈ [ 1 , N ] b i j , if i ∈ [ N + 1 , 2 N ] {\displaystyle e_{ij}={\begin{cases}a_{i},&{\mbox{if }}i\in [1,N]\\b_{ij},&{\mbox{if }}i\in [N+1,2N]\end{cases}}} and v i j ( x − c i ) = d e f { δ i j ρ ( ‖ x − c i ‖ ) , if i ∈ [ 1 , N ] ( x i j − c i j ) ρ ( ‖ x − c i ‖ ) , if i ∈ [ N + 1 , 2 N ] {\displaystyle v_{ij}{\big (}\mathbf {x} -\mathbf {c} _{i}{\big )}\ {\stackrel {\mathrm {def} }{=}}\ {\begin{cases}\delta _{ij}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )},&{\mbox{if }}i\in [1,N]\\\left(x_{ij}-c_{ij}\right)\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )},&{\mbox{if }}i\in [N+1,2N]\end{cases}}} in the unnormalized case and in the normalized case. Here δ i j {\displaystyle \delta _{ij}} is a Kronecker delta function defined as δ i j = { 1 , if i = j 0 , if i ≠ j {\displaystyle \delta _{ij}={\begin{cases}1,&{\mbox{if }}i=j\\0,&{\mbox{if }}i\neq j\end{cases}}} . == Training == RBF networks are typically trained from pairs of input and target values x ( t ) , y ( t ) {\displaystyle \mathbf {x} (t),y(t)} , t = 1 , … , T {\displaystyle t=1,\dots ,T} by a two-step algorithm. In the first step, the center vectors c i {\displaystyle \mathbf {c} _{i}} of the RBF functions in the hidden layer

    Read more →
  • Zamzar

    Zamzar

    Zamzar is an online file converter and compressor, created by brothers Mike and Chris Whyley in England in 2006. It allows users to convert files online, without downloading a software tool, and supports over 1,200 different conversion types. Since its formation, the service has converted over 510 million files for users from 245 different countries. The service supports the conversion of documents, images, audio, video, e-Books, CAD files and compressed file formats. Users can type in a URL or upload one or more files (if they are all of the same format) from their computer; Zamzar will then convert the file(s) to another user-specified format, such as an Adobe PDF file to a Microsoft Word document. Once conversion is complete, users can immediately download the file from their web browser. Users can also choose to receive an email with a link to download the converted file. In February 2021 Zamzar expanded their tool and announced a new file compression service. The compressor is visually similar to the conversion tool with a drag and drop download feature. As with the converter, users have the option to subscribe for a paid plan if they wish to compress multiple or larger files than the free service permits == File conversion API == in 2015 Zamzar launched a file conversion API, allowing users to integrate file conversion capabilities into their own websites and applications. Sample code is provided to allow users to integrate file conversion capabilities in C#, Java, Node.js, PHP, Python and cURL. Zamzar also maintains a project on GitHub which allows users to perform file conversion from the command line on Linux, MacOS or Windows systems. == Email file conversion == It is also possible to send files for conversion by emailing them to Zamzar. Zamzar launched this capability in 2012, allowing users to email files to dedicated email addresses for the file to be automatically converted to a different format. A link is then emailed back to the end user to allow them to download their converted file. == User privilege levels == Zamzar is currently free to use, but there is a limit of two conversions per hour for files up to 100MB. Users can pay a monthly subscription in order to access preferential features, such as unlimited file conversions, online file management, shorter response and queuing times and other benefits. == Name == Its name comes from Franz Kafka's The Metamorphosis. Its main character is called Gregor Samsa and it is from his surname that Zamzar is derived. The founders of the service considered three other names – Konvertieren, Khamailen and Obrogo – before settling on Zamzar.

    Read more →
  • AdaBoost

    AdaBoost

    AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values. AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown to also effectively combine strong base learners (such as deeper decision trees), producing an even more accurate model. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples. == Training == AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form F T ( x ) = ∑ t = 1 T f t ( x ) {\displaystyle F_{T}(x)=\sum _{t=1}^{T}f_{t}(x)} where each f t {\displaystyle f_{t}} is a weak learner that takes an object x {\displaystyle x} as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Each weak learner produces an output hypothesis h {\displaystyle h} which fixes a prediction h ( x i ) {\displaystyle h(x_{i})} for each sample in the training set. At each iteration t {\displaystyle t} , a weak learner is selected and assigned a coefficient α t {\displaystyle \alpha _{t}} such that the total training error E t {\displaystyle E_{t}} of the resulting t {\displaystyle t} -stage boosted classifier is minimized. E t = ∑ i E [ F t − 1 ( x i ) + α t h ( x i ) ] {\displaystyle E_{t}=\sum _{i}E[F_{t-1}(x_{i})+\alpha _{t}h(x_{i})]} Here F t − 1 ( x ) {\displaystyle F_{t-1}(x)} is the boosted classifier that has been built up to the previous stage of training and f t ( x ) = α t h ( x ) {\displaystyle f_{t}(x)=\alpha _{t}h(x)} is the weak learner that is being considered for addition to the final classifier. === Weighting === At each iteration of the training process, a weight w i , t {\displaystyle w_{i,t}} is assigned to each sample in the training set equal to the current error E ( F t − 1 ( x i ) ) {\displaystyle E(F_{t-1}(x_{i}))} on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights. == Derivation == This derivation follows Rojas (2009): Suppose we have a data set { ( x 1 , y 1 ) , … , ( x N , y N ) } {\displaystyle \{(x_{1},y_{1}),\ldots ,(x_{N},y_{N})\}} where each item x i {\displaystyle x_{i}} has an associated class y i ∈ { − 1 , 1 } {\displaystyle y_{i}\in \{-1,1\}} , and a set of weak classifiers { k 1 , … , k L } {\displaystyle \{k_{1},\ldots ,k_{L}\}} each of which outputs a classification k j ( x i ) ∈ { − 1 , 1 } {\displaystyle k_{j}(x_{i})\in \{-1,1\}} for each item. After the ( m − 1 ) {\displaystyle (m-1)} -th iteration our boosted classifier is a linear combination of the weak classifiers of the form: C ( m − 1 ) ( x i ) = α 1 k 1 ( x i ) + ⋯ + α m − 1 k m − 1 ( x i ) , {\displaystyle C_{(m-1)}(x_{i})=\alpha _{1}k_{1}(x_{i})+\cdots +\alpha _{m-1}k_{m-1}(x_{i}),} where the class will be the sign of C ( m − 1 ) ( x i ) {\displaystyle C_{(m-1)}(x_{i})} . At the m {\displaystyle m} -th iteration we want to extend this to a better boosted classifier by adding another weak classifier k m {\displaystyle k_{m}} , with another weight α m {\displaystyle \alpha _{m}} : C m ( x i ) = C ( m − 1 ) ( x i ) + α m k m ( x i ) {\displaystyle C_{m}(x_{i})=C_{(m-1)}(x_{i})+\alpha _{m}k_{m}(x_{i})} So it remains to determine which weak classifier is the best choice for k m {\displaystyle k_{m}} , and what its weight α m {\displaystyle \alpha _{m}} should be. We define the total error E {\displaystyle E} of C m {\displaystyle C_{m}} as the sum of its exponential loss on each data point, given as follows: E = ∑ i = 1 N e − y i C m ( x i ) = ∑ i = 1 N e − y i C ( m − 1 ) ( x i ) e − y i α m k m ( x i ) {\displaystyle E=\sum _{i=1}^{N}e^{-y_{i}C_{m}(x_{i})}=\sum _{i=1}^{N}e^{-y_{i}C_{(m-1)}(x_{i})}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}} Letting w i ( 1 ) = 1 {\displaystyle w_{i}^{(1)}=1} and w i ( m ) = e − y i C m − 1 ( x i ) {\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{m-1}(x_{i})}} for m > 1 {\displaystyle m>1} , we have: E = ∑ i = 1 N w i ( m ) e − y i α m k m ( x i ) {\displaystyle E=\sum _{i=1}^{N}w_{i}^{(m)}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}} We can split this summation between those data points that are correctly classified by k m {\displaystyle k_{m}} (so y i k m ( x i ) = 1 {\displaystyle y_{i}k_{m}(x_{i})=1} ) and those that are misclassified (so y i k m ( x i ) = − 1 {\displaystyle y_{i}k_{m}(x_{i})=-1} ): E = ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m = ∑ i = 1 N w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) ( e α m − e − α m ) {\displaystyle {\begin{aligned}E&=\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}}\\&=\sum _{i=1}^{N}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}\left(e^{\alpha _{m}}-e^{-\alpha _{m}}\right)\end{aligned}}} Since the only part of the right-hand side of this equation that depends on k m {\displaystyle k_{m}} is ∑ y i ≠ k m ( x i ) w i ( m ) {\textstyle \sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}} , we see that the k m {\displaystyle k_{m}} that minimizes E {\displaystyle E} is the one in the set { k 1 , … , k L } {\displaystyle \{k_{1},\ldots ,k_{L}\}} that minimizes ∑ y i ≠ k m ( x i ) w i ( m ) {\textstyle \sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}} [assuming that α m > 0 {\displaystyle \alpha _{m}>0} ], i.e. the weak classifier with the lowest weighted error (with weights w i ( m ) = e − y i C m − 1 ( x i ) {\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{m-1}(x_{i})}} ). To determine the desired weight α m {\displaystyle \alpha _{m}} that minimizes E {\displaystyle E} with the k m {\displaystyle k_{m}} that we just determined, we differentiate: d E d α m = d ( ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m ) d α m {\displaystyle {\frac {dE}{d\alpha _{m}}}={\frac {d(\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}})}{d\alpha _{m}}}} The value of α m {\displaystyle \alpha _{m}} that minimizes the above expression is: α m = 1 2 ln ⁡ ( ∑ y i = k m ( x i ) w i ( m ) ∑ y i ≠ k m ( x i ) w i ( m ) ) {\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}}\right)} We calculate the weighted error rate of the weak classifier to be ϵ m = ∑ y i ≠ k m ( x i ) w i ( m ) ∑ i = 1 N w i ( m ) {\displaystyle \epsilon _{m}={\frac {\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{i=1}^{N}w_{i}^{(m)}}}} , so it follows that: α m = 1 2 ln ⁡ ( 1 − ϵ m ϵ m ) {\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {1-\epsilon _{m}}{\epsilon _{m}}}\right)} which is the negative logit function multiplied by 0.5. Due to the convexity of E {\displaystyle E} as a function of α m {\displaystyle \alpha _{m}} , this new expression for α m {\displaystyle \alpha _{m}} gives the global minimum of the loss function. Note: This derivation only applies when k m ( x i ) ∈ { − 1 , 1 } {\displaystyle k_{m}(x_{i})\in \{-1,1\}} , though it can be a good starting guess in other cases, such as when the weak learner is biased ( k m ( x ) ∈ { a , b } , a ≠ − b {\displaystyle k_{m}(x)\in \{a,b\},a\neq -b} ), has multiple leaves ( k m ( x ) ∈ { a , b , … , n } {\displaystyle k_{m}(x)\in \{a,b,\dots ,n\}} ) or is some other function k m ( x ) ∈ R {\displaystyle k_{m}(x)\in \mathbb {R} } . Thus we have derived the AdaBoost algorithm: At each

    Read more →
  • NOMINATE (scaling method)

    NOMINATE (scaling method)

    NOMINATE (an acronym for nominal three-step estimation) is a multidimensional scaling application developed by US political scientists Keith T. Poole and Howard Rosenthal in the early 1980s to analyze preferential and choice data, such as legislative roll-call voting behavior. In its most well-known application, members of the US Congress are placed on a two-dimensional map, with politicians who are ideologically similar (i.e. who often vote the same) being close together. One of these two dimensions corresponds to the familiar left–right political spectrum (liberal–conservative in the United States). As computing capabilities grew, Poole and Rosenthal developed multiple iterations of their NOMINATE procedure: the original D-NOMINATE method, W-NOMINATE, and most recently DW-NOMINATE (for dynamic, weighted NOMINATE). In 2009, Poole and Rosenthal were the first recipients of the Society for Political Methodology's Best Statistical Software Award for their development of NOMINATE. In 2016, the society awarded Poole its Career Achievement Award, stating that "the modern study of the U.S. Congress would be simply unthinkable without NOMINATE legislative roll call voting scores." == Procedure == The main procedure is an application of multidimensional scaling techniques to political choice data. Though there are important technical differences between these types of NOMINATE scaling procedures, all operate under the same fundamental assumptions. First, that alternative choices can be projected on a basic, low-dimensional (often two-dimensional) Euclidean space. Second, within that space, individuals have utility functions which are bell-shaped (normally distributed), and maximized at their ideal point. Because individuals also have symmetric, single-peaked utility functions which center on their ideal point, ideal points represent individuals' most preferred outcomes. That is, individuals most desire outcomes closest their ideal point, and will choose/vote probabilistically for the closest outcome. Ideal points can be recovered from observing choices, with individuals exhibiting similar preferences placed more closely than those behaving dissimilarly. It is helpful to compare this procedure to producing maps based on driving distances between cities. For example, Los Angeles is about 1,800 miles from St. Louis; St. Louis is about 1,200 miles from Miami; and Miami is about 2,700 miles from Los Angeles. From this (dis)similarities data, any map of these three cities should place Miami far from Los Angeles, with St. Louis somewhere in between (though a bit closer to Miami than Los Angeles). Just as cities like Los Angeles and San Francisco would be clustered on a map, NOMINATE places ideologically similar legislators (e.g., liberal Senators Barbara Boxer (D-Calif.) and Al Franken (D-Minn.)) closer to each other, and farther from dissimilar legislators (e.g., conservative Senator Tom Coburn (R-Okla.)) based on the degree of agreement between their roll call voting records. At the heart of the NOMINATE procedures (and other multidimensional scaling methods, such as Poole's Optimal Classification method) are algorithms they utilize to arrange individuals and choices in low dimensional (usually two-dimensional) space. Thus, NOMINATE scores provide "maps" of legislatures. Using NOMINATE procedures to study congressional roll call voting behavior from the First Congress to the present-day, Poole and Rosenthal published Congress: A Political-Economic History of Roll Call Voting in 1997 and the revised edition Ideology and Congress in 2007. In 2009, Poole and Rosenthal were named the first recipients of the Society for Political Methodology's Best Statistical Software Award for their development of NOMINATE, a recognition conferred to "individual(s) for developing statistical software that makes a significant research contribution". In 2016, Keith T. Poole was awarded the Society for Political Methodology's Career Achievement Award. The citation for this award reads, in part, "One can say perfectly correctly, and without any hyperbole: the modern study of the U.S. Congress would be simply unthinkable without NOMINATE legislative roll call voting scores. NOMINATE has produced data that entire bodies of our discipline—and many in the press—have relied on to understand the U.S. Congress." == Dimensions == Poole and Rosenthal demonstrate that—despite the many complexities of congressional representation and politics—roll call voting in both the House and the Senate can be organized and explained by no more than two dimensions throughout the sweep of American history. The first dimension (horizontal or x-axis) is the familiar left-right (or liberal-conservative) spectrum on economic matters. The second dimension (vertical or y-axis) picks up attitudes on cross-cutting, salient issues of the day (which include or have included slavery, bimetallism, civil rights, regional, and social/lifestyle issues). Rosenthal and Poole have initially argued that the first dimension refers to socio-economic matters and the second dimension to race-relations. However, the often confusing and residual nature of the second dimension has led to the second dimension being largely ignored by other researchers. For the most part, congressional voting is uni-dimensional, with most of the variation in voting patterns explained by placement along the liberal-conservative first dimension. While the first dimension of the DW-NOMINATE score is able to predict results at 83% accuracy, the addition of the second dimension only increases accuracy to 85%. Furthermore, the second dimension only provided a significant increase in accuracy for Congresses 1-99. As late as the 1990s, the second dimension was able to measure partisan splits in abortion and gun rights issues. However, a 2017 analysis found that since 1987, the votes of the US Congress had best fit a one-dimensional model, suggesting increasing party polarization after 1987. == Interpretation of nominate scores == For illustrative purposes, consider the following plots which use W-NOMINATE scores to scale members of Congress and uses the probabilistic voting model (in which legislators farther from the "cutting line" between "yea" and "nay" outcomes become more likely to vote in the predicted manner) to illustrate some major Congressional votes in the 1990s. Some of these votes, like the House's vote on President Clinton's welfare reform package (the Personal Responsibility and Work Opportunity Act of 1996) are best modeled through the use of the first (economic liberal-conservative) dimension. On the welfare reform vote, nearly all Republicans joined the moderate-conservative bloc of House Democrats in voting for the bill, while opposition was virtually confined to the most liberal Democrats in the House. The errors (those representatives on the "wrong" side of the cutting line which separates predicted "yeas" and predicted "nays") are generally close to the cutting line, which is what we would expect. A legislator directly on the cutting line is indifferent between voting "yea" and "nay" on the measure. All members are shown on the left panel of the plot, while only errors are shown on the right panel: Economic ideology also dominates the Senate vote on the Balanced Budget Amendment of 1995: On other votes, however, a second dimension (which has recently come to represent attitudes on cultural and lifestyle issues) is important. For example, roll call votes on gun control routinely split party coalitions, with socially conservative "blue dog" Democrats joining most Republicans in opposing additional regulation and socially liberal Republicans joining most Democrats in supporting gun control. The addition of the second dimension accounts for these inter-party differences, and the cutting line is more horizontal than vertical (meaning the cleavage is found on the second dimension rather than the first dimension on these votes) This pattern was evident in the 1991 House vote to require waiting periods on handguns: == Political ideology == DW-NOMINATE scores have been used widely to describe the political ideology of political actors, political parties and political institutions. For instance, a score in the first dimension that is close to either pole means that such score is located at one of the extremes in the liberal-conservative scale. So, a score closer to 1 is described as conservative whereas a score closer to −1 can be described as liberal. Finally, a score at zero or close to zero is described as moderate. == Political polarization == Poole and Rosenthal (beginning with their 1984 article "The Polarization of American Politics") have also used NOMINATE data to show that, since the 1970s, party delegations in Congress have become ideologically homogeneous and distant from one another (a phenomenon known as "polarization"). Using DW-NOMINATE scores (which permit direct comparisons between members of different Congress

    Read more →
  • Gaussian process emulator

    Gaussian process emulator

    In statistics, Gaussian process emulator is one name for a general type of statistical model that has been used in contexts where the problem is to make maximum use of the outputs of a complicated (often non-random) computer-based simulation model. Each run of the simulation model is computationally expensive and each run is based on many different controlling inputs. The variation of the outputs of the simulation model is expected to vary reasonably smoothly with the inputs, but in an unknown way. The overall analysis involves two models: the simulation model, or "simulator", and the statistical model, or "emulator", which notionally emulates the unknown outputs from the simulator. The Gaussian process emulator model treats the problem from the viewpoint of Bayesian statistics. In this approach, even though the output of the simulation model is fixed for any given set of inputs, the actual outputs are unknown unless the computer model is run and hence can be made the subject of a Bayesian analysis. The main element of the Gaussian process emulator model is that it models the outputs as a Gaussian process on a space that is defined by the model inputs. The model includes a description of the correlation or covariance of the outputs, which enables the model to encompass the idea that differences in the output will be small if there are only small differences in the inputs.

    Read more →
  • Organoid intelligence

    Organoid intelligence

    Organoid intelligence (OI) is an emerging field of study in computer science and biology that develops and studies biological wetware computing using 3D cultures of human brain cells (or brain organoids) and brain-machine interface technologies. Such technologies may be referred to as OIs or the nervous filesystem. Organoid intelligent computer systems can be an example of biohybrid systems. == Differences with non-organic computing == As opposed to traditional non-organic silicon-based approaches, OI seeks to use lab-grown cerebral organoids to serve as "biological hardware". While these structures are still far from being able to think like a regular human brain and do not yet possess strong computing capabilities, OI research currently offers the potential to improve the understanding of brain development, learning and memory, potentially finding treatments for neurological disorders such as dementia. Thomas Hartung, a professor from Johns Hopkins University, argued in 2023 that "while silicon-based computers are certainly better with numbers, brains are better at learning." He noted that transistor density in computer chip may be approaching its limits, whereas brains, being wired differently, are more energy-efficient and can store large amounts of information. Some researchers claim that even though human brains are slower than machines at processing simple information, they are far better at processing complex information as brains can deal with fewer and more uncertain data, perform both sequential and parallel processing, being highly heterogenous, use incomplete datasets, and is said to outperform non-organic machines in decision-making. Training OIs involve the process of biological learning (BL) as opposed to machine learning (ML) for AIs. == Bioinformatics in OI == OI generates complex biological data, necessitating sophisticated methods for processing and analysis. Bioinformatics provides the tools and techniques to decipher raw data, uncovering the patterns and insights. Researchers have developed a platform named Neuroplatform for experimenting remotely with brain organoids via an API. == Intended functions == Brain-inspired computing hardware aims to emulate the structure and working principles of the brain and could be used to address current limitations in AI technologies. However, brain-inspired silicon chips are still limited in their ability to fully mimic brain function, as most examples are built on digital electronic principles. One study performed OI computation (which they termed Brainoware) by sending and receiving information from the brain organoid using a high-density multielectrode array. By applying spatiotemporal electrical stimulation, nonlinear dynamics, and fading memory properties, as well as unsupervised learning from training data by reshaping the organoid functional connectivity, the study showed the potential of this technology by using it for speech recognition and nonlinear equation prediction in a reservoir computing framework. == Ethical concerns == While researchers are hoping to use OI and biological computing to complement traditional silicon-based computing, there are also questions about the ethics of such an approach. Concerns include the possibility that an organoid could develop sentience or consciousness, and the question of the relationship between a stem cell donor (for growing the organoid) and the respective OI system.

    Read more →
  • Variational message passing

    Variational message passing

    Variational message passing (VMP) is an approximate inference technique for continuous- or discrete-valued Bayesian networks, with conjugate-exponential parents, developed by John Winn. VMP was developed as a means of generalizing the approximate variational methods used by such techniques as latent Dirichlet allocation, and works by updating an approximate distribution at each node through messages in the node's Markov blanket. == Likelihood lower bound == Given some set of hidden variables H {\displaystyle H} and observed variables V {\displaystyle V} , the goal of approximate inference is to maximize a lower-bound on the probability that a graphical model is in the configuration V {\displaystyle V} . Over some probability distribution Q {\displaystyle Q} (to be defined later), ln ⁡ P ( V ) = ∑ H Q ( H ) ln ⁡ P ( H , V ) P ( H | V ) = ∑ H Q ( H ) [ ln ⁡ P ( H , V ) Q ( H ) − ln ⁡ P ( H | V ) Q ( H ) ] {\displaystyle \ln P(V)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{P(H|V)}}=\sum _{H}Q(H){\Bigg [}\ln {\frac {P(H,V)}{Q(H)}}-\ln {\frac {P(H|V)}{Q(H)}}{\Bigg ]}} . So, if we define our lower bound to be L ( Q ) = ∑ H Q ( H ) ln ⁡ P ( H , V ) Q ( H ) {\displaystyle L(Q)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{Q(H)}}} , then the likelihood is simply this bound plus the relative entropy between P {\displaystyle P} and Q {\displaystyle Q} . Because the relative entropy is non-negative, the function L {\displaystyle L} defined above is indeed a lower bound of the log likelihood of our observation V {\displaystyle V} . The distribution Q {\displaystyle Q} will have a simpler character than that of P {\displaystyle P} because marginalizing over P {\displaystyle P} is intractable for all but the simplest of graphical models. In particular, VMP uses a factorized distribution Q ( H ) = ∏ i Q i ( H i ) , {\displaystyle Q(H)=\prod _{i}Q_{i}(H_{i}),} where H i {\displaystyle H_{i}} is a disjoint part of the graphical model. == Determining the update rule == The likelihood estimate needs to be as large as possible; because it's a lower bound, getting closer log ⁡ P {\displaystyle \log P} improves the approximation of the log likelihood. By substituting in the factorized version of Q {\displaystyle Q} , L ( Q ) {\displaystyle L(Q)} , parameterized over the hidden nodes H i {\displaystyle H_{i}} as above, is simply the negative relative entropy between Q j {\displaystyle Q_{j}} and Q j ∗ {\displaystyle Q_{j}^{}} plus other terms independent of Q j {\displaystyle Q_{j}} if Q j ∗ {\displaystyle Q_{j}^{}} is defined as Q j ∗ ( H j ) = 1 Z e E − j { ln ⁡ P ( H , V ) } {\displaystyle Q_{j}^{}(H_{j})={\frac {1}{Z}}e^{\mathbb {E} _{-j}\{\ln P(H,V)\}}} , where E − j { ln ⁡ P ( H , V ) } {\displaystyle \mathbb {E} _{-j}\{\ln P(H,V)\}} is the expectation over all distributions Q i {\displaystyle Q_{i}} except Q j {\displaystyle Q_{j}} . Thus, if we set Q j {\displaystyle Q_{j}} to be Q j ∗ {\displaystyle Q_{j}^{}} , the bound L {\displaystyle L} is maximized. == Messages in variational message passing == Parents send their children the expectation of their sufficient statistic while children send their parents their natural parameter, which also requires messages to be sent from the co-parents of the node. == Relationship to exponential families == Because all nodes in VMP come from exponential families and all parents of nodes are conjugate to their children nodes, the expectation of the sufficient statistic can be computed from the normalization factor. == VMP algorithm == The algorithm begins by computing the expected value of the sufficient statistics for that vector. Then, until the likelihood converges to a stable value (this is usually accomplished by setting a small threshold value and running the algorithm until it increases by less than that threshold value), do the following at each node: Get all messages from parents. Get all messages from children (this might require the children to get messages from the co-parents). Compute the expected value of the nodes sufficient statistics. == Constraints == Because every child must be conjugate to its parent, this has limited the types of distributions that can be used in the model. For example, the parents of a Gaussian distribution must be a Gaussian distribution (corresponding to the Mean) and a gamma distribution (corresponding to the precision, or one over σ {\displaystyle \sigma } in more common parameterizations). Discrete variables can have Dirichlet parents, and Poisson and exponential nodes must have gamma parents. More recently, VMP has been extended to handle models that violate this conditional conjugacy constraint. == Literature == John Winn; Christopher M. Bishop (2005). "Variational Message Passing" (PDF). Journal of Machine Learning Research. 6: 661–694. ISSN 1533-7928. Wikidata Q139488859. Beal, M.J. (2003). Variational Algorithms for Approximate Bayesian Inference (PDF) (PhD). Gatsby Computational Neuroscience Unit, University College London. Archived from the original (PDF) on 2005-04-28. Retrieved 2007-02-15.

    Read more →
  • Low-rank approximation

    Low-rank approximation

    In mathematics, low-rank approximation refers to the process of approximating a given matrix by a matrix of lower rank. More precisely, it is a minimization problem, in which the cost function measures the fit between a given matrix (the data) and an approximating matrix (the optimization variable), subject to a constraint that the approximating matrix has reduced rank. The problem is used for mathematical modeling and data compression. The rank constraint is related to a constraint on the complexity of a model that fits the data. In applications, often there are other constraints on the approximating matrix apart from the rank constraint, e.g., non-negativity and Hankel structure. Low-rank approximation is closely related to numerous other techniques, including principal component analysis, factor analysis, total least squares, latent semantic analysis, orthogonal regression, and dynamic mode decomposition. == Definition == Given structure specification S : R n p → R m × n {\displaystyle {\mathcal {S}}:\mathbb {R} ^{n_{p}}\to \mathbb {R} ^{m\times n}} , vector of structure parameters p ∈ R n p {\displaystyle p\in \mathbb {R} ^{n_{p}}} , norm ‖ ⋅ ‖ {\displaystyle \|\cdot \|} , and desired rank r {\displaystyle r} , minimize over p ^ ‖ p − p ^ ‖ subject to rank ⁡ ( S ( p ^ ) ) ≤ r . {\displaystyle {\text{minimize}}\quad {\text{over }}{\widehat {p}}\quad \|p-{\widehat {p}}\|\quad {\text{subject to}}\quad \operatorname {rank} {\big (}{\mathcal {S}}({\widehat {p}}){\big )}\leq r.} == Applications == Linear system identification, in which case the approximating matrix is Hankel structured. Machine learning, in which case the approximating matrix is nonlinearly structured. Recommender systems, in which cases the data matrix has missing values and the approximation is categorical. Distance matrix completion, in which case there is a positive definiteness constraint. Natural language processing, in which case the approximation is nonnegative. Computer algebra, in which case the approximation is Sylvester structured. Matrix product states, in which case the approximation is usually rescaled to have fixed Frobenius norm. == Basic low-rank approximation problem == The unstructured problem with fit measured by the Frobenius norm, i.e., minimize over D ^ ‖ D − D ^ ‖ F subject to rank ⁡ ( D ^ ) ≤ r {\displaystyle {\text{minimize}}\quad {\text{over }}{\widehat {D}}\quad \|D-{\widehat {D}}\|_{\text{F}}\quad {\text{subject to}}\quad \operatorname {rank} {\big (}{\widehat {D}}{\big )}\leq r} has an analytic solution in terms of the singular value decomposition of the data matrix. The result is referred to as the matrix approximation lemma or Eckart–Young–Mirsky theorem. This problem was originally solved by Erhard Schmidt in the infinite dimensional context of integral operators (although his methods easily generalize to arbitrary compact operators on Hilbert spaces) and later rediscovered by C. Eckart and G. Young. L. Mirsky generalized the result to arbitrary unitarily invariant norms. Let D = U Σ V ⊤ ∈ R m × n , m ≥ n {\displaystyle D=U\Sigma V^{\top }\in \mathbb {R} ^{m\times n},\quad m\geq n} be the singular value decomposition of D {\displaystyle D} , where Σ =: diag ⁡ ( σ 1 , … , σ r ) {\displaystyle \Sigma =:\operatorname {diag} (\sigma _{1},\ldots ,\sigma _{r})} , where r ≤ min { m , n } = n {\displaystyle r\leq \min\{m,n\}=n} , is the m × n {\displaystyle m\times n} rectangular diagonal matrix with r {\displaystyle r} non-zero singular values σ 1 ≥ … ≥ σ r > σ r + 1 = … = σ n = 0 {\displaystyle \sigma _{1}\geq \ldots \geq \sigma _{r}>\sigma _{r+1}=\ldots =\sigma _{n}=0} . For a given k ∈ { 1 , … , r } {\displaystyle k\in \{1,\dots ,r\}} , partition U {\displaystyle U} , Σ {\displaystyle \Sigma } , and V {\displaystyle V} as follows: U =: [ U 1 U 2 ] , Σ =: [ Σ 1 0 0 Σ 2 ] , and V =: [ V 1 V 2 ] , {\displaystyle U=:{\begin{bmatrix}U_{1}&U_{2}\end{bmatrix}},\quad \Sigma =:{\begin{bmatrix}\Sigma _{1}&0\\0&\Sigma _{2}\end{bmatrix}},\quad {\text{and}}\quad V=:{\begin{bmatrix}V_{1}&V_{2}\end{bmatrix}},} where U 1 {\displaystyle U_{1}} is m × k {\displaystyle m\times k} , Σ 1 {\displaystyle \Sigma _{1}} is k × k {\displaystyle k\times k} , and V 1 {\displaystyle V_{1}} is n × k {\displaystyle n\times k} . Then the rank k {\displaystyle k} matrix D ^ ∗ := U 1 Σ 1 V 1 ⊤ , {\displaystyle {\widehat {D}}^{}:=U_{1}\Sigma _{1}V_{1}^{\top },} obtained from the truncated singular value decomposition is such that ‖ D − D ^ ∗ ‖ F = min rank ⁡ ( D ^ ) ≤ k ‖ D − D ^ ‖ F = σ k + 1 2 + ⋯ + σ r 2 . {\displaystyle \|D-{\widehat {D}}^{}\|_{\text{F}}=\min _{\operatorname {rank} ({\widehat {D}})\leq k}\|D-{\widehat {D}}\|_{\text{F}}={\sqrt {\sigma _{k+1}^{2}+\cdots +\sigma _{r}^{2}}}.} The minimizer D ^ ∗ {\displaystyle {\widehat {D}}^{}} is unique if and only if σ k > σ k + 1 {\displaystyle \sigma _{k}>\sigma _{k+1}} . == Proof of Eckart–Young–Mirsky theorem (for spectral norm) == Let A ∈ R m × n {\displaystyle A\in \mathbb {R} ^{m\times n}} be a real (possibly rectangular) matrix with m ≤ n {\displaystyle m\leq n} . Suppose that A = U Σ V ⊤ {\displaystyle A=U\Sigma V^{\top }} is the singular value decomposition of A {\displaystyle A} . Recall that U {\displaystyle U} and V {\displaystyle V} are orthogonal matrices, and Σ {\displaystyle \Sigma } is an m × n {\displaystyle m\times n} diagonal matrix with entries ( σ 1 , σ 2 , ⋯ , σ m ) {\displaystyle (\sigma _{1},\sigma _{2},\cdots ,\sigma _{m})} such that σ 1 ≥ σ 2 ≥ ⋯ ≥ σ m ≥ 0 {\displaystyle \sigma _{1}\geq \sigma _{2}\geq \cdots \geq \sigma _{m}\geq 0} . We claim that the best rank- k {\displaystyle k} approximation to A {\displaystyle A} in the spectral norm, denoted by ‖ ⋅ ‖ 2 {\displaystyle \|\cdot \|_{2}} , is given by A k := ∑ i = 1 k σ i u i v i ⊤ {\displaystyle A_{k}:=\sum _{i=1}^{k}\sigma _{i}u_{i}v_{i}^{\top }} where u i {\displaystyle u_{i}} and v i {\displaystyle v_{i}} denote the i {\displaystyle i} th column of U {\displaystyle U} and V {\displaystyle V} , respectively. First, note that we have ‖ A − A k ‖ 2 = ‖ ∑ i = 1 n σ i u i v i ⊤ − ∑ i = 1 k σ i u i v i ⊤ ‖ 2 = ‖ ∑ i = k + 1 n σ i u i v i ⊤ ‖ 2 = σ k + 1 {\displaystyle \|A-A_{k}\|_{2}=\left\|\sum _{i=1}^{\color {red}{n}}\sigma _{i}u_{i}v_{i}^{\top }-\sum _{i=1}^{\color {red}{k}}\sigma _{i}u_{i}v_{i}^{\top }\right\|_{2}=\left\|\sum _{i=\color {red}{k+1}}^{n}\sigma _{i}u_{i}v_{i}^{\top }\right\|_{2}=\sigma _{k+1}} Therefore, we need to show that if B k = X Y ⊤ {\displaystyle B_{k}=XY^{\top }} where X {\displaystyle X} and Y {\displaystyle Y} have k {\displaystyle k} columns then ‖ A − A k ‖ 2 = σ k + 1 ≤ ‖ A − B k ‖ 2 {\displaystyle \|A-A_{k}\|_{2}=\sigma _{k+1}\leq \|A-B_{k}\|_{2}} . Since Y {\displaystyle Y} has k {\displaystyle k} columns, then there must be a nontrivial linear combination of the first k + 1 {\displaystyle k+1} columns of V {\displaystyle V} , i.e., w = γ 1 v 1 + ⋯ + γ k + 1 v k + 1 , {\displaystyle w=\gamma _{1}v_{1}+\cdots +\gamma _{k+1}v_{k+1},} such that Y ⊤ w = 0 {\displaystyle Y^{\top }w=0} . Without loss of generality, we can scale w {\displaystyle w} so that ‖ w ‖ 2 = 1 {\displaystyle \|w\|_{2}=1} or (equivalently) γ 1 2 + ⋯ + γ k + 1 2 = 1 {\displaystyle \gamma _{1}^{2}+\cdots +\gamma _{k+1}^{2}=1} . Therefore, ‖ A − B k ‖ 2 2 ≥ ‖ ( A − B k ) w ‖ 2 2 = ‖ A w ‖ 2 2 = γ 1 2 σ 1 2 + ⋯ + γ k + 1 2 σ k + 1 2 ≥ σ k + 1 2 . {\displaystyle \|A-B_{k}\|_{2}^{2}\geq \|(A-B_{k})w\|_{2}^{2}=\|Aw\|_{2}^{2}=\gamma _{1}^{2}\sigma _{1}^{2}+\cdots +\gamma _{k+1}^{2}\sigma _{k+1}^{2}\geq \sigma _{k+1}^{2}.} The result follows by taking the square root of both sides of the above inequality. == Proof of Eckart–Young–Mirsky theorem (for Frobenius norm) == Let A ∈ R m × n {\displaystyle A\in \mathbb {R} ^{m\times n}} be a real (possibly rectangular) matrix with m ≤ n {\displaystyle m\leq n} . Suppose that A = U Σ V ⊤ {\displaystyle A=U\Sigma V^{\top }} is the singular value decomposition of A {\displaystyle A} . We claim that the best rank k {\displaystyle k} approximation to A {\displaystyle A} in the Frobenius norm, denoted by ‖ ⋅ ‖ F {\displaystyle \|\cdot \|_{F}} , is given by A k = ∑ i = 1 k σ i u i v i ⊤ {\displaystyle A_{k}=\sum _{i=1}^{k}\sigma _{i}u_{i}v_{i}^{\top }} where u i {\displaystyle u_{i}} and v i {\displaystyle v_{i}} denote the i {\displaystyle i} th column of U {\displaystyle U} and V {\displaystyle V} , respectively. First, note that we have ‖ A − A k ‖ F 2 = ‖ ∑ i = k + 1 n σ i u i v i ⊤ ‖ F 2 = ∑ i = k + 1 n σ i 2 {\displaystyle \|A-A_{k}\|_{F}^{2}=\left\|\sum _{i=k+1}^{n}\sigma _{i}u_{i}v_{i}^{\top }\right\|_{F}^{2}=\sum _{i=k+1}^{n}\sigma _{i}^{2}} Therefore, we need to show that if B k = X Y ⊤ {\displaystyle B_{k}=XY^{\top }} where X {\displaystyle X} and Y {\displaystyle Y} have k {\displaystyle k} columns then ‖ A − A k ‖ F 2 = ∑ i = k + 1 n σ i 2 ≤ ‖ A − B k ‖ F 2 . {\displaystyle \|A-A_{k}\|_{F}^{2}=\sum _{i=k+1}^{n}\sigma _{i}^{2}\leq \|A-B_{k}\|_{F}^{2}.} By the triangle inequality with the spectral norm

    Read more →
  • Fitness approximation

    Fitness approximation

    Fitness approximation aims to approximate the objective or fitness functions in evolutionary optimization by building up machine learning models based on data collected from numerical simulations or physical experiments. The machine learning models for fitness approximation are also known as meta-models or surrogates, and evolutionary optimization based on approximated fitness evaluations are also known as surrogate-assisted evolutionary approximation. Fitness approximation in evolutionary optimization can be seen as a sub-area of data-driven evolutionary optimization. == Approximate models in function optimization == === Motivation === In many real-world optimization problems including engineering problems, the number of fitness function evaluations needed to obtain a good solution dominates the optimization cost. In order to obtain efficient optimization algorithms, it is crucial to use prior information gained during the optimization process. Conceptually, a natural approach to utilizing the known prior information is building a model of the fitness function to assist in the selection of candidate solutions for evaluation. A variety of techniques for constructing such a model, often also referred to as surrogates, metamodels or approximation models – for computationally expensive optimization problems have been considered. === Approaches === Common approaches to constructing approximate models based on learning and interpolation from known fitness values of a small population include: Low-degree polynomials and regression models Fourier surrogate modeling Artificial neural networks including Multilayer perceptrons Radial basis function network Support vector machines Due to the limited number of training samples and high dimensionality encountered in engineering design optimization, constructing a globally valid approximate model remains difficult. As a result, evolutionary algorithms using such approximate fitness functions may converge to local optima. Therefore, it can be beneficial to selectively use the original fitness function together with the approximate model.

    Read more →
  • Image

    Image

    An image or picture is a visual representation. An image can be two-dimensional, such as a drawing, painting, or photograph, or three-dimensional, such as a carving or sculpture. Images may be displayed through other media, including a projection on a surface, activation of electronic signals, or digital displays; they can also be reproduced through mechanical means, such as photography, printmaking, or photocopying. Images can also be animated through digital or physical processes. In the context of signal processing, an image is a distributed amplitude of color(s). In optics, the term image (or optical image) refers specifically to the reproduction of an object formed by light waves coming from the object. A volatile image exists or is perceived only for a short period. This may be a reflection of an object by a mirror, a projection of a camera obscura, or a scene displayed on a cathode-ray tube. A fixed image, also called a hard copy, is one that has been recorded on a material object, such as paper or textile. A mental image exists in an individual's mind as something one remembers or imagines. The subject of an image does not need to be real; it may be an abstract concept such as a graph or function or an imaginary entity. For a mental image to be understood outside of an individual's mind, however, there must be a way of conveying that mental image through the words or visual productions of the subject. == Characteristics == === Two-dimensional images === The broader sense of the word 'image' also encompasses any two-dimensional figure, such as a map, graph, pie chart, painting, or banner. In this wider sense, images can also be rendered manually, such as by drawing, the art of painting, or the graphic arts (such as lithography or etching). Additionally, images can be rendered automatically through printing, computer graphics technology, or a combination of both methods. A two-dimensional image does not need to use the entire visual system to be a visual representation. An example of this is a grayscale ("black and white") image, which uses the visual system's sensitivity to brightness across all wavelengths without taking into account different colors. A black-and-white visual representation of something is still an image, even though it does not fully use the visual system's capabilities. On the other hand, some processes can be used to create visual representations of objects that are otherwise inaccessible to the human visual system. These include microscopy for the magnification of minute objects, telescopes that can observe objects at great distances, X-rays that can visually represent the interior structures of the human body (among other objects), magnetic resonance imaging (MRI), positron emission tomography (PET scans), and others. Such processes often rely on detecting electromagnetic radiation that occurs beyond the light spectrum visible to the human eye and converting such signals into recognizable images. === Three-dimensional images === Aside from sculpture and other physical activities that can create three-dimensional images from solid material, some modern techniques, such as holography, can create three-dimensional images that are reproducible but intangible to human touch. Some photographic processes can now render the illusion of depth in an otherwise "flat" image, but "3-D photography" (stereoscopy) or "3-D film" are optical illusions that require special devices such as eyeglasses to create the illusion of depth. === Moving images === "Moving" two-dimensional images are actually illusions of movement perceived when still images are displayed in sequence, each image lasting less, and sometimes much less, than a fraction of a second. The traditional standard for the display of individual frames by a motion picture projector has been 24 frames per second (FPS) since at least the commercial introduction of "talking pictures" in the late 1920s, which necessitated a standard for synchronizing images and sounds. Even in electronic formats such as television and digital image displays, the apparent "motion" is actually the result of many individual lines giving the impression of continuous movement. This phenomenon has often been described as "persistence of vision": a physiological effect of light impressions remaining on the retina of the eye for very brief periods. Even though the term is still sometimes used in popular discussions of movies, it is not a scientifically valid explanation. Other terms emphasize the complex cognitive operations of the brain and the human visual system. "Flicker fusion", the "phi phenomenon", and "beta movement" are among the terms that have replaced "persistence of vision", though no one term seems adequate to describe the process. == Cultural and other uses == Image-making seems to have been common to virtually all human cultures since at least the Paleolithic era. Prehistoric examples of rock art—including cave paintings, petroglyphs, rock reliefs, and geoglyphs—have been found on every inhabited continent. Many of these images seem to have served various purposes: as a form of record-keeping; as an element of spiritual, religious, or magical practice; or even as a form of communication. Early writing systems, including hieroglyphics, ideographic writing, and even the Roman alphabet, owe their origins in some respects to pictorial representations. === Meaning and signification === Images of any type may convey different meanings and sensations for individual viewers, regardless of whether the image's creator intended them. An image may be taken simply as a more or less "accurate" copy of a person, place, thing, or event. It may represent an abstract concept, such as the political power of a ruler or ruling class, a practical or moral lesson, an object for spiritual or religious veneration, or an object—human or otherwise—to be desired. It may also be regarded for its purely aesthetic qualities, rarity, or monetary value. Such reactions can depend on the viewer's context. A religious image in a church may be regarded differently than the same image mounted in a museum. Some might view it simply as an object to be bought or sold. Viewers' reactions will also be guided or shaped by their education, class, race, and other contexts. The study of emotional sensations and their relationship to any given image falls into the categories of aesthetics and the philosophy of art. While such studies inevitably deal with issues of meaning, another approach to signification was suggested by the American philosopher, logician, and semiotician Charles Sanders Peirce. "Images" are one type of the broad category of "signs" proposed by Peirce. Although his ideas are complex and have changed over time, the three categories of signs that he distinguished stand out: The "icon," which relates to an object by resemblance to some quality of the object. A painted or photographed portrait is an icon by virtue of its resemblance to the painting's or photograph's subject. A more abstract representation, such as a map or diagram, can also be an icon. The "index," which relates to an object by some real connection. For example, smoke may be an index of fire, or the temperature recorded on a thermometer may be an index of a patient's illness or health. The "symbol," which lacks direct resemblance or connection to an object but whose association is arbitrarily assigned by the creator or dictated by cultural and historical habit, convention, etc. The color red, for example, may connote rage, beauty, prosperity, political affiliation, or other meanings within a given culture or context; the Swedish film director Ingmar Bergman claimed that his use of the color in his 1972 film Cries and Whispers came from his personal visualization of the human soul. A single image may exist in all three categories at the same time. The Statue of Liberty provides an example. While there have been countless two-dimensional and three-dimensional "reproductions" of the statue (i.e., "icons" themselves), the statue itself exists as an "icon" by virtue of its resemblance to a human woman (or, more specifically, previous representations of the Roman goddess Libertas or the female model used by the artist Frederic-Auguste Bartholdi). an "index" representing New York City or the United States of America in general due to its placement in New York Harbor, or with "immigration" from its proximity to the immigration center at Ellis Island. a "symbol" as a visualization of the abstract concept of "liberty" or "freedom" or even "opportunity" or "diversity". === Critiques of imagery === The nature of images, whether three-dimensional or two-dimensional, created for a specific purpose or only for aesthetic pleasure, has continued to provoke questions and even condemnation at different times and places. In his dialogue, The Republic, the Greek philosopher Plato described our apparent reality as a copy of a higher order of universal forms.

    Read more →
  • Dendrogram

    Dendrogram

    A dendrogram is a diagram representing a tree graph. This diagrammatic representation is frequently used in different contexts: in hierarchical clustering, it illustrates the arrangement of the clusters produced by the corresponding analyses. in computational biology, it shows the clustering of genes or samples, sometimes in the margins of heatmaps. in phylogenetics, it displays the evolutionary relationships among various biological taxa. In this case, the dendrogram is also called a phylogenetic tree. The name dendrogram derives from the two ancient greek words δένδρον (déndron), meaning "tree", and γράμμα (grámma), meaning "drawing, mathematical figure". == Clustering example == For a clustering example, suppose that five taxa ( a {\displaystyle a} to e {\displaystyle e} ) have been clustered by UPGMA based on a matrix of genetic distances. The hierarchical clustering dendrogram would show a column of five nodes representing the initial data (here individual taxa), and the remaining nodes represent the clusters to which the data belong, with the arrows representing the distance (dissimilarity). The distance between merged clusters is monotone, increasing with the level of the merger: the height of each node in the plot is proportional to the value of the intergroup dissimilarity between its two daughters (the nodes on the right representing individual observations all plotted at zero height).

    Read more →
  • Distribution learning theory

    Distribution learning theory

    The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and it was inspired from the PAC-framework introduced by Leslie Valiant. In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like machine learning, approximation algorithms, applied probability and statistics. This article explains the basic definitions, tools and results in this framework from the theory of computation point of view. == Definitions == Let X {\displaystyle \textstyle X} be the support of the distributions of interest. As in the original work of Kearns et al. if X {\displaystyle \textstyle X} is finite it can be assumed without loss of generality that X = { 0 , 1 } n {\displaystyle \textstyle X=\{0,1\}^{n}} where n {\displaystyle \textstyle n} is the number of bits that have to be used in order to represent any y ∈ X {\displaystyle \textstyle y\in X} . We focus in probability distributions over X {\displaystyle \textstyle X} . There are two possible representations of a probability distribution D {\displaystyle \textstyle D} over X {\displaystyle \textstyle X} . probability distribution function (or evaluator) an evaluator E D {\displaystyle \textstyle E_{D}} for D {\displaystyle \textstyle D} takes as input any y ∈ X {\displaystyle \textstyle y\in X} and outputs a real number E D [ y ] {\displaystyle \textstyle E_{D}[y]} which denotes the probability that of y {\displaystyle \textstyle y} according to D {\displaystyle \textstyle D} , i.e. E D [ y ] = Pr [ Y = y ] {\displaystyle \textstyle E_{D}[y]=\Pr[Y=y]} if Y ∼ D {\displaystyle \textstyle Y\sim D} . generator a generator G D {\displaystyle \textstyle G_{D}} for D {\displaystyle \textstyle D} takes as input a string of truly random bits y {\displaystyle \textstyle y} and outputs G D [ y ] ∈ X {\displaystyle \textstyle G_{D}[y]\in X} according to the distribution D {\displaystyle \textstyle D} . Generator can be interpreted as a routine that simulates sampling from the distribution D {\displaystyle \textstyle D} given a sequence of fair coin tosses. A distribution D {\displaystyle \textstyle D} is called to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time. Let C X {\displaystyle \textstyle C_{X}} a class of distribution over X, that is C X {\displaystyle \textstyle C_{X}} is a set such that every D ∈ C X {\displaystyle \textstyle D\in C_{X}} is a probability distribution with support X {\displaystyle \textstyle X} . The C X {\displaystyle \textstyle C_{X}} can also be written as C {\displaystyle \textstyle C} for simplicity. In order to evaluate learnability, it is necessary to have a way to measure how well an approximated distribution D ′ {\displaystyle \textstyle D'} fits the sampled distribution D {\displaystyle \textstyle D} . There are several ways to measure the divergence between two distributions. Three common possibilities are Kullback–Leibler divergence Total variation distance of probability measures Kolmogorov distance Total variation and Kolmogorov distance are true metrics, while KL divergence is not (it lacks symmetry). These measures are ordered by convergence strength: closeness in KL divergence implies closeness in total variation (via Pinsker's inequality), which in turn implies closeness in Kolmogorov distance. Therefore, a learnability result proven under KL divergence automatically holds under the weaker measures, but not vice versa. Since certain measures may be more appropriate in specific applications, we will use d ( D , D ′ ) {\displaystyle \textstyle d(D,D')} to denote a selected divergence between the distribution D {\displaystyle \textstyle D} and the distribution D ′ {\displaystyle \textstyle D'} . The basic input that we use in order to learn a distribution is a number of samples drawn by this distribution. For the computational point of view the assumption is that such a sample is given in a constant amount of time. So it's like having access to an oracle G E N ( D ) {\displaystyle \textstyle GEN(D)} that returns a sample from the distribution D {\displaystyle \textstyle D} . Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution D {\displaystyle \textstyle D} in class of distributions C {\displaystyle \textstyle C} . This quantity is called sample complexity of the learning algorithm. In order for the problem of distribution learning to be more clear consider the problem of supervised learning as defined in. In this framework of statistical learning theory a training set S = { ( x 1 , y 1 ) , … , ( x n , y n ) } {\displaystyle \textstyle S=\{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}} and the goal is to find a target function f : X → Y {\displaystyle \textstyle f:X\rightarrow Y} that minimizes some loss function, e.g. the square loss function. More formally f = arg ⁡ min g ∫ V ( y , g ( x ) ) d ρ ( x , y ) {\displaystyle f=\arg \min _{g}\int V(y,g(x))d\rho (x,y)} , where V ( ⋅ , ⋅ ) {\displaystyle V(\cdot ,\cdot )} is the loss function, e.g. V ( y , z ) = ( y − z ) 2 {\displaystyle V(y,z)=(y-z)^{2}} and ρ ( x , y ) {\displaystyle \rho (x,y)} the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution ρ x ( y ) {\displaystyle \rho _{x}(y)} is known then the target function has the closed form f ( x ) = ∫ y y d ρ x ( y ) {\displaystyle f(x)=\int _{y}yd\rho _{x}(y)} . So the set S {\displaystyle S} is a set of samples from the probability distribution ρ ( x , y ) {\displaystyle \rho (x,y)} . Now the goal of distributional learning theory if to find ρ {\displaystyle \rho } given S {\displaystyle S} which can be used to find the target function f {\displaystyle f} . Definition of learnability A class of distributions C {\displaystyle \textstyle C} is called efficiently learnable if for every ϵ > 0 {\displaystyle \textstyle \epsilon >0} and 0 < δ ≤ 1 {\displaystyle \textstyle 0<\delta \leq 1} given access to G E N ( D ) {\displaystyle \textstyle GEN(D)} for an unknown distribution D ∈ C {\displaystyle \textstyle D\in C} , there exists a polynomial time algorithm A {\displaystyle \textstyle A} , called learning algorithm of C {\displaystyle \textstyle C} , that outputs a generator or an evaluator of a distribution D ′ {\displaystyle \textstyle D'} such that Pr [ d ( D , D ′ ) ≤ ϵ ] ≥ 1 − δ {\displaystyle \Pr[d(D,D')\leq \epsilon ]\geq 1-\delta } If we know that D ′ ∈ C {\displaystyle \textstyle D'\in C} then A {\displaystyle \textstyle A} is called proper learning algorithm, otherwise is called improper learning algorithm. In some settings the class of distributions C {\displaystyle \textstyle C} is a class with well known distributions which can be described by a set of parameters. For instance C {\displaystyle \textstyle C} could be the class of all the Gaussian distributions N ( μ , σ 2 ) {\displaystyle \textstyle N(\mu ,\sigma ^{2})} . In this case the algorithm A {\displaystyle \textstyle A} should be able to estimate the parameters μ , σ {\displaystyle \textstyle \mu ,\sigma } . In this case A {\displaystyle \textstyle A} is called parameter learning algorithm. Obviously the parameter learning for simple distributions is a very well studied field that is called statistical estimation and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distributions learning theory deals with learning class of distributions that have more complicated description. == First results == In their seminal work, Kearns et al. deal with the case where A {\displaystyle \textstyle A} is described in term of a finite polynomial sized circuit and they proved the following for some specific classes of distribution. O R {\displaystyle \textstyle OR} gate distributions for this kind of distributions there is no polynomial-sized evaluator, unless # P ⊆ P / poly {\displaystyle \textstyle \#P\subseteq P/{\text{poly}}} . On the other hand, this class is efficiently learnable with generator. Parity gate distributions this class is efficiently learnable with both generator and evaluator. Mixtures of Hamming Balls this class is efficiently learnable with both generator and evaluator. Probabilistic Finite Automata this class is not efficiently learnable with evaluator under the Noisy Parity Assumption which is an impossibility assumption in the PAC learning fram

    Read more →