Reference data is data used to classify or categorize other data. Typically, they are static or slowly changing over time. Examples of reference data include: Units of measurement Country codes Corporate codes Fixed conversion rates e.g., weight, temperature, and length Calendar structure and constraints Reference data sets are sometimes alternatively referred to as a "controlled vocabulary" or "lookup" data. Reference data differs from master data. While both provide context for business transactions, reference data is concerned with classification and categorisation, while master data is concerned with business entities. A further difference between reference data and master data is that a change to the reference data values may require an associated change in business process to support the change, while a change in master data will always be managed as part of existing business processes. For example, adding a new customer or sales product is part of the standard business process. However, adding a new product classification (e.g. "restricted sales item") or a new customer type (e.g. "gold level customer") will result in a modification to the business processes to manage those items. == Externally-defined reference data == For most organisations, most or all reference data is defined and managed within that organisation. Some reference data, however, may be externally defined and managed, for example by standards organizations. An example of externally defined reference data is the set of country codes as defined in ISO 3166-1. == Reference data management == Curating and managing reference data is key to ensuring its quality and thus fitness for purpose. All aspects of an organisation, operational and analytical, are greatly dependent on the quality of an organization's reference data. Without consistency across business process or applications, for example, similar things may be described in quite different ways. Reference data gain in value when they are widely re-used and widely referenced. Examples of good practice in reference data management include: Formalize the reference data management Use external reference data as much as possible Govern the reference data specific to your enterprise Manage reference data at enterprise level Version control your reference data
Landmark point
In morphometrics, landmark point or shortly landmark is a point in a shape object in which correspondences between and within the populations of the object are preserved. In other disciplines, landmarks may be known as vertices, anchor points, control points, sites, profile points, 'sampling' points, nodes, markers, fiducial markers, etc. Landmarks can be defined either manually by experts or automatically by a computer program. There are three basic types of landmarks: anatomical landmarks, mathematical landmarks or pseudo-landmarks. An anatomical landmark is a biologically-meaningful point in an organism. Usually experts define anatomical points to ensure their correspondences within the same species. Examples of anatomical landmark in shape of a skull are the eye corner, tip of the nose, jaw, etc. Anatomical landmarks determine homologous parts of an organism, which share a common ancestry. Mathematical landmarks are points in a shape that are located according to some mathematical or geometrical property, for instance, a high curvature point or an extreme point. A computer program usually determines mathematical landmarks used for an automatic pattern recognition. Pseudo-landmarks are constructed points located between anatomical or mathematical landmarks. A typical example is an equally spaced set of points between two anatomical landmarks to get more sample points from a shape. Pseudo-landmarks are useful during shape matching, when the matching process requires a large number of points.
Tensor sketch
In statistics, machine learning and algorithms, a tensor sketch is a type of dimensionality reduction that is particularly efficient when applied to vectors that have tensor structure. Such a sketch can be used to speed up explicit kernel methods, bilinear pooling in neural networks and is a cornerstone in many numerical linear algebra algorithms. == Mathematical definition == Mathematically, a dimensionality reduction or sketching matrix is a matrix M ∈ R k × d {\displaystyle M\in \mathbb {R} ^{k\times d}} , where k < d {\displaystyle k Averaged one-dependence estimators (AODE) is a probabilistic classification learning technique. It was developed to address the attribute-independence problem of the popular naive Bayes classifier. It frequently develops substantially more accurate classifiers than naive Bayes at the cost of a modest increase in the amount of computation. == The AODE classifier == AODE seeks to estimate the probability of each class y given a specified set of features x1, ... xn, P(y | x1, ... xn). To do so it uses the formula P ^ ( y ∣ x 1 , … x n ) = ∑ i : 1 ≤ i ≤ n ∧ F ( x i ) ≥ m P ^ ( y , x i ) ∏ j = 1 n P ^ ( x j ∣ y , x i ) ∑ y ′ ∈ Y ∑ i : 1 ≤ i ≤ n ∧ F ( x i ) ≥ m P ^ ( y ′ , x i ) ∏ j = 1 n P ^ ( x j ∣ y ′ , x i ) {\displaystyle {\hat {P}}(y\mid x_{1},\ldots x_{n})={\frac {\sum _{i:1\leq i\leq n\wedge F(x_{i})\geq m}{\hat {P}}(y,x_{i})\prod _{j=1}^{n}{\hat {P}}(x_{j}\mid y,x_{i})}{\sum _{y^{\prime }\in Y}\sum _{i:1\leq i\leq n\wedge F(x_{i})\geq m}{\hat {P}}(y^{\prime },x_{i})\prod _{j=1}^{n}{\hat {P}}(x_{j}\mid y^{\prime },x_{i})}}} where P ^ ( ⋅ ) {\displaystyle {\hat {P}}(\cdot )} denotes an estimate of P ( ⋅ ) {\displaystyle P(\cdot )} , F ( ⋅ ) {\displaystyle F(\cdot )} is the frequency with which the argument appears in the sample data and m is a user specified minimum frequency with which a term must appear in order to be used in the outer summation. In recent practice m is usually set at 1. == Derivation of the AODE classifier == We seek to estimate P(y | x1, ... xn). By the definition of conditional probability P ( y ∣ x 1 , … x n ) = P ( y , x 1 , … x n ) P ( x 1 , … x n ) . {\displaystyle P(y\mid x_{1},\ldots x_{n})={\frac {P(y,x_{1},\ldots x_{n})}{P(x_{1},\ldots x_{n})}}.} For any 1 ≤ i ≤ n {\displaystyle 1\leq i\leq n} , P ( y , x 1 , … x n ) = P ( y , x i ) P ( x 1 , … x n ∣ y , x i ) . {\displaystyle P(y,x_{1},\ldots x_{n})=P(y,x_{i})P(x_{1},\ldots x_{n}\mid y,x_{i}).} Under an assumption that x1, ... xn are independent given y and xi, it follows that P ( y , x 1 , … x n ) = P ( y , x i ) ∏ j = 1 n P ( x j ∣ y , x i ) . {\displaystyle P(y,x_{1},\ldots x_{n})=P(y,x_{i})\prod _{j=1}^{n}P(x_{j}\mid y,x_{i}).} This formula defines a special form of One Dependence Estimator (ODE), a variant of the naive Bayes classifier that makes the above independence assumption that is weaker (and hence potentially less harmful) than the naive Bayes' independence assumption. In consequence, each ODE should create a less biased estimator than naive Bayes. However, because the base probability estimates are each conditioned by two variables rather than one, they are formed from less data (the training examples that satisfy both variables) and hence are likely to have more variance. AODE reduces this variance by averaging the estimates of all such ODEs. == Features of the AODE classifier == Like naive Bayes, AODE does not perform model selection and does not use tuneable parameters. As a result, it has low variance. It supports incremental learning whereby the classifier can be updated efficiently with information from new examples as they become available. It predicts class probabilities rather than simply predicting a single class, allowing the user to determine the confidence with which each classification can be made. Its probabilistic model can directly handle situations where some data are missing. AODE has computational complexity O ( l n 2 ) {\displaystyle O(ln^{2})} at training time and O ( k n 2 ) {\displaystyle O(kn^{2})} at classification time, where n is the number of features, l is the number of training examples and k is the number of classes. This makes it infeasible for application to high-dimensional data. However, within that limitation, it is linear with respect to the number of training examples and hence can efficiently process large numbers of training examples. == Implementations == The free Weka machine learning suite includes an implementation of AODE. Neural Networks is a monthly peer-reviewed scientific journal and an official journal of the International Neural Network Society, European Neural Network Society, and Japanese Neural Network Society. == History == The journal was established in 1988 and is published by Elsevier. It covers all aspects of research on artificial neural networks. The founding editor-in-chief was Stephen Grossberg (Boston University). The current editors-in-chief are DeLiang Wang (Ohio State University) and Taro Toyoizumi (RIKEN Center for Brain Science). == Abstracting and indexing == The journal is abstracted and indexed in Scopus and the Science Citation Index Expanded. According to the Journal Citation Reports, the journal has a 2022 impact factor of 7.8. In artificial intelligence and related fields, an argumentation framework is a way to deal with contentious information and draw conclusions from it using formalized arguments. In an abstract argumentation framework, entry-level information is a set of abstract arguments that, for instance, represent data or a proposition. Conflicts between arguments are represented by a binary relation on the set of arguments. In concrete terms, an argumentation framework is represented with a directed graph such that the nodes are the arguments, and the arrows represent the attack relation. There exist some extensions of the Dung's framework, like the logic-based argumentation frameworks or the value-based argumentation frameworks. == Abstract argumentation frameworks == === Formal framework === Abstract argumentation frameworks, also called argumentation frameworks à la Dung, are defined formally as a pair: A set of abstract elements called arguments, denoted A {\displaystyle A} A binary relation on A {\displaystyle A} , called attack relation, denoted R {\displaystyle R} For instance, the argumentation system S = ⟨ A , R ⟩ {\displaystyle S=\langle A,R\rangle } with A = { a , b , c , d } {\displaystyle A=\{a,b,c,d\}} and R = { ( a , b ) , ( b , c ) , ( d , c ) } {\displaystyle R=\{(a,b),(b,c),(d,c)\}} contains four arguments ( a , b , c {\displaystyle a,b,c} and d {\displaystyle d} ) and three attacks ( a {\displaystyle a} attacks b {\displaystyle b} , b {\displaystyle b} attacks c {\displaystyle c} and d {\displaystyle d} attacks c {\displaystyle c} ). Dung defines some notions : an argument a ∈ A {\displaystyle a\in A} is acceptable with respect to E ⊆ A {\displaystyle E\subseteq A} if and only if E {\displaystyle E} defends a {\displaystyle a} , that is ∀ b ∈ A {\displaystyle \forall b\in A} such that ( b , a ) ∈ R , ∃ c ∈ E {\displaystyle (b,a)\in R,\exists c\in E} such that ( c , b ) ∈ R {\displaystyle (c,b)\in R} , a set of arguments E {\displaystyle E} is conflict-free if there is no attack between its arguments, formally : ∀ a , b ∈ E , ( a , b ) ∉ R {\displaystyle \forall a,b\in E,(a,b)\not \in R} , a set of arguments E {\displaystyle E} is admissible if and only if it is conflict-free and all its arguments are acceptable with respect to E {\displaystyle E} . === Different semantics of acceptance === ==== Extensions ==== To decide if an argument can be accepted or not, or if several arguments can be accepted together, Dung defines several semantics of acceptance that allows, given an argumentation system, sets of arguments (called extensions) to be computed. For instance, given S = ⟨ A , R ⟩ {\displaystyle S=\langle A,R\rangle } , E {\displaystyle E} is a complete extension of S {\displaystyle S} only if it is an admissible set and every acceptable argument with respect to E {\displaystyle E} belongs to E {\displaystyle E} , E {\displaystyle E} is a preferred extension of S {\displaystyle S} only if it is a maximal element (with respect to the set-theoretical inclusion) among the admissible sets with respect to S {\displaystyle S} , E {\displaystyle E} is a stable extension of S {\displaystyle S} only if it is a conflict-free set that attacks every argument that does not belong in E {\displaystyle E} (formally, ∀ a ∈ A ∖ E , ∃ b ∈ E {\displaystyle \forall a\in A\backslash E,\exists b\in E} such that ( b , a ) ∈ R {\displaystyle (b,a)\in R} , E {\displaystyle E} is the (unique) grounded extension of S {\displaystyle S} only if it is the smallest element (with respect to set inclusion) among the complete extensions of S {\displaystyle S} . There exists some inclusions between the sets of extensions built with these semantics : Every stable extension is preferred, Every preferred extension is complete, The grounded extension is complete, If the system is well-founded (there exists no infinite sequence a 0 , a 1 , … , a n , … {\displaystyle a_{0},a_{1},\dots ,a_{n},\dots } such that ∀ i > 0 , ( a i + 1 , a i ) ∈ R {\displaystyle \forall i>0,(a_{i+1},a_{i})\in R} ), all these semantics coincide—only one extension is grounded, stable, preferred, and complete. Some other semantics have been defined. One introduce the notation E x t σ ( S ) {\displaystyle Ext_{\sigma }(S)} to note the set of σ {\displaystyle \sigma } -extensions of the system S {\displaystyle S} . In the case of the system S {\displaystyle S} in the figure above, E x t σ ( S ) = { { a , d } } {\displaystyle Ext_{\sigma }(S)=\{\{a,d\}\}} for every Dung's semantic—the system is well-founded. That explains why the semantics coincide, and the accepted arguments are: a {\displaystyle a} and d {\displaystyle d} . ==== Labellings ==== Labellings are a more expressive way than extensions to express the acceptance of the arguments. Concretely, a labelling is a mapping that associates every argument with a label in (the argument is accepted), out (the argument is rejected), or undec (the argument is undefined—not accepted or refused). One can also note a labelling as a set of pairs ( a r g u m e n t , l a b e l ) {\displaystyle ({\mathit {argument}},{\mathit {label}})} . Such a mapping does not make sense without additional constraint. The notion of reinstatement labelling guarantees the sense of the mapping. L {\displaystyle L} is a reinstatement labelling on the system S = ⟨ A , R ⟩ {\displaystyle S=\langle A,R\rangle } if and only if : ∀ a ∈ A , L ( a ) = i n {\displaystyle \forall a\in A,L(a)={\mathit {in}}} if and only if ∀ b ∈ A {\displaystyle \forall b\in A} such that ( b , a ) ∈ R , L ( b ) = o u t {\displaystyle (b,a)\in R,L(b)={\mathit {out}}} ∀ a ∈ A , L ( a ) = o u t {\displaystyle \forall a\in A,L(a)={\mathit {out}}} if and only if ∃ b ∈ A {\displaystyle \exists b\in A} such that ( b , a ) ∈ R {\displaystyle (b,a)\in R} and L ( b ) = i n {\displaystyle L(b)={\mathit {in}}} ∀ a ∈ A , L ( a ) = u n d e c {\displaystyle \forall a\in A,L(a)={\mathit {undec}}} if and only if L ( a ) ≠ i n {\displaystyle L(a)\neq {\mathit {in}}} and L ( a ) ≠ o u t {\displaystyle L(a)\neq {\mathit {out}}} One can convert every extension into a reinstatement labelling: the arguments of the extension are in, those attacked by an argument of the extension are out, and the others are undec. Conversely, one can build an extension from a reinstatement labelling just by keeping the arguments in. Indeed, Caminada proved that the reinstatement labellings and the complete extensions can be mapped in a bijective way. Moreover, the other Datung's semantics can be associated to some particular sets of reinstatement labellings. Reinstatement labellings distinguish arguments not accepted because they are attacked by accepted arguments from undefined arguments—that is, those that are not defended cannot defend themselves. An argument is undec if it is attacked by at least another undec. If it is attacked only by arguments out, it must be in, and if it is attacked some argument in, then it is out. The unique reinstatement labelling that corresponds to the system S {\displaystyle S} above is L = { ( a , i n ) , ( b , o u t ) , ( c , o u t ) , ( d , i n ) } {\displaystyle L=\{(a,{\mathit {in}}),(b,{\mathit {out}}),(c,{\mathit {out}}),(d,{\mathit {in}})\}} . === Inference from an argumentation system === In the general case when several extensions are computed for a given semantic σ {\displaystyle \sigma } , the agent that reasons from the system can use several mechanisms to infer information: Credulous inference: the agent accepts an argument if it belongs to at least one of the σ {\displaystyle \sigma } -extensions—in which case, the agent risks accepting some arguments that are not acceptable together ( a {\displaystyle a} attacks b {\displaystyle b} , and a {\displaystyle a} and b {\displaystyle b} each belongs to an extension) Skeptical inference: the agent accepts an argument only if it belongs to every σ {\displaystyle \sigma } -extension. In this case, the agent risks deducing too little information (if the intersection of the extensions is empty or has a very small cardinal). For these two methods to infer information, one can identify the set of accepted arguments, respectively C r σ ( S ) {\displaystyle Cr_{\sigma }(S)} the set of the arguments credulously accepted under the semantic σ {\displaystyle \sigma } , and S c σ ( S ) {\displaystyle Sc_{\sigma }(S)} the set of arguments accepted skeptically under the semantic σ {\displaystyle \sigma } (the σ {\displaystyle \sigma } can be missed if there is no possible ambiguity about the semantic). Of course, when there is only one extension (for instance, when the system is well-founded), this problem is very simple: the agent accepts arguments of the unique extension and rejects others. The same reasoning can be done with labellings that correspond to the chosen semantic : an argument can be accepted if it is in for each labelling and refused if it is out for each labelling, the others being in an undecided state (the status of the arguments can remind the The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and it was inspired from the PAC-framework introduced by Leslie Valiant. In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like machine learning, approximation algorithms, applied probability and statistics. This article explains the basic definitions, tools and results in this framework from the theory of computation point of view. == Definitions == Let X {\displaystyle \textstyle X} be the support of the distributions of interest. As in the original work of Kearns et al. if X {\displaystyle \textstyle X} is finite it can be assumed without loss of generality that X = { 0 , 1 } n {\displaystyle \textstyle X=\{0,1\}^{n}} where n {\displaystyle \textstyle n} is the number of bits that have to be used in order to represent any y ∈ X {\displaystyle \textstyle y\in X} . We focus in probability distributions over X {\displaystyle \textstyle X} . There are two possible representations of a probability distribution D {\displaystyle \textstyle D} over X {\displaystyle \textstyle X} . probability distribution function (or evaluator) an evaluator E D {\displaystyle \textstyle E_{D}} for D {\displaystyle \textstyle D} takes as input any y ∈ X {\displaystyle \textstyle y\in X} and outputs a real number E D [ y ] {\displaystyle \textstyle E_{D}[y]} which denotes the probability that of y {\displaystyle \textstyle y} according to D {\displaystyle \textstyle D} , i.e. E D [ y ] = Pr [ Y = y ] {\displaystyle \textstyle E_{D}[y]=\Pr[Y=y]} if Y ∼ D {\displaystyle \textstyle Y\sim D} . generator a generator G D {\displaystyle \textstyle G_{D}} for D {\displaystyle \textstyle D} takes as input a string of truly random bits y {\displaystyle \textstyle y} and outputs G D [ y ] ∈ X {\displaystyle \textstyle G_{D}[y]\in X} according to the distribution D {\displaystyle \textstyle D} . Generator can be interpreted as a routine that simulates sampling from the distribution D {\displaystyle \textstyle D} given a sequence of fair coin tosses. A distribution D {\displaystyle \textstyle D} is called to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time. Let C X {\displaystyle \textstyle C_{X}} a class of distribution over X, that is C X {\displaystyle \textstyle C_{X}} is a set such that every D ∈ C X {\displaystyle \textstyle D\in C_{X}} is a probability distribution with support X {\displaystyle \textstyle X} . The C X {\displaystyle \textstyle C_{X}} can also be written as C {\displaystyle \textstyle C} for simplicity. In order to evaluate learnability, it is necessary to have a way to measure how well an approximated distribution D ′ {\displaystyle \textstyle D'} fits the sampled distribution D {\displaystyle \textstyle D} . There are several ways to measure the divergence between two distributions. Three common possibilities are Kullback–Leibler divergence Total variation distance of probability measures Kolmogorov distance Total variation and Kolmogorov distance are true metrics, while KL divergence is not (it lacks symmetry). These measures are ordered by convergence strength: closeness in KL divergence implies closeness in total variation (via Pinsker's inequality), which in turn implies closeness in Kolmogorov distance. Therefore, a learnability result proven under KL divergence automatically holds under the weaker measures, but not vice versa. Since certain measures may be more appropriate in specific applications, we will use d ( D , D ′ ) {\displaystyle \textstyle d(D,D')} to denote a selected divergence between the distribution D {\displaystyle \textstyle D} and the distribution D ′ {\displaystyle \textstyle D'} . The basic input that we use in order to learn a distribution is a number of samples drawn by this distribution. For the computational point of view the assumption is that such a sample is given in a constant amount of time. So it's like having access to an oracle G E N ( D ) {\displaystyle \textstyle GEN(D)} that returns a sample from the distribution D {\displaystyle \textstyle D} . Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution D {\displaystyle \textstyle D} in class of distributions C {\displaystyle \textstyle C} . This quantity is called sample complexity of the learning algorithm. In order for the problem of distribution learning to be more clear consider the problem of supervised learning as defined in. In this framework of statistical learning theory a training set S = { ( x 1 , y 1 ) , … , ( x n , y n ) } {\displaystyle \textstyle S=\{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}} and the goal is to find a target function f : X → Y {\displaystyle \textstyle f:X\rightarrow Y} that minimizes some loss function, e.g. the square loss function. More formally f = arg min g ∫ V ( y , g ( x ) ) d ρ ( x , y ) {\displaystyle f=\arg \min _{g}\int V(y,g(x))d\rho (x,y)} , where V ( ⋅ , ⋅ ) {\displaystyle V(\cdot ,\cdot )} is the loss function, e.g. V ( y , z ) = ( y − z ) 2 {\displaystyle V(y,z)=(y-z)^{2}} and ρ ( x , y ) {\displaystyle \rho (x,y)} the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution ρ x ( y ) {\displaystyle \rho _{x}(y)} is known then the target function has the closed form f ( x ) = ∫ y y d ρ x ( y ) {\displaystyle f(x)=\int _{y}yd\rho _{x}(y)} . So the set S {\displaystyle S} is a set of samples from the probability distribution ρ ( x , y ) {\displaystyle \rho (x,y)} . Now the goal of distributional learning theory if to find ρ {\displaystyle \rho } given S {\displaystyle S} which can be used to find the target function f {\displaystyle f} . Definition of learnability A class of distributions C {\displaystyle \textstyle C} is called efficiently learnable if for every ϵ > 0 {\displaystyle \textstyle \epsilon >0} and 0 < δ ≤ 1 {\displaystyle \textstyle 0<\delta \leq 1} given access to G E N ( D ) {\displaystyle \textstyle GEN(D)} for an unknown distribution D ∈ C {\displaystyle \textstyle D\in C} , there exists a polynomial time algorithm A {\displaystyle \textstyle A} , called learning algorithm of C {\displaystyle \textstyle C} , that outputs a generator or an evaluator of a distribution D ′ {\displaystyle \textstyle D'} such that Pr [ d ( D , D ′ ) ≤ ϵ ] ≥ 1 − δ {\displaystyle \Pr[d(D,D')\leq \epsilon ]\geq 1-\delta } If we know that D ′ ∈ C {\displaystyle \textstyle D'\in C} then A {\displaystyle \textstyle A} is called proper learning algorithm, otherwise is called improper learning algorithm. In some settings the class of distributions C {\displaystyle \textstyle C} is a class with well known distributions which can be described by a set of parameters. For instance C {\displaystyle \textstyle C} could be the class of all the Gaussian distributions N ( μ , σ 2 ) {\displaystyle \textstyle N(\mu ,\sigma ^{2})} . In this case the algorithm A {\displaystyle \textstyle A} should be able to estimate the parameters μ , σ {\displaystyle \textstyle \mu ,\sigma } . In this case A {\displaystyle \textstyle A} is called parameter learning algorithm. Obviously the parameter learning for simple distributions is a very well studied field that is called statistical estimation and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distributions learning theory deals with learning class of distributions that have more complicated description. == First results == In their seminal work, Kearns et al. deal with the case where A {\displaystyle \textstyle A} is described in term of a finite polynomial sized circuit and they proved the following for some specific classes of distribution. O R {\displaystyle \textstyle OR} gate distributions for this kind of distributions there is no polynomial-sized evaluator, unless # P ⊆ P / poly {\displaystyle \textstyle \#P\subseteq P/{\text{poly}}} . On the other hand, this class is efficiently learnable with generator. Parity gate distributions this class is efficiently learnable with both generator and evaluator. Mixtures of Hamming Balls this class is efficiently learnable with both generator and evaluator. Probabilistic Finite Automata this class is not efficiently learnable with evaluator under the Noisy Parity Assumption which is an impossibility assumption in the PAC learning framAveraged one-dependence estimators
Neural Networks (journal)
Argumentation framework
Distribution learning theory