AI For Np Students

AI For Np Students — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Statistical shape analysis

    Statistical shape analysis

    Statistical shape analysis is an analysis of the geometrical properties of some given set of shapes by statistical methods. For instance, it could be used to quantify differences between male and female gorilla skull shapes, normal and pathological bone shapes, leaf outlines with and without herbivory by insects, etc. Important aspects of shape analysis are to obtain a measure of distance between shapes, to estimate mean shapes from (possibly random) samples, to estimate shape variability within samples, to perform clustering and to test for differences between shapes. One of the main methods used is principal component analysis (PCA). Statistical shape analysis has applications in various fields, including medical imaging, computer vision, computational anatomy, sensor measurement, and geographical profiling. == Landmark-based techniques == In the point distribution model, a shape is determined by a finite set of coordinate points, known as landmark points. These landmark points often correspond to important identifiable features such as the corners of the eyes. Once the points are collected some form of registration is undertaken. This can be a baseline methods used by Fred Bookstein for geometric morphometrics in anthropology. Or an approach like Procrustes analysis which finds an average shape. David George Kendall investigated the statistical distribution of the shape of triangles, and represented each triangle by a point on a sphere. He used this distribution on the sphere to investigate ley lines and whether three stones were more likely to be co-linear than might be expected. Statistical distribution like the Kent distribution can be used to analyse the distribution of such spaces. Alternatively, shapes can be represented by curves or surfaces representing their contours, by the spatial region they occupy. == Shape deformations == Differences between shapes can be quantified by investigating deformations transforming one shape into another. In particular a diffeomorphism preserves smoothness in the deformation. This was pioneered in D'Arcy Thompson's On Growth and Form before the advent of computers. Deformations can be interpreted as resulting from a force applied to the shape. Mathematically, a deformation is defined as a mapping from a shape x to a shape y by a transformation function Φ {\displaystyle \Phi } , i.e., y = Φ ( x ) {\displaystyle y=\Phi (x)} . Given a notion of size of deformations, the distance between two shapes can be defined as the size of the smallest deformation between these shapes. Diffeomorphometry is the focus on comparison of shapes and forms with a metric structure based on diffeomorphisms, and is central to the field of Computational anatomy. Diffeomorphic registration, introduced in the 90's, is now an important player with existing codes bases organized around ANTS, DARTEL, DEMONS, LDDMM, StationaryLDDMM, and FastLDDMM are examples of actively used computational codes for constructing correspondences between coordinate systems based on sparse features and dense images. Voxel-based morphometry (VBM) is an important technology built on many of these principles. Methods based on diffeomorphic flows are also used. For example, deformations could be diffeomorphisms of the ambient space, resulting in the LDDMM (Large Deformation Diffeomorphic Metric Mapping) framework for shape comparison.

    Read more →
  • Vanishing gradient problem

    Vanishing gradient problem

    In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation. In such methods, neural network weights are updated proportional to their partial derivative of the loss function. As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications. These multiplications shrink the gradient magnitude. Consequently, the gradients of earlier weights will be exponentially smaller than the gradients of later weights. This difference in gradient magnitude might introduce instability in the training process, slow it, or halt it entirely. For instance, consider the hyperbolic tangent activation function. The gradients of this function are in range [0,1]. The product of repeated multiplication with such gradients decreases exponentially. The inverse problem, when weight gradients at earlier layers get exponentially larger, is called the exploding gradient problem. Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", which not only affects many-layered feedforward networks, but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time). == Prototypical models == This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio. === Recurrent network model === A generic recurrent network has hidden states h 1 , h 2 , … {\displaystyle h_{1},h_{2},\dots } , inputs u 1 , u 2 , … {\displaystyle u_{1},u_{2},\dots } , and outputs x 1 , x 2 , … {\displaystyle x_{1},x_{2},\dots } . Let it be parameterized by θ {\displaystyle \theta } , so that the system evolves as ( h t , x t ) = F ( h t − 1 , u t , θ ) {\displaystyle (h_{t},x_{t})=F(h_{t-1},u_{t},\theta )} Often, the output x t {\displaystyle x_{t}} is a function of h t {\displaystyle h_{t}} , as some x t = G ( h t ) {\displaystyle x_{t}=G(h_{t})} . The vanishing gradient problem already presents itself clearly when x t = h t {\displaystyle x_{t}=h_{t}} , so we simplify our notation to the special case with: x t = F ( x t − 1 , u t , θ ) {\displaystyle x_{t}=F(x_{t-1},u_{t},\theta )} Now, take its differential: d x t = ∇ θ F ( x t − 1 , u t , θ ) d θ + ∇ x F ( x t − 1 , u t , θ ) d x t − 1 = ∇ θ F ( x t − 1 , u t , θ ) d θ + ∇ x F ( x t − 1 , u t , θ ) [ ∇ θ F ( x t − 2 , u t − 1 , θ ) d θ + ∇ x F ( x t − 2 , u t − 1 , θ ) d x t − 2 ] ⋮ = [ ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ] d θ {\displaystyle {\begin{aligned}dx_{t}&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )dx_{t-1}\\&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )\left[\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )d\theta +\nabla _{x}F(x_{t-2},u_{t-1},\theta )dx_{t-2}\right]\\&\;\;\vdots \\&=\left[\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right]d\theta \end{aligned}}} Training the network requires us to define a loss function to be minimized. Let it be L ( x T , u 1 , … , u T ) {\displaystyle L(x_{T},u_{1},\dots ,u_{T})} , then minimizing it by gradient descent gives Δ θ = − η ⋅ [ ∇ x L ( x T ) ( ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ) ] T {\displaystyle \Delta \theta =-\eta \cdot \left[\nabla _{x}L(x_{T})\left(\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right)\right]^{T}} where η {\displaystyle \eta } is the learning rate. The vanishing/exploding gradient problem appears because there are repeated multiplications, of the form ∇ x F ( x t − 1 , u t , θ ) ∇ x F ( x t − 2 , u t − 1 , θ ) ∇ x F ( x t − 3 , u t − 2 , θ ) ⋯ {\displaystyle \nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\nabla _{x}F(x_{t-3},u_{t-2},\theta )\cdots } ==== Example: recurrent network with sigmoid activation ==== For a concrete example, consider a typical recurrent network defined by x t = F ( x t − 1 , u t , θ ) = W rec σ ( x t − 1 ) + W in u t + b {\displaystyle x_{t}=F(x_{t-1},u_{t},\theta )=W_{\text{rec}}\sigma (x_{t-1})+W_{\text{in}}u_{t}+b} where θ = ( W rec , W in ) {\displaystyle \theta =(W_{\text{rec}},W_{\text{in}})} is the network parameter, σ {\displaystyle \sigma } is the sigmoid activation function, applied to each vector coordinate separately, and b {\displaystyle b} is the bias vector. Then, ∇ x F ( x t − 1 , u t , θ ) = W rec diag ⁡ ( σ ′ ( x t − 1 ) ) {\displaystyle \nabla _{x}F(x_{t-1},u_{t},\theta )=W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-1}))} , and so ∇ x F ( x t − 1 , u t , θ ) ∇ x F ( x t − 2 , u t − 1 , θ ) ⋯ ∇ x F ( x t − k , u t − k + 1 , θ ) = W rec diag ⁡ ( σ ′ ( x t − 1 ) ) W rec diag ⁡ ( σ ′ ( x t − 2 ) ) ⋯ W rec diag ⁡ ( σ ′ ( x t − k ) ) {\displaystyle {\begin{aligned}&\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\cdots \nabla _{x}F(x_{t-k},u_{t-k+1},\theta )\\&=W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-1}))W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-2}))\cdots W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-k}))\end{aligned}}} Since | σ ′ | ≤ 1 {\displaystyle \left|\sigma '\right|\leq 1} , the operator norm of the above multiplication is bounded above by ‖ W rec ‖ k {\displaystyle \left\|W_{\text{rec}}\right\|^{k}} . So if the spectral radius of W rec {\displaystyle W_{\text{rec}}} is γ < 1 {\displaystyle \gamma <1} , then at large k {\displaystyle k} , the above multiplication has operator norm bounded above by γ k → 0 {\displaystyle \gamma ^{k}\to 0} . This is the prototypical vanishing gradient problem. The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall Equation (loss differential): ∇ θ L = ∇ x L ( x T , u 1 , … , u T ) [ ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ] {\displaystyle \nabla _{\theta }L=\nabla _{x}L(x_{T},u_{1},\dots ,u_{T})\left[\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right]} The components of ∇ θ F ( x , u , θ ) {\displaystyle \nabla _{\theta }F(x,u,\theta )} are just components of σ ( x ) {\displaystyle \sigma (x)} and u {\displaystyle u} , so if u t , u t − 1 , … {\displaystyle u_{t},u_{t-1},\dots } are bounded, then ‖ ∇ θ F ( x t − k − 1 , u t − k , θ ) ‖ {\displaystyle \left\|\nabla _{\theta }F(x_{t-k-1},u_{t-k},\theta )\right\|} is also bounded by some M > 0 {\displaystyle M>0} , and so the terms in ∇ θ L {\displaystyle \nabla _{\theta }L} decay as M γ k {\displaystyle M\gamma ^{k}} . This means that, effectively, ∇ θ L {\displaystyle \nabla _{\theta }L} is affected only by the first O ( γ − 1 ) {\displaystyle O(\gamma ^{-1})} terms in the sum. If γ ≥ 1 {\displaystyle \gamma \geq 1} , the above analysis does not quite work. For the prototypical exploding gradient problem, the next model is clearer. === Dynamical systems model === Following (Doya, 1993), consider this one-neuron recurrent network with sigmoid activation: x t + 1 = ( 1 − ε ) x t + ε σ ( w x t + b ) + ε w ′ u t {\displaystyle x_{t+1}=(1-\varepsilon )x_{t}+\varepsilon \sigma (wx_{t}+b)+\varepsilon w'u_{t}} At the small ε {\displaystyle \varepsilon } limit, the dynamics of the network becomes d x d t = − x ( t ) + σ ( w x ( t ) + b ) + w ′ u ( t ) {\displaystyle {\frac {dx}{dt}}=-x(t)+\sigma (wx(t)+b)+w'u(t)} Consider first the autonomous case, with u = 0 {\displaystyle u=0} . Set w = 5.0 {\displaystyle w=5.0} , and vary b {\displaystyle b} in [ − 3 , − 2 ] {\displaystyle [-3,-2]} . As b {\displaystyle b} decreases, the system has 1 stable point, then has 2 stable points and 1 unstable point, and finally has 1 stable point again. Explicitly, the stable points are ( x , b ) = ( x , ln ⁡ ( x 1 − x ) − 5 x ) {\displaystyle (x,b)=\left(x,\ln \left({\frac {x}{1-x}}\right)-5x\right)} . Now consider Δ x ( T ) Δ x ( 0 ) {\displaystyle {\frac {\Delta x(T)}{\Delta x(0)}}} and Δ x ( T ) Δ b {\displaystyle {\frac {\Delta x(T)}{\Delta b}}} , where T {\displaystyle T} is large enough that the system has settled into one of the stable points. If ( x ( 0 ) , b ) {\displaystyle (x(0),b)} puts the system very close to an unstable point, then a tiny variation in x ( 0 ) {\displaystyle x(0)} or b {\displaystyle b} wo

    Read more →
  • Influence diagram

    Influence diagram

    An influence diagram (ID) (also called a relevance diagram, decision diagram or a decision network) is a compact graphical and mathematical representation of a decision situation. It is a generalization of a Bayesian network, in which not only probabilistic inference problems but also decision making problems (following the maximum expected utility criterion) can be modeled and solved. ID was first developed in the mid-1970s by decision analysts with an intuitive semantic that is easy to understand. It is now adopted widely and becoming an alternative to the decision tree which typically suffers from exponential growth in number of branches with each variable modeled. ID is directly applicable in team decision analysis, since it allows incomplete sharing of information among team members to be modeled and solved explicitly. Extensions of ID also find their use in game theory as an alternative representation of the game tree. == Semantics == An ID is a directed acyclic graph with three types (plus one subtype) of node and three types of arc (or arrow) between nodes. Nodes: Decision node (corresponding to each decision to be made) is drawn as a rectangle. Uncertainty node (corresponding to each uncertainty to be modeled) is drawn as an oval. Deterministic node (corresponding to special kind of uncertainty that its outcome is deterministically known whenever the outcome of some other uncertainties are also known) is drawn as a double oval. Value node (corresponding to each component of additively separable Von Neumann-Morgenstern utility function) is drawn as an octagon (or diamond). Arcs: Functional arcs (ending in value node) indicate that one of the components of additively separable utility function is a function of all the nodes at their tails. Conditional arcs (ending in uncertainty node) indicate that the uncertainty at their heads is probabilistically conditioned on all the nodes at their tails. Conditional arcs (ending in deterministic node) indicate that the uncertainty at their heads is deterministically conditioned on all the nodes at their tails. Informational arcs (ending in decision node) indicate that the decision at their heads is made with the outcome of all the nodes at their tails known beforehand. Given a properly structured ID: Decision nodes and incoming information arcs collectively state the alternatives (what can be done when the outcome of certain decisions and/or uncertainties are known beforehand) Uncertainty/deterministic nodes and incoming conditional arcs collectively model the information (what are known and their probabilistic/deterministic relationships) Value nodes and incoming functional arcs collectively quantify the preference (how things are preferred over one another). Alternative, information, and preference are termed decision basis in decision analysis, they represent three required components of any valid decision situation. Formally, the semantic of influence diagram is based on sequential construction of nodes and arcs, which implies a specification of all conditional independencies in the diagram. The specification is defined by the d {\displaystyle d} -separation criterion of Bayesian network. According to this semantic, every node is probabilistically independent on its non-successor nodes given the outcome of its immediate predecessor nodes. Likewise, a missing arc between non-value node X {\displaystyle X} and non-value node Y {\displaystyle Y} implies that there exists a set of non-value nodes Z {\displaystyle Z} , e.g., the parents of Y {\displaystyle Y} , that renders Y {\displaystyle Y} independent of X {\displaystyle X} given the outcome of the nodes in Z {\displaystyle Z} . == Example == Consider the simple influence diagram representing a situation where a decision-maker is planning their vacation. There is 1 decision node (Vacation Activity), 2 uncertainty nodes (Weather Condition, Weather Forecast), and 1 value node (Satisfaction). There are 2 functional arcs (ending in Satisfaction), 1 conditional arc (ending in Weather Forecast), and 1 informational arc (ending in Vacation Activity). Functional arcs ending in Satisfaction indicate that Satisfaction is a utility function of Weather Condition and Vacation Activity. In other words, their satisfaction can be quantified if they know what the weather is like and what their choice of activity is. (Note that they do not value Weather Forecast directly) Conditional arc ending in Weather Forecast indicates their belief that Weather Forecast and Weather Condition can be dependent. Informational arc ending in Vacation Activity indicates that they will only know Weather Forecast, not Weather Condition, when making their choice. In other words, actual weather will be known after they make their choice, and only forecast is what they can count on at this stage. It also follows semantically, for example, that Vacation Activity is independent on (irrelevant to) Weather Condition given Weather Forecast is known. == Applicability to value of information == The above example highlights the power of the influence diagram in representing an extremely important concept in decision analysis known as the value of information. Consider the following three scenarios; Scenario 1: The decision-maker could make their Vacation Activity decision while knowing what Weather Condition will be like. This corresponds to adding extra informational arc from Weather Condition to Vacation Activity in the above influence diagram. Scenario 2: The original influence diagram as shown above. Scenario 3: The decision-maker makes their decision without even knowing the Weather Forecast. This corresponds to removing informational arc from Weather Forecast to Vacation Activity in the above influence diagram. Scenario 1 is the best possible scenario for this decision situation since there is no longer any uncertainty on what they care about (Weather Condition) when making their decision. Scenario 3, however, is the worst possible scenario for this decision situation since they need to make their decision without any hint (Weather Forecast) on what they care about (Weather Condition) will turn out to be. The decision-maker is usually better off (definitely no worse off, on average) to move from scenario 3 to scenario 2 through the acquisition of new information. The most they should be willing to pay for such move is called the value of information on Weather Forecast, which is essentially the value of imperfect information on Weather Condition. The applicability of this simple ID and the value of information concept is tremendous, especially in medical decision making when most decisions have to be made with imperfect information about their patients, diseases, etc. == Related concepts == Influence diagrams are hierarchical and can be defined either in terms of their structure or in greater detail in terms of the functional and numerical relation between diagram elements. An ID that is consistently defined at all levels—structure, function, and number—is a well-defined mathematical representation and is referred to as a well-formed influence diagram (WFID). WFIDs can be evaluated using reversal and removal operations to yield answers to a large class of probabilistic, inferential, and decision questions. More recent techniques have been developed by artificial intelligence researchers concerning Bayesian network inference (belief propagation). An influence diagram having only uncertainty nodes (i.e., a Bayesian network) is also called a relevance diagram. An arc connecting node A to B implies not only that "A is relevant to B", but also that "B is relevant to A" (i.e., relevance is a symmetric relationship).

    Read more →
  • Calibration (statistics)

    Calibration (statistics)

    There are two main uses of the term calibration in statistics that denote special types of statistical inference problems. Calibration can mean a reverse process to regression, where instead of a future dependent variable being predicted from known explanatory variables, a known observation of the dependent variables is used to predict a corresponding explanatory variable; procedures in statistical classification to determine class membership probabilities which assess the uncertainty of a given new observation belonging to each of the already established classes. In addition, calibration is used in statistics with the usual general meaning of calibration. For example, model calibration can be also used to refer to Bayesian inference about the value of a model's parameters, given some data set, or more generally to any type of fitting of a statistical model. As Philip Dawid puts it, "a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent." == In classification == Calibration in classification means transforming classifier scores into class membership probabilities. An overview of calibration methods for two-class and multi-class classification tasks is given by Gebel (2009). A classifier might separate the classes well, but be poorly calibrated, meaning that the estimated class probabilities are far from the true class probabilities. In this case, a calibration step may help improve the estimated probabilities. A variety of metrics exist that are aimed to measure the extent to which a classifier produces well-calibrated probabilities. Foundational work includes the Expected Calibration Error (ECE). Into the 2020s, variants include the Adaptive Calibration Error (ACE) and the Test-based Calibration Error (TCE), which address limitations of the ECE metric that may arise when classifier scores concentrate on narrow subset of the [0,1] range. A 2020s advancement in calibration assessment is the introduction of the Estimated Calibration Index (ECI). The ECI extends the concepts of the Expected Calibration Error (ECE) to provide a more nuanced measure of a model's calibration, particularly addressing overconfidence and underconfidence tendencies. Originally formulated for binary settings, the ECI has been adapted for multiclass settings, offering both local and global insights into model calibration. This framework aims to overcome some of the theoretical and interpretative limitations of existing calibration metrics. Through a series of experiments, Famiglini et al. demonstrate the framework's effectiveness in delivering a more accurate understanding of model calibration levels and discuss strategies for mitigating biases in calibration assessment. An online tool has been proposed to compute both ECE and ECI. The following univariate calibration methods exist for transforming classifier scores into class membership probabilities in the two-class case: Assignment value approach, see Garczarek (2002) Bayes approach, see Bennett (2002) Isotonic regression, see Zadrozny and Elkan (2002) Platt scaling (a form of logistic regression), see Lewis and Gale (1994) and Platt (1999) Bayesian Binning into Quantiles (BBQ) calibration, see Naeini, Cooper, Hauskrecht (2015) Beta calibration, see Kull, Filho, Flach (2017) === In probability prediction and forecasting === In prediction and forecasting, a Brier score is sometimes used to assess prediction accuracy of a set of predictions, specifically that the magnitude of the assigned probabilities track the relative frequency of the observed outcomes. Philip E. Tetlock employs the term "calibration" in this sense in his 2015 book Superforecasting. This differs from accuracy and precision. For example, as expressed by Daniel Kahneman, "if you give all events that happen a probability of .6 and all the events that don't happen a probability of .4, your discrimination is perfect but your calibration is miserable". In meteorology, in particular, as concerns weather forecasting, a related mode of assessment is known as forecast skill. == In regression == The calibration problem in regression is the use of known data on the observed relationship between a dependent variable and an independent variable to make estimates of other values of the independent variable from new observations of the dependent variable. This can be known as "inverse regression"; there is also sliced inverse regression. The following multivariate calibration methods exist for transforming classifier scores into class membership probabilities in the case with classes count greater than two: Reduction to binary tasks and subsequent pairwise coupling, see Hastie and Tibshirani (1998) Dirichlet calibration, see Gebel (2009) === Example === One example is that of dating objects, using observable evidence such as tree rings for dendrochronology or carbon-14 for radiometric dating. The observation is caused by the age of the object being dated, rather than the reverse, and the aim is to use the method for estimating dates based on new observations. The problem is whether the model used for relating known ages with observations should aim to minimise the error in the observation, or minimise the error in the date. The two approaches will produce different results, and the difference will increase if the model is then used for extrapolation at some distance from the known results.

    Read more →
  • Elastic cloud storage

    Elastic cloud storage

    An elastic cloud is a cloud computing offering that provides variable service levels based on changing needs. Elasticity is an attribute that can be applied to most cloud services. It states that the capacity and performance of any given cloud service can expand or contract according to a customer's requirements and that this can potentially be changed automatically as a consequence of some software-driven event or, at worst, can be reconfigured quickly by the customer's infrastructure management team. Elasticity has been described as one of the five main principles of cloud computing by Rosenburg and Mateos in The Cloud at Your Service - Manning 2011. == History == Cloud computing was first described by Gillet and Kapor in 1996; however, the first practical implementation was a consequence of a strategy to leverage Amazon's excess data center capacity. Amazon and other pioneers of the commercial use of this technology were primarily interested in providing a “public” cloud service, whereby they could offer customers the benefits of using the cloud, particularly the utility-based pricing model benefit. Other suppliers followed suit with a range of cloud-based models all offering elasticity as a core component, but these suppliers were only offering this service as an element of their public cloud service. Due to perceived weaknesses in security, or at least a lack of proven compliance, many organizations, particularly in the financial and public sectors, have been slow adopters of cloud technologies. These wary organizations can achieve some of the benefits of cloud computing by adopting private cloud technologies. An alternative form of the elastic cloud has been offered by vendors such as EMC and IBM, whereby the service is based around an enterprise's own infrastructure but still retains elements of elasticity and the potential to bill by consumption. == Description == Elasticity in cloud computing is the ability for the organization to adjust its storage requirements in terms of capacity and processing with respect to operational requirements. This has the following benefits: Operational Benefits - Services can be acquired quickly, meaning that the evolving requirements of the business can be addressed almost immediately, giving an organization a potential agility advantage. A properly implemented elastic system will provision/de-provision according to application demands, so if a particular business has activity spikes then the provision can be enabled to match the demand and the capacity can be re-allocated. Research and Development (R&D) Projects - R&D activities are no longer hindered by a requirement to secure a capex budget prior to a project starting. Capability can simply be provisioned from the cloud and released at the end of the exercise. Testing and Deployment - With most large-scale projects a size test needs to be performed prior to final rollout. By taking advantage of the elasticity of the cloud and creating a full-scale avatar of the proposed production system, realistic data and traffic volumes can be provisioned and released as needed. Expensive Resources Allocated - This will normally apply only in the context where a customer is applying at least some of their own servers as part of a cloud infrastructure, specifically where a business (for performance reasons) has decided to invest in solid-state storage as opposed to spinning platters. There are instances when, due to activity spikes, a less critical process may need to be moved from the high-performance resources to more traditional storage. Server Specification - When a customer has elected to own/lease hardware, they can select and specify servers that are specifically tuned to meet the likely needs of their operation (i.e., directly controlling the cost/benefit equation). Utility Based Payments - There is, of course, a key cost driver in this process, and the notion that you should pay for what you consume is acceptable for many organizations. When hardware capacity is sourced internally, organizations need to over-provision. This applies just as much to traditional outsourcing as it does to capex-related expenditure on in-house servers. Cloud Platform – At the heart of any cloud storage system is the ability to manage hyperscale object storage and a Hadoop Distributed Files System (HDFS). Elastic storage capability is particularly well suited to hyperscale and Hadoop environments, where its capability to rapidly respond to changing circumstances and priorities is essential

    Read more →
  • Triplet loss

    Triplet loss

    Triplet loss is a machine learning loss function widely used in one-shot learning, a setting where models are trained to generalize effectively from limited examples. It was conceived by Google researchers for their prominent FaceNet algorithm for face detection. Triplet loss is designed to support metric learning. Namely, to assist training models to learn an embedding (mapping to a feature space) where similar data points are closer together and dissimilar ones are farther apart, enabling robust discrimination across varied conditions. In the context of face detection, data points correspond to images. == Definition == The loss function is defined using triplets of training points of the form ( A , P , N ) {\displaystyle (A,P,N)} . In each triplet, A {\displaystyle A} (called an "anchor point") denotes a reference point of a particular identity, P {\displaystyle P} (called a "positive point") denotes another point of the same identity in point A {\displaystyle A} , and N {\displaystyle N} (called a "negative point") denotes a point of an identity different from the identity in point A {\displaystyle A} and P {\displaystyle P} . Let x {\displaystyle x} be some point and let f ( x ) {\displaystyle f(x)} be the embedding of x {\displaystyle x} in the finite-dimensional Euclidean space. It shall be assumed that the L2-norm of f ( x ) {\displaystyle f(x)} is unity (the L2 norm of a vector X {\displaystyle X} in a finite dimensional Euclidean space is denoted by ‖ X ‖ {\displaystyle \Vert X\Vert } .) We assemble m {\displaystyle m} triplets of points from the training dataset. The goal of training here is to ensure that, after learning, the following condition (called the "triplet constraint") is satisfied by all triplets ( A ( i ) , P ( i ) , N ( i ) ) {\displaystyle (A^{(i)},P^{(i)},N^{(i)})} in the training data set: ‖ f ( A ( i ) ) − f ( P ( i ) ) ‖ 2 2 + α < ‖ f ( A ( i ) ) − f ( N ( i ) ) ‖ 2 2 {\displaystyle \Vert f(A^{(i)})-f(P^{(i)})\Vert _{2}^{2}+\alpha <\Vert f(A^{(i)})-f(N^{(i)})\Vert _{2}^{2}} The variable α {\displaystyle \alpha } is a hyperparameter called the margin, and its value must be set manually. In the FaceNet system, its value was set as 0.2. Thus, the full form of the function to be minimized is the following: L = ∑ i = 1 m max ( ‖ f ( A ( i ) ) − f ( P ( i ) ) ‖ 2 2 − ‖ f ( A ( i ) ) − f ( N ( i ) ) ‖ 2 2 + α , 0 ) {\displaystyle L=\sum _{i=1}^{m}\max {\Big (}\Vert f(A^{(i)})-f(P^{(i)})\Vert _{2}^{2}-\Vert f(A^{(i)})-f(N^{(i)})\Vert _{2}^{2}+\alpha ,0{\Big )}} == Intuition == A baseline for understanding the effectiveness of triplet loss is the contrastive loss, which operates on pairs of samples (rather than triplets). Training with the contrastive loss pulls embeddings of similar pairs closer together, and pushes dissimilar pairs apart. Its pairwise approach is greedy, as it considers each pair in isolation. Triplet loss innovates by considering relative distances. Its goal is that the embedding of an anchor (query) point be closer to positive points than to negative points (also accounting for the margin). It does not try to further optimize the distances once this requirement is met. This is approximated by simultaneously considering two pairs (anchor-positive and anchor-negative), rather than each pair in isolation. == Triplet "mining" == One crucial implementation detail when training with triplet loss is triplet "mining", which focuses on the smart selection of triplets for optimization. This process adds an additional layer of complexity compared to contrastive loss. A naive approach to preparing training data for the triplet loss involves randomly selecting triplets from the dataset. In general, the set of valid triplets of the form ( A ( i ) , P ( i ) , N ( i ) ) {\displaystyle (A^{(i)},P^{(i)},N^{(i)})} is very large. To speed-up training convergence, it is essential to focus on challenging triplets. In the FaceNet paper, several options were explored, eventually arriving at the following. For each anchor-positive pair, the algorithm considers only semi-hard negatives. These are negatives that violate the triplet requirement (i.e, are "hard"), but lie farther from the anchor than the positive (not too hard). Restated, for each A ( i ) {\displaystyle A^{(i)}} and P ( i ) {\displaystyle P^{(i)}} , they seek N ( i ) {\displaystyle N^{(i)}} such that: The rationale for this design choice is heuristic. It may appear puzzling that the mining process neglects "very hard" negatives (i.e., closer to the anchor than the positive). Experiments conducted by the FaceNet designers found that this often leads to a convergence to degenerate local minima. Triplet mining is performed at each training step, from within the sample points contained in the training batch (this is known as online mining), after embeddings were computed for all points in the batch. While ideally the entire dataset could be used, this is impractical in general. To support a large search space for triplets, the FaceNet authors used very large batches (1800 samples). Batches are constructed by selecting a large number of same-category sample points (40), and randomly selected negatives for them. == Extensions == Triplet loss has been extended to simultaneously maintain a series of distance orders by optimizing a continuous relevance degree with a chain (i.e., ladder) of distance inequalities. This leads to the Ladder Loss, which has been demonstrated to offer performance enhancements of visual-semantic embedding in learning to rank tasks. In Natural Language Processing, triplet loss is one of the loss functions considered for BERT fine-tuning in the SBERT architecture. Other extensions involve specifying multiple negatives (multiple negatives ranking loss).

    Read more →
  • Genetic operator

    Genetic operator

    A genetic operator is an operator used in evolutionary algorithms (EA) to guide the algorithm towards a solution to a given problem. There are three main types of operators (mutation, crossover and selection), which must work in conjunction with one another in order for the algorithm to be successful. Genetic operators are used to create and maintain genetic diversity (mutation operator), combine existing solutions (also known as chromosomes) into new solutions (crossover) and select between solutions (selection). The classic representatives of evolutionary algorithms include genetic algorithms, evolution strategies, genetic programming and evolutionary programming. In his book discussing the use of genetic programming for the optimization of complex problems, computer scientist John Koza has also identified an 'inversion' or 'permutation' operator; however, the effectiveness of this operator has never been conclusively demonstrated and this operator is rarely discussed in the field of genetic programming. For combinatorial problems, however, these and other operators tailored to permutations are frequently used by other EAs. Mutation (or mutation-like) operators are said to be unary operators, as they only operate on one chromosome at a time. In contrast, crossover operators are said to be binary operators, as they operate on two chromosomes at a time, combining two existing chromosomes into one new chromosome. == Operators == Genetic variation is a necessity for the process of evolution. Genetic operators used in evolutionary algorithms are analogous to those in the natural world: survival of the fittest, or selection; reproduction (crossover, also called recombination); and mutation. === Selection === Selection operators give preference to better candidate solutions (chromosomes), allowing them to pass on their 'genes' to the next generation (iteration) of the algorithm. The best solutions are determined using some form of objective function (also known as a 'fitness function' in evolutionary algorithms), before being passed to the crossover operator. Different methods for choosing the best solutions exist, for example, fitness proportionate selection and tournament selection. A further or the same selection operator is used to determine the individuals for being selected to form the next parental generation. The selection operator may also ensure that the best solution(s) from the current generation always become(s) a member of the next generation without being altered; this is known as elitism or elitist selection. === Crossover === Crossover is the process of taking more than one parent solutions (chromosomes) and producing a child solution from them. By recombining portions of good solutions, the evolutionary algorithm is more likely to create a better solution. As with selection, there are a number of different methods for combining the parent solutions, including the edge recombination operator (ERO) and the 'cut and splice crossover' and 'uniform crossover' methods. The crossover method is often chosen to closely match the chromosome's representation of the solution; this may become particularly important when variables are grouped together as building blocks, which might be disrupted by a non-respectful crossover operator. Similarly, crossover methods may be particularly suited to certain problems; the ERO is considered a good option for solving the travelling salesman problem. === Mutation === The mutation operator encourages genetic diversity amongst solutions and attempts to prevent the evolutionary algorithm converging to a local minimum by stopping the solutions becoming too close to one another. In mutating the current pool of solutions, a given solution may change between slightly and entirely from the previous solution. By mutating the solutions, an evolutionary algorithm can reach an improved solution solely through the mutation operator. Again, different methods of mutation may be used; these range from a simple bit mutation (flipping random bits in a binary string chromosome with some low probability) to more complex mutation methods in which genes in the solution are changed, for example by adding a random value from the Gaussian distribution to the current gene value. As with the crossover operator, the mutation method is usually chosen to match the representation of the solution within the chromosome. == Combining operators == While each operator acts to improve the solutions produced by the evolutionary algorithm working individually, the operators must work in conjunction with each other for the algorithm to be successful in finding a good solution. Using the selection operator on its own will tend to fill the solution population with copies of the best solution from the population. If the selection and crossover operators are used without the mutation operator, the algorithm will tend to converge to a local minimum, that is, a good but sub-optimal solution to the problem. Using the mutation operator on its own leads to a random walk through the search space. Only by using all three operators together can the evolutionary algorithm become a noise-tolerant global search algorithm, yielding good solutions to the problem at hand.

    Read more →
  • K-nearest neighbors algorithm

    K-nearest neighbors algorithm

    In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method. It was first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. In classification, a new example is assigned a label based on the labels of its k nearest training examples; in regression, the prediction is computed from the values of those neighbors. Most often, it is used for classification, as a k-NN classifier, the output of which is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. The k-NN algorithm can also be generalized for regression. In k-NN regression, also known as nearest neighbor smoothing, the output is the property value for the object. This value is the average of the values of k nearest neighbors. If k = 1, then the output is simply assigned to the value of that single nearest neighbor, also known as nearest neighbor interpolation. For both classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the average than distant ones. For example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The input consists of the k closest training examples in a data set. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A peculiarity (sometimes even a disadvantage) of the k-NN algorithm is its sensitivity to the local structure of the data. In k-NN classification the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance, if the features represent different physical units or come in vastly different scales, then feature-wise normalizing of the training data can greatly improve its accuracy. == Statistical setting == Suppose we have pairs ( X 1 , Y 1 ) , ( X 2 , Y 2 ) , … , ( X n , Y n ) {\displaystyle (X_{1},Y_{1}),(X_{2},Y_{2}),\dots ,(X_{n},Y_{n})} taking values in R d × { 1 , 2 } {\displaystyle \mathbb {R} ^{d}\times \{1,2\}} , where Y is the class label of X, so that X | Y = r ∼ P r {\displaystyle X|Y=r\sim P_{r}} for r = 1 , 2 {\displaystyle r=1,2} (and probability distributions P r {\displaystyle P_{r}} ). Given some norm ‖ ⋅ ‖ {\displaystyle \|\cdot \|} on R d {\displaystyle \mathbb {R} ^{d}} and a point x ∈ R d {\displaystyle x\in \mathbb {R} ^{d}} , let ( X ( 1 ) , Y ( 1 ) ) , … , ( X ( n ) , Y ( n ) ) {\displaystyle (X_{(1)},Y_{(1)}),\dots ,(X_{(n)},Y_{(n)})} be a reordering of the training data such that ‖ X ( 1 ) − x ‖ ≤ ⋯ ≤ ‖ X ( n ) − x ‖ {\displaystyle \|X_{(1)}-x\|\leq \dots \leq \|X_{(n)}-x\|} . == Algorithm == The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point. A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as large margin nearest neighbor or neighborhood components analysis. A drawback of the basic "majority voting" classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. k-NN can then be applied to the SOM. == Parameter selection == The best choice of k depends upon the data; generally, larger values of k reduces effect of the noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm. The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes. In binary (two class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via bootstrap method. == The 1-nearest neighbor classifier == The most intuitive nearest neighbour type classifier is the one nearest neighbour classifier that assigns a point x to the class of its closest neighbour in the feature space, that is C n 1 n n ( x ) = Y ( 1 ) {\displaystyle C_{n}^{1nn}(x)=Y_{(1)}} . As the size of training data set approaches infinity, the one nearest neighbour classifier guarantees an error rate of no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). == The weighted nearest neighbour classifier == The k-nearest neighbour classifier can be viewed as assigning the k nearest neighbours a weight 1 / k {\displaystyle 1/k} and all others 0 weight. This can be generalised to weighted nearest neighbour classifiers. That is, where the ith nearest neighbour is assigned a weight w n i {\displaystyle w_{ni}} , with ∑ i = 1 n w n i = 1 {\textstyle \sum _{i=1}^{n}w_{ni}=1} . An analogous result on the strong consistency of weighted nearest neighbour classifiers also holds. Let C n w n n {\displaystyle C_{n}^{wnn}} denote the weighted nearest classifier with weights { w n i } i = 1 n {\displaystyle \{w_{ni}\}_{i=1}^{n}} . Subject to regularity conditions, which in asymptotic theory are conditional variables which require assumptions to differentiate among parameters with some criteria. On the class distributions the excess risk has the following asymptotic expansion R R ( C n w n n ) − R R ( C Bayes ) = ( B 1 s n 2 + B 2 t n 2 ) { 1 + o ( 1 ) } , {\displaystyle {\mathcal {R}}_{\mathcal {R}}(C_{n}^{wnn})-{\mathcal {R}}_{\mathcal {R}}(C^{\text{Bayes}})=\left(B_{1}s_{n}^{2}+B_{2}t_{n}^{2}\right)\{1+o(1)\},} for constants B 1 {\displaystyle B_{1}} and B 2 {\displaystyle B_{2}} where s n 2 = ∑ i = 1 n w n i 2 {\displaystyle s_{n}^{2}=\sum _{i=1}^{n}w_{ni}^{2}} and t n = n − 2 / d ∑ i = 1 n w n i { i 1 + 2 / d − ( i − 1 ) 1 + 2 / d } {\displaystyle t_{n}=n^{-2/d}\sum _{i=1}^{n}w_{ni}\left\{i^{1+2/d}-(i-1)^{1+2/d}\right\}} . The optimal weighting scheme { w n i ∗ } i = 1 n {\displaystyle \{w_{ni}^{}\}_{i=1}^{n}} , that balances the two terms in the display above, is given as follows: set k ∗ = ⌊ B n 4 d + 4 ⌋ {\displaystyle k^{}=\lfloor Bn^{\frac {4}{d+4}}\rfloor } , w n i ∗ = 1 k ∗ [ 1 + d 2 − d 2 k ∗ 2 / d { i 1 + 2 / d − ( i − 1 ) 1 + 2 / d } ] {\displaystyle w_{ni}^{}={\frac {1}{k^{}}}\left[1+{\frac {d}{2}}-{\frac {d}{2{k^{}}^{2/d}}}\{i^{1+2/d}-(i-1)^{1+2/d}\}\right]} for i = 1 , 2 , … , k ∗ {\displaystyle i=1,2,\dots ,k^{}} and w n i ∗ = 0 {\displaystyle w_{ni}^{}=0} for i = k ∗ + 1 , … , n {\displaystyle i=k^{}+1,\dots ,n} . With optimal weights the dominant term in the asymptotic expansion of the excess risk is O ( n − 4 d + 4 ) {\displaystyle {\mathcal {O}}(n^{-{\frac {4}{d+4}}})}

    Read more →
  • You.com

    You.com

    You.com is an artificial intelligence search startup that has pivoted away from consumer search engine operations toward business-focused AI tools and APIs. The company was founded in 2020 by Richard Socher, the former chief scientist at Salesforce, and Bryan McCann, a former NLP researcher at Salesforce. == History == Following its 2020 founding, You.com opened its public beta on November 9, 2021, and received $20 million in funding led by Salesforce founder and CEO Marc Benioff. Other investors include Breyer Capital, Sound Ventures, and Day One Ventures. The domain You.com was initially purchased in 1996 by Benioff. Benioff invested in You.com and transferred ownership of the You.com domain name to the company. In July 2022, You.com announced its $25 million Series A funding round led by Radical Ventures with participation from Time Ventures, Breyer Capital, Norwest Venture Partners and Day One Ventures. In September 2024, You.com raised $50 million in Series B funding led by Georgian. In September 2025, You.com raised $100 million in Series C funding led by Cox Enterprises at a $1.5 billion valuation, achieving unicorn status. == Business model == You.com generates revenue primarily through enterprise sales of search APIs and AI tools. The platform provides web search capabilities that can be integrated into enterprise applications and AI agents. == Features == On December 23, 2022, You.com was the first search engine to launch an LLM chatbot with live web results alongside its responses. Initially known as YouChat, the chatbot was primarily based on the GPT-3.5 large language model and could answer questions, suggest ideas, translate text, summarize articles, compose emails, and write code snippets, while staying up-to-date with current events and citing sources. Several further versions of YouChat were released. The second version, called YouChat 2.0, was released on February 7, 2023, incorporated improved conversational AI and community-built applications by blending a large language model named C-A-L (Chat, Apps, and Links). This update enabled YouChat to provide results in various formats, such as charts, photos, videos, tables, graphs, text or code, so users can find answers without leaving the search results page. YouChat 3.0, unveiled on May 4, 2023, combined chat functionality with results from Reddit, TikTok, Stack Overflow and Wikipedia. === YouPro === On June 21, 2023, You.com introduced YouPro, a paid subscription. Both free and paid versions provide access to large language models connected to the internet with citation capabilities. === ARI === In February 2025, You.com launched ARI (Advanced Research and Insights), a deep research agent that scans over 400 sources simultaneously to produce research reports with verified citations and interactive graphs, charts, and visualizations. The platform targets regulated industries where comprehensive source verification is critical, with customers including healthcare publishers and advisory firms. == Reception == You.com was named one of TIME's Best Inventions of 2022. You.com's ARI (Advanced Research & Insights) feature was named one of TIME's Best Inventions of 2025.

    Read more →
  • Promoter based genetic algorithm

    Promoter based genetic algorithm

    The promoter based genetic algorithm (PBGA) is a genetic algorithm for neuroevolution developed by F. Bellas and R.J. Duro in the Integrated Group for Engineering Research (GII) at the University of Coruña, in Spain. It evolves variable size feedforward artificial neural networks (ANN) that are encoded into sequences of genes for constructing a basic ANN unit. Each of these blocks is preceded by a gene promoter acting as an on/off switch that determines if that particular unit will be expressed or not. == PBGA basics == The basic unit in the PBGA is a neuron with all of its inbound connections as represented in the following figure: The genotype of a basic unit is a set of real valued weights followed by the parameters of the neuron and proceeded by an integer valued field that determines the promoter gene value and, consequently, the expression of the unit. By concatenating units of this type we can construct the whole network. With this encoding it is imposed that the information that is not expressed is still carried by the genotype in evolution but it is shielded from direct selective pressure, maintaining this way the diversity in the population, which has been a design premise for this algorithm. Therefore, a clear difference is established between the search space and the solution space, permitting information learned and encoded into the genotypic representation to be preserved by disabling promoter genes. == Results == The PBGA was originally presented within the field of autonomous robotics, in particular in the real time learning of environment models of the robot. It has been used inside the Multilevel Darwinist Brain (MDB) cognitive mechanism developed in the GII for real robots on-line learning. In another paper it is shown how the application of the PBGA together with an external memory that stores the successful obtained world models, is an optimal strategy for adaptation in dynamic environments. Recently, the PBGA has provided results that outperform other neuroevolutionary algorithms in non-stationary problems, where the fitness function varies in time.

    Read more →
  • Tucker decomposition

    Tucker decomposition

    In mathematics, Tucker decomposition decomposes a tensor into a set of matrices and one small core tensor. It is named after Ledyard R. Tucker although it goes back to Hitchcock in 1927. Initially described as a three-mode extension of factor analysis and principal component analysis it may actually be generalized to higher mode analysis, which is also called higher-order singular value decomposition (HOSVD) or the M-mode SVD. The algorithm to which the literature typically refers when discussing the Tucker decomposition or the HOSVD is the M-mode SVD algorithm introduced by Vasilescu and Terzopoulos, but misattributed to Tucker or De Lathauwer etal. It may be regarded as a more flexible PARAFAC (parallel factor analysis) model. In PARAFAC the core tensor is restricted to be "diagonal". In practice, Tucker decomposition is used as a modelling tool. For instance, it is used to model three-way (or higher way) data by means of relatively small numbers of components for each of the three or more modes, and the components are linked to each other by a three- (or higher-) way core array. The model parameters are estimated in such a way that, given fixed numbers of components, the modelled data optimally resemble the actual data in the least squares sense. The model gives a summary of the information in the data, in the same way as principal components analysis does for two-way data. For a 3rd-order tensor T ∈ F n 1 × n 2 × n 3 {\displaystyle T\in F^{n_{1}\times n_{2}\times n_{3}}} , where F {\displaystyle F} is either R {\displaystyle \mathbb {R} } or C {\displaystyle \mathbb {C} } , Tucker Decomposition can be denoted as follows, T = T × 1 U ( 1 ) × 2 U ( 2 ) × 3 U ( 3 ) {\displaystyle T={\mathcal {T}}\times _{1}U^{(1)}\times _{2}U^{(2)}\times _{3}U^{(3)}} where T ∈ F d 1 × d 2 × d 3 {\displaystyle {\mathcal {T}}\in F^{d_{1}\times d_{2}\times d_{3}}} is the core tensor, a 3rd-order tensor that contains the 1-mode, 2-mode and 3-mode singular values of T {\displaystyle T} , which are defined as the Frobenius norm of the 1-mode, 2-mode and 3-mode slices of tensor T {\displaystyle {\mathcal {T}}} respectively. U ( 1 ) , U ( 2 ) , U ( 3 ) {\displaystyle U^{(1)},U^{(2)},U^{(3)}} are unitary matrices in F d 1 × n 1 , F d 2 × n 2 , F d 3 × n 3 {\displaystyle F^{d_{1}\times n_{1}},F^{d_{2}\times n_{2}},F^{d_{3}\times n_{3}}} respectively. The k-mode product (k = 1, 2, 3) of T {\displaystyle {\mathcal {T}}} by U ( k ) {\displaystyle U^{(k)}} is denoted as T × U ( k ) {\displaystyle {\mathcal {T}}\times U^{(k)}} with entries as ( T × 1 U ( 1 ) ) ( i 1 , j 2 , j 3 ) = ∑ j 1 = 1 d 1 T ( j 1 , j 2 , j 3 ) U ( 1 ) ( j 1 , i 1 ) ( T × 2 U ( 2 ) ) ( j 1 , i 2 , j 3 ) = ∑ j 2 = 1 d 2 T ( j 1 , j 2 , j 3 ) U ( 2 ) ( j 2 , i 2 ) ( T × 3 U ( 3 ) ) ( j 1 , j 2 , i 3 ) = ∑ j 3 = 1 d 3 T ( j 1 , j 2 , j 3 ) U ( 3 ) ( j 3 , i 3 ) {\displaystyle {\begin{aligned}({\mathcal {T}}\times _{1}U^{(1)})(i_{1},j_{2},j_{3})&=\sum _{j_{1}=1}^{d_{1}}{\mathcal {T}}(j_{1},j_{2},j_{3})U^{(1)}(j_{1},i_{1})\\({\mathcal {T}}\times _{2}U^{(2)})(j_{1},i_{2},j_{3})&=\sum _{j_{2}=1}^{d_{2}}{\mathcal {T}}(j_{1},j_{2},j_{3})U^{(2)}(j_{2},i_{2})\\({\mathcal {T}}\times _{3}U^{(3)})(j_{1},j_{2},i_{3})&=\sum _{j_{3}=1}^{d_{3}}{\mathcal {T}}(j_{1},j_{2},j_{3})U^{(3)}(j_{3},i_{3})\end{aligned}}} Altogether, the decomposition may also be written more directly as T ( i 1 , i 2 , i 3 ) = ∑ j 1 = 1 d 1 ∑ j 2 = 1 d 2 ∑ j 3 = 1 d 3 T ( j 1 , j 2 , j 3 ) U ( 1 ) ( j 1 , i 1 ) U ( 2 ) ( j 2 , i 2 ) U ( 3 ) ( j 3 , i 3 ) {\displaystyle T(i_{1},i_{2},i_{3})=\sum _{j_{1}=1}^{d_{1}}\sum _{j_{2}=1}^{d_{2}}\sum _{j_{3}=1}^{d_{3}}{\mathcal {T}}(j_{1},j_{2},j_{3})U^{(1)}(j_{1},i_{1})U^{(2)}(j_{2},i_{2})U^{(3)}(j_{3},i_{3})} Taking d i = n i {\displaystyle d_{i}=n_{i}} for all i {\displaystyle i} is always sufficient to represent T {\displaystyle T} exactly, but often T {\displaystyle T} can be compressed or efficiently approximately by choosing d i < n i {\displaystyle d_{i} Read more →

  • Loss function

    Loss function

    In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy. In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century. In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s. In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss. == Examples == === Regret === Leonard J. Savage argued that using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made under circumstances will be known and the decision that was in fact taken before they were known. === Quadratic loss function === The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t {\displaystyle t} , then a quadratic loss function is λ ( x ) = C ( t − x ) 2 {\displaystyle \lambda (x)=C(t-x)^{2}\;} for some constant C {\displaystyle C} ; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1. This is also known as the squared error loss (SEL). Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function. The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used. The quadratic loss assigns more importance to outliers than to the true data due to its square nature, so alternatives like the Huber, log-cosh and SMAE losses are used when the data has many large outliers. === 0-1 loss function === In statistics and decision theory, a frequently used loss function is the 0-1 loss function L ( y ^ , y ) = { 0 if y = y ^ 1 if y ≠ y ^ {\displaystyle L({\hat {y}},y)={\begin{cases}0&{\text{if }}y={\hat {y}}\\1&{\text{if }}y\neq {\hat {y}}\end{cases}}} In information theory, this loss function is known as Hamming distortion. == Constructing loss and objective functions == In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker’s preference must be elicited and represented by a scalar-valued function (called also utility function) in a form suitable for optimization — the problem that Ragnar Frisch has highlighted in his Nobel Prize lecture. The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences. In particular, Andranik Tangian showed that the most usable objective functions — quadratic and additive — are determined by a few indifference points. He used this property in the models for constructing these objective functions from either ordinal or cardinal data that were elicited through computer-assisted interviews with decision makers. Among other things, he constructed objective functions to optimally distribute budgets for 16 Westfalian universities and the European subsidies for equalizing unemployment rates among 271 German regions. == Expected loss == In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable X {\displaystyle X} . === Statistics === Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms. ==== Frequentist expected loss ==== We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, P θ {\displaystyle P_{\theta }} , of the observed data, X {\displaystyle X} . This is also referred to as the risk function of the decision rule δ {\displaystyle \delta } and the parameter θ {\displaystyle \theta } . Here the decision rule depends on the outcome of X {\displaystyle X} . The risk function is given by: R ( θ , δ ) = E θ ⁡ L ( θ , δ ( X ) ) = ∫ X L ( θ , δ ( x ) ) d P θ ( x ) . {\displaystyle R(\theta ,\delta )=\operatorname {E} _{\theta }L{\big (}\theta ,\delta (X){\big )}=\int _{X}L{\big (}\theta ,\delta (x){\big )}\,\mathrm {d} P_{\theta }(x).} Here, θ {\displaystyle \theta } is a fixed but possibly unknown state of nature, X {\displaystyle X} is a vector of observations stochastically drawn from a population, E θ {\displaystyle \operatorname {E} _{\theta }} is the expectation over all population values of X {\displaystyle X} , d P θ {\displaystyle \mathrm {d} P_{\theta }} is a probability measure over the event space of X {\displaystyle X} (parametrized by θ {\displaystyle \theta } ) and the integral is evaluated over the entire support of X {\displaystyle X} . ==== Bayes Risk ==== In a Bayesian approach, the expectation is calculated using the prior distribution π ∗ {\displaystyle \pi ^{}} of the parameter θ {\displaystyle \theta } : ρ ( π ∗ , a ) = ∫ Θ ∫ X L ( θ , a ( x ) ) d P ( x | θ ) d π ∗ ( θ ) = ∫ X ∫ Θ L ( θ , a ( x ) ) d π ∗ ( θ | x ) d M ( x ) {\displaystyle \rho (\pi ^{},a)=\int _{\Theta }\int _{\mathbf {X}}L(\theta ,a({\mathbf {x}}))\,\mathrm {d} P({\mathbf {x}}\vert \theta )\,\mathrm {d} \pi ^{}(\theta )=\int _{\mathbf {X}}\int _{\Theta }L(\theta ,a({\mathbf {x}}))\,\mathrm {d} \pi ^{}(\theta \vert {\mathbf {x}})\,\mathrm {d} M({\mathbf {x}})} where M ( x ) {\displaystyle M(\mathbf {x} )} is known as the predictive likelihood wherein θ {\displaystyle \theta } has been "integrated out," π ∗ ( θ | x ) {\displaystyle \pi ^{}(\theta |\mathbf {x} )} is the posterior distribution, and the order of integration has been changed. One then should choose the action a ∗ {\displaystyle a^{}} which minimises this expected loss, which is referred to as Bayes Risk. In the latter equation, the integrand inside d x {\displaystyle \mathrm {d} x} is known as the Posterior Risk, and minimising it with respect to decision a {\displaystyle a} also minimizes the overall Bayes Risk. This optimal decision, a ∗ {\displaystyle a^{}} is known as the Bayes (decision) Rule - it minimises the average loss over all possible states of nature θ {\displaystyle \theta } , over all possible (probability-weighted) data outcomes. One advantage of the Bayesian approach is to that one need only choose the optimal action under the actual observed data to obtain a uniformly optimal one, whereas choosing the actual frequentist optimal decision rule as a function of all possible observations, is a much more difficult problem. Of equal importance though, the Bayes Rule reflects consideration of loss outcomes under different states of nature, θ {\displaystyle \theta } . ==== Examples in statistics ==== For a scalar parameter θ {\displaystyle \theta } , a decision function whose output θ ^ {\displaystyle {\hat {\theta }}} is an estimate of θ {\displaystyle \theta } , and a quadratic loss function (squared error loss) L ( θ , θ ^ ) = ( θ − θ ^ ) 2 , {\displaystyle L(\theta ,{\hat {\theta }})=(\theta -{\hat {\theta }})^{2},} the risk function becomes the mean squared error of the estimate, R ( θ , θ ^ ) = E θ ⁡ [ ( θ − θ ^ ) 2 ] . {\displaystyle R(\theta ,{\hat {\thet

    Read more →
  • Legal information retrieval

    Legal information retrieval

    Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works. Accurate legal information retrieval is important to provide access to the law to laymen and legal professionals. Its importance has increased because of the vast and quickly increasing amount of legal documents available through electronic means. Legal information retrieval is a part of the growing field of legal informatics. In a legal setting, it is frequently important to retrieve all information related to a specific query. However, commonly used boolean search methods (exact matches of specified terms) on full text legal documents have been shown to have an average recall rate as low as 20 percent, meaning that only 1 in 5 relevant documents are actually retrieved. In that case, researchers believed that they had retrieved over 75% of relevant documents. This may result in failing to retrieve important or precedential cases. In some jurisdictions this may be especially problematic, as legal professionals are ethically obligated to be reasonably informed as to relevant legal documents. Legal Information Retrieval attempts to increase the effectiveness of legal searches by increasing the number of relevant documents (providing a high recall rate) and reducing the number of irrelevant documents (a high precision rate). This is a difficult task, as the legal field is prone to jargon, polysemes (words that have different meanings when used in a legal context), and constant change. Techniques used to achieve these goals generally fall into three categories: boolean retrieval, manual classification of legal text, and natural language processing of legal text. == Problems == Application of standard information retrieval techniques to legal text can be more difficult than application in other subjects. One key problem is that the law rarely has an inherent taxonomy. Instead, the law is generally filled with open-ended terms, which may change over time. This can be especially true in common law countries, where each decided case can subtly change the meaning of a certain word or phrase. Legal information systems must also be programmed to deal with law-specific words and phrases. Though this is less problematic in the context of words which exist solely in law, legal texts also frequently use polysemes, words may have different meanings when used in a legal or common-speech manner, potentially both within the same document. The legal meanings may be dependent on the area of law in which it is applied. For example, in the context of European Union legislation, the term "worker" has four different meanings: Any worker as defined in Article 3(a) of Directive 89/391/EEC who habitually uses display screen equipment as a significant part of his normal work. Any person employed by an employer, including trainees and apprentices but excluding domestic servants; Any person carrying out an occupation on board a vessel, including trainees and apprentices, but excluding port pilots and shore personnel carrying out work on board a vessel at the quayside; Any person who, in the Member State concerned, is protected as an employee under national employment law and in accordance with national practice; It also has the common meaning: A person who works at a specific occupation. Though the terms may be similar, correct information retrieval must differentiate between the intended use and irrelevant uses in order to return the correct results. Even if a system overcomes the language problems inherent in law, it must still determine the relevancy of each result. In the context of judicial decisions, this requires determining the precedential value of the case. Case decisions from senior or superior courts may be more relevant than those from lower courts, even where the lower court's decision contains more discussion of the relevant facts. The opposite may be true, however, if the senior court has only a minor discussion of the topic (for example, if it is a secondary consideration in the case). An information retrieval system must also be aware of the authority of the jurisdiction. A case from a binding authority is most likely of more value than one from a non-binding authority. Additionally, the intentions of the user may determine which cases they find valuable. For instance, where a legal professional is attempting to argue a specific interpretation of law, he might find a minor court's decision which supports his position more valuable than a senior courts position which does not. He may also value similar positions from different areas of law, different jurisdictions, or dissenting opinions. Overcoming these problems can be made more difficult because of the large number of cases available. The number of legal cases available via electronic means is constantly increasing (in 2003, US appellate courts handed down approximately 500 new cases per day), meaning that an accurate legal information retrieval system must incorporate methods of both sorting past data and managing new data. == Techniques == === Boolean searches === Boolean searches, where a user may specify terms such as use of specific words or judgments by a specific court, are the most common type of search available via legal information retrieval systems. They are widely implemented but overcome few of the problems discussed above. The recall and precision rates of these searches vary depending on the implementation and searches analyzed. One study found a basic boolean search's recall rate to be roughly 20%, and its precision rate to be roughly 79%. Another study implemented a generic search (that is, not designed for legal uses) and found a recall rate of 56% and a precision rate of 72% among legal professionals. Both numbers increased when searches were run by non-legal professionals, to a 68% recall rate and 77% precision rate. This is likely explained because of the use of complex legal terms by the legal professionals. === Manual classification === In order to overcome the limits of basic boolean searches, information systems have attempted to classify case laws and statutes into more computer friendly structures. Usually, this results in the creation of an ontology to classify the texts, based on the way a legal professional might think about them. These attempt to link texts on the basis of their type, their value, and/or their topic areas. Most major legal search providers now implement some sort of classification search, such as Westlaw's “Natural Language” or LexisNexis' Headnote searches. Additionally, both of these services allow browsing of their classifications, via Westlaw's West Key Numbers or Lexis' Headnotes. Though these two search algorithms are proprietary and secret, it is known that they employ manual classification of text (though this may be computer-assisted). These systems can help overcome the majority of problems inherent in legal information retrieval systems, in that manual classification has the greatest chances of identifying landmark cases and understanding the issues that arise in the text. In one study, ontological searching resulted in a precision rate of 82% and a recall rate of 97% among legal professionals. The legal texts included, however, were carefully controlled to just a few areas of law in a specific jurisdiction. The major drawback to this approach is the requirement of using highly skilled legal professionals and large amounts of time to classify texts. As the amount of text available continues to increase, some have stated their belief that manual classification is unsustainable. === Natural language processing === In order to reduce the reliance on legal professionals and the amount of time needed, efforts have been made to create a system to automatically classify legal text and queries. Adequate translation of both would allow accurate information retrieval without the high cost of human classification. These automatic systems generally employ Natural Language Processing (NLP) techniques that are adapted to the legal domain, and also require the creation of a legal ontology. Though multiple systems have been postulated, few have reported results. One system, “SMILE,” which attempted to automatically extract classifications from case texts, resulted in an f-measure (which is a calculation of both recall rate and precision) of under 0.3 (compared to perfect f-measure of 1.0). This is probably much lower than an acceptable rate for general usage. Despite the limited results, many theorists predict that the evolution of such systems will eventually replace manual classification systems. === Citation-Based ranking === In the mid-90s the Room 5 case law retrieval project used citation mining for summaries and ranked its search results based on citation type and count. This slightly pre-dated the PageRank algorithm at Stanford which was also a citation-based ranking. Ranking of results was based

    Read more →
  • Error tolerance (PAC learning)

    Error tolerance (PAC learning)

    In PAC learning, error tolerance refers to the ability of an algorithm to learn when the examples received have been corrupted in some way. In fact, this is a very common and important issue since in many applications it is not possible to access noise-free data. Noise can interfere with the learning process at different levels: the algorithm may receive data that have been occasionally mislabeled, or the inputs may have some false information, or the classification of the examples may have been maliciously adulterated. == Notation and the Valiant learning model == In the following, let X {\displaystyle X} be our n {\displaystyle n} -dimensional input space. Let H {\displaystyle {\mathcal {H}}} be a class of functions that we wish to use in order to learn a { 0 , 1 } {\displaystyle \{0,1\}} -valued target function f {\displaystyle f} defined over X {\displaystyle X} . Let D {\displaystyle {\mathcal {D}}} be the distribution of the inputs over X {\displaystyle X} . The goal of a learning algorithm A {\displaystyle {\mathcal {A}}} is to choose the best function h ∈ H {\displaystyle h\in {\mathcal {H}}} such that it minimizes e r r o r ( h ) = P x ∼ D ( h ( x ) ≠ f ( x ) ) {\displaystyle error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))} . Let us suppose we have a function s i z e ( f ) {\displaystyle size(f)} that can measure the complexity of f {\displaystyle f} . Let Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} be an oracle that, whenever called, returns an example x {\displaystyle x} and its correct label f ( x ) {\displaystyle f(x)} . When no noise corrupts the data, we can define learning in the Valiant setting: Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the Valiant setting if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} and a polynomial p ( ⋅ , ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot ,\cdot )} such that for any 0 < ε ≤ 1 {\displaystyle 0<\varepsilon \leq 1} and 0 < δ ≤ 1 {\displaystyle 0<\delta \leq 1} it outputs, in a number of calls to the oracle bounded by p ( 1 ε , 1 δ , n , size ( f ) ) {\displaystyle p\left({\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,{\text{size}}(f)\right)} , a function h ∈ H {\displaystyle h\in {\mathcal {H}}} that satisfies with probability at least 1 − δ {\displaystyle 1-\delta } the condition error ( h ) ≤ ε {\displaystyle {\text{error}}(h)\leq \varepsilon } . In the following we will define learnability of f {\displaystyle f} when data have suffered some modification. == Classification noise == In the classification noise model a noise rate 0 ≤ η < 1 2 {\displaystyle 0\leq \eta <{\frac {1}{2}}} is introduced. Then, instead of Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} that returns always the correct label of example x {\displaystyle x} , algorithm A {\displaystyle {\mathcal {A}}} can only call a faulty oracle Oracle ( x , η ) {\displaystyle {\text{Oracle}}(x,\eta )} that will flip the label of x {\displaystyle x} with probability η {\displaystyle \eta } . As in the Valiant case, the goal of a learning algorithm A {\displaystyle {\mathcal {A}}} is to choose the best function h ∈ H {\displaystyle h\in {\mathcal {H}}} such that it minimizes e r r o r ( h ) = P x ∼ D ( h ( x ) ≠ f ( x ) ) {\displaystyle error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))} . In applications it is difficult to have access to the real value of η {\displaystyle \eta } , but we assume we have access to its upperbound η B {\displaystyle \eta _{B}} . Note that if we allow the noise rate to be 1 / 2 {\displaystyle 1/2} , then learning becomes impossible in any amount of computation time, because every label conveys no information about the target function. Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the classification noise model if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , η ) {\displaystyle {\text{Oracle}}(x,\eta )} and a polynomial p ( ⋅ , ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot ,\cdot )} such that for any 0 ≤ η ≤ 1 2 {\displaystyle 0\leq \eta \leq {\frac {1}{2}}} , 0 ≤ ε ≤ 1 {\displaystyle 0\leq \varepsilon \leq 1} and 0 ≤ δ ≤ 1 {\displaystyle 0\leq \delta \leq 1} it outputs, in a number of calls to the oracle bounded by p ( 1 1 − 2 η B , 1 ε , 1 δ , n , s i z e ( f ) ) {\displaystyle p\left({\frac {1}{1-2\eta _{B}}},{\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,size(f)\right)} , a function h ∈ H {\displaystyle h\in {\mathcal {H}}} that satisfies with probability at least 1 − δ {\displaystyle 1-\delta } the condition e r r o r ( h ) ≤ ε {\displaystyle error(h)\leq \varepsilon } . == Statistical query learning == Statistical Query Learning is a kind of active learning problem in which the learning algorithm A {\displaystyle {\mathcal {A}}} can decide if to request information about the likelihood P f ( x ) {\displaystyle P_{f(x)}} that a function f {\displaystyle f} correctly labels example x {\displaystyle x} , and receives an answer accurate within a tolerance α {\displaystyle \alpha } . Formally, whenever the learning algorithm A {\displaystyle {\mathcal {A}}} calls the oracle Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} , it receives as feedback probability Q f ( x ) {\displaystyle Q_{f(x)}} , such that Q f ( x ) − α ≤ P f ( x ) ≤ Q f ( x ) + α {\displaystyle Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha } . Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the statistical query learning model if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} and polynomials p ( ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot )} , q ( ⋅ , ⋅ , ⋅ ) {\displaystyle q(\cdot ,\cdot ,\cdot )} , and r ( ⋅ , ⋅ , ⋅ ) {\displaystyle r(\cdot ,\cdot ,\cdot )} such that for any 0 < ε ≤ 1 {\displaystyle 0<\varepsilon \leq 1} the following hold: Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} can evaluate P f ( x ) {\displaystyle P_{f(x)}} in time q ( 1 ε , n , s i z e ( f ) ) {\displaystyle q\left({\frac {1}{\varepsilon }},n,size(f)\right)} ; 1 α {\displaystyle {\frac {1}{\alpha }}} is bounded by r ( 1 ε , n , s i z e ( f ) ) {\displaystyle r\left({\frac {1}{\varepsilon }},n,size(f)\right)} A {\displaystyle {\mathcal {A}}} outputs a model h {\displaystyle h} such that e r r ( h ) < ε {\displaystyle err(h)<\varepsilon } , in a number of calls to the oracle bounded by p ( 1 ε , n , s i z e ( f ) ) {\displaystyle p\left({\frac {1}{\varepsilon }},n,size(f)\right)} . Note that the confidence parameter δ {\displaystyle \delta } does not appear in the definition of learning. This is because the main purpose of δ {\displaystyle \delta } is to allow the learning algorithm a small probability of failure due to an unrepresentative sample. Since now Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} always guarantees to meet the approximation criterion Q f ( x ) − α ≤ P f ( x ) ≤ Q f ( x ) + α {\displaystyle Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha } , the failure probability is no longer needed. The statistical query model is strictly weaker than the PAC model: any efficiently SQ-learnable class is efficiently PAC learnable in the presence of classification noise, but there exist efficient PAC-learnable problems such as parity that are not efficiently SQ-learnable. == Malicious classification == In the malicious classification model an adversary generates errors to foil the learning algorithm. This setting describes situations of error burst, which may occur when for a limited time transmission equipment malfunctions repeatedly. Formally, algorithm A {\displaystyle {\mathcal {A}}} calls an oracle Oracle ( x , β ) {\displaystyle {\text{Oracle}}(x,\beta )} that returns a correctly labeled example x {\displaystyle x} drawn, as usual, from distribution D {\displaystyle {\mathcal {D}}} over the input space with probability 1 − β {\displaystyle 1-\beta } , but it returns with probability β {\displaystyle \beta } an example drawn from a distribution that is not related to D {\displaystyle {\mathcal {D}}} . Moreover, this maliciously chosen example may strategically selected by an adversary who has knowledge of f {\displaystyle f} , β {\displaystyle \beta } , D {\displaystyle {\mathcal {D}}} , or the current progress of the learning algorithm. Definition: Given a bound β B < 1 2 {\displaystyle \beta _{B}<{\frac {1}{2}}} for 0 ≤ β < 1 2 {\displaystyle 0\leq \beta <{\frac {1}{2}}} , we say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the malicious classification model, if there exist a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , β ) {\displaystyle {\text{Oracle}}(x,\beta )}

    Read more →
  • Oja's rule

    Oja's rule

    Oja's learning rule, or simply Oja's rule, named after Finnish computer scientist Erkki Oja (Finnish pronunciation: [ˈojɑ], AW-yuh), is a model of how neurons in the brain or in artificial neural networks change connection strength, or learn, over time. It is a modification of the standard Hebb's Rule that, through multiplicative normalization, solves all stability problems and generates an algorithm for principal components analysis. This is a computational form of an effect which is believed to happen in biological neurons. == Theory == Oja's rule requires a number of simplifications to derive, but in its final form it is demonstrably stable, unlike Hebb's rule. It is a single-neuron special case of the Generalized Hebbian Algorithm. However, Oja's rule can also be generalized in other ways to varying degrees of stability and success. === Formula === Consider a simplified model of a neuron y {\displaystyle y} that returns a linear combination of its inputs x using presynaptic weights w: y ( x ) = ∑ j = 1 m x j w j {\displaystyle \,y(\mathbf {x} )~=~\sum _{j=1}^{m}x_{j}w_{j}} Oja's rule defines the change in presynaptic weights w given the output response y {\displaystyle y} of a neuron to its inputs x to be Δ w = w n + 1 − w n = η y n ( x n − y n w n ) , {\displaystyle \,\Delta \mathbf {w} ~=~\mathbf {w} _{n+1}-\mathbf {w} _{n}~=~\eta \,y_{n}(\mathbf {x} _{n}-y_{n}\mathbf {w} _{n}),} where η is the learning rate which can also change with time. Note that the bold symbols are vectors and n defines a discrete time iteration. The rule can also be made for continuous iterations as d w d t = η y ( t ) ( x ( t ) − y ( t ) w ( t ) ) . {\displaystyle \,{\frac {d\mathbf {w} }{dt}}~=~\eta \,y(t)(\mathbf {x} (t)-y(t)\mathbf {w} (t)).} === Derivation === The simplest learning rule known is Hebb's rule, which states in conceptual terms that neurons that fire together, wire together. In component form as a difference equation, it is written Δ w = η y ( x n ) x n {\displaystyle \,\Delta \mathbf {w} ~=~\eta \,y(\mathbf {x} _{n})\mathbf {x} _{n}} , or in scalar form with implicit n-dependence, w i ( n + 1 ) = w i ( n ) + η y ( x ) x i {\displaystyle \,w_{i}(n+1)~=~w_{i}(n)+\eta \,y(\mathbf {x} )x_{i}} , where y(xn) is again the output, this time explicitly dependent on its input vector x. Hebb's rule has synaptic weights approaching infinity with a positive learning rate. We can stop this by normalizing the weights so that each weight's magnitude is restricted between 0, corresponding to no weight, and 1, corresponding to being the only input neuron with any weight. We do this by normalizing the weight vector to be of length one: w i ( n + 1 ) = w i ( n ) + η y ( x ) x i ( ∑ j = 1 m [ w j ( n ) + η y ( x ) x j ] p ) 1 / p {\displaystyle \,w_{i}(n+1)~=~{\frac {w_{i}(n)+\eta \,y(\mathbf {x} )x_{i}}{\left(\sum _{j=1}^{m}[w_{j}(n)+\eta \,y(\mathbf {x} )x_{j}]^{p}\right)^{1/p}}}} . Note that in Oja's original paper, p=2, corresponding to quadrature (root sum of squares), which is the familiar Cartesian normalization rule. However, any type of normalization, even linear, will give the same result without loss of generality. For a small learning rate | η | ≪ 1 {\displaystyle |\eta |\ll 1} the equation can be expanded as a Power series in η {\displaystyle \eta } . w i ( n + 1 ) = w i ( n ) ( ∑ j w j p ( n ) ) 1 / p + η ( y x i ( ∑ j w j p ( n ) ) 1 / p − w i ( n ) ∑ j y x j w j p − 1 ( n ) ( ∑ j w j p ( n ) ) ( 1 + 1 / p ) ) + O ( η 2 ) {\displaystyle \,w_{i}(n+1)~=~{\frac {w_{i}(n)}{\left(\sum _{j}w_{j}^{p}(n)\right)^{1/p}}}~+~\eta \left({\frac {yx_{i}}{\left(\sum _{j}w_{j}^{p}(n)\right)^{1/p}}}-{\frac {w_{i}(n)\sum _{j}yx_{j}w_{j}^{p-1}(n)}{\left(\sum _{j}w_{j}^{p}(n)\right)^{(1+1/p)}}}\right)~+~O(\eta ^{2})} . For small η, our higher-order terms O(η2) go to zero. We again make the specification of a linear neuron, that is, the output of the neuron is equal to the sum of the product of each input and its synaptic weight to the power of p-1, which in the case of p=2 is synaptic weight itself, or y ( x ) = ∑ j = 1 m x j w j p − 1 {\displaystyle \,y(\mathbf {x} )~=~\sum _{j=1}^{m}x_{j}w_{j}^{p-1}} . We also specify that our weights normalize to 1, which will be a necessary condition for stability, so | w | = ( ∑ j = 1 m w j p ) 1 / p = 1 {\displaystyle \,|\mathbf {w} |~=~\left(\sum _{j=1}^{m}w_{j}^{p}\right)^{1/p}~=~1} , which, when substituted into our expansion, gives Oja's rule, or w i ( n + 1 ) = w i ( n ) + η y ( x i − w i ( n ) y ) {\displaystyle \,w_{i}(n+1)~=~w_{i}(n)+\eta \,y(x_{i}-w_{i}(n)y)} . === Stability and PCA === In analyzing the convergence of a single neuron evolving by Oja's rule, one extracts the first principal component, or feature, of a data set. Furthermore, with extensions using the Generalized Hebbian Algorithm, one can create a multi-Oja neural network that can extract as many features as desired, allowing for principal components analysis. A principal component aj is extracted from a dataset x through some associated vector qj, or aj = qj⋅x, and we can restore our original dataset by taking x = ∑ j a j q j {\displaystyle \mathbf {x} ~=~\sum _{j}a_{j}\mathbf {q} _{j}} . In the case of a single neuron trained by Oja's rule, we find the weight vector converges to q1, or the first principal component, as time or number of iterations approaches infinity. We can also define, given a set of input vectors Xi, that its correlation matrix Rij = XiXj has an associated eigenvector given by qj with eigenvalue λj. The variance of outputs of our Oja neuron σ2(n) = ⟨y2(n)⟩ then converges with time iterations to the principal eigenvalue, or lim n → ∞ σ 2 ( n ) = λ 1 {\displaystyle \lim _{n\rightarrow \infty }\sigma ^{2}(n)~=~\lambda _{1}} . These results are derived using Lyapunov function analysis, and they show that Oja's neuron necessarily converges on strictly the first principal component if certain conditions are met in our original learning rule. Most importantly, our learning rate η is allowed to vary with time, but only such that its sum is divergent but its power sum is convergent, that is ∑ n = 1 ∞ η ( n ) = ∞ , ∑ n = 1 ∞ η ( n ) p < ∞ , p > 1 {\displaystyle \sum _{n=1}^{\infty }\eta (n)=\infty ,~~~\sum _{n=1}^{\infty }\eta (n)^{p}<\infty ,~~~p>1} . Our output activation function y(x(n)) is also allowed to be nonlinear and nonstatic, but it must be continuously differentiable in both x and w and have derivatives bounded in time. == Applications == Oja's rule was originally described in Oja's 1982 paper, but the principle of self-organization to which it is applied is first attributed to Alan Turing in 1952. PCA has also had a long history of use before Oja's rule formalized its use in network computation in 1989. The model can thus be applied to any problem of self-organizing mapping, in particular those in which feature extraction is of primary interest. Therefore, Oja's rule has an important place in image and speech processing. It is also useful as it expands easily to higher dimensions of processing, thus being able to integrate multiple outputs quickly. A canonical example is its use in binocular vision. === Biology and Oja's subspace rule === There is clear evidence for both long-term potentiation and long-term depression in biological neural networks, along with a normalization effect in both input weights and neuron outputs. However, while there is no direct experimental evidence yet of Oja's rule active in a biological neural network, a biophysical derivation of a generalization of the rule is possible. Such a derivation requires retrograde signalling from the postsynaptic neuron, which is biologically plausible (see neural backpropagation), and takes the form of Δ w i j ∝ ⟨ x i y j ⟩ − ϵ ⟨ ( c p r e ∗ ∑ k w i k y k ) ⋅ ( c p o s t ∗ y j ) ⟩ , {\displaystyle \Delta w_{ij}~\propto ~\langle x_{i}y_{j}\rangle -\epsilon \left\langle \left(c_{\mathrm {pre} }\sum _{k}w_{ik}y_{k}\right)\cdot \left(c_{\mathrm {post} }y_{j}\right)\right\rangle ,} where as before wij is the synaptic weight between the ith input and jth output neurons, x is the input, y is the postsynaptic output, and we define ε to be a constant analogous the learning rate, and cpre and cpost are presynaptic and postsynaptic functions that model the weakening of signals over time. Note that the angle brackets denote the average and the ∗ operator is a convolution. By taking the pre- and post-synaptic functions into frequency space and combining integration terms with the convolution, we find that this gives an arbitrary-dimensional generalization of Oja's rule known as Oja's Subspace, namely Δ w = C x ⋅ w − w ⋅ C y . {\displaystyle \Delta w~=~Cx\cdot w-w\cdot Cy.}

    Read more →