List of algorithm general topics

This is a list of algorithm general topics.

* Analysis of algorithms
* Ant colony algorithm
* Approximation algorithm
* Best and worst cases
* Big O notation
* Combinatorial search
* Competitive analysis
* Computability theory
* Computational complexity theory
* Embarrassingly parallel problem
* Emergent algorithm
* Evolutionary algorithm
* Fast Fourier transform
* Genetic algorithm
* Graph exploration algorithm
* Heuristic
* Hill climbing
* Implementation
* Las Vegas algorithm
* Lock-free and wait-free algorithms
* Monte Carlo algorithm
* Numerical analysis
* Online algorithm
* Polynomial time approximation scheme
* Problem size
* Pseudorandom number generator
* Quantum algorithm
* Random-restart hill climbing
* Randomized algorithm
* Running time
* Sorting algorithm
* Search algorithm
* Stable algorithm (disambiguation)
* Super-recursive algorithm
* Tree search algorithm

G'MIC

G'MIC (GREYC's Magic for Image Computing) is a free and open-source framework for image processing. It defines a script language that allows the creation of complex macros. Originally usable only through a command-line interface, it is now best known as a GIMP plugin and is also included in Krita. G'MIC is dual-licensed under CECILL-2.1 or CECILL-C.

== Features ==
G'MIC's graphical interface is notable for its noise removal filters, which came from an earlier project called GREYCstoration by the same authors. G'MIC offers many built-in commands for image processing, including basic mathematical manipulations, look-up tables, and filtering operations. More complex macros and pipelines built out of those commands are defined in its library files.

== Interpreters ==
=== Command line ===
G'MIC is primarily a script language callable from a shell. For example, passing the file image.jpg to the interpreter displays the image it contains and allows zooming in to examine values. Several filters can be applied in succession, for example to crop and then resize an image.

=== Graphical interface ===
G'MIC comes with a Qt-based graphical interface, which may be integrated as a GIMP or Krita plugin. It contains several hundred filters written in the G'MIC language, dynamically updated through an internet feed. The interface provides a preview and setting sliders for each filter. G'MIC is one of the most popular GIMP plugins.

=== G'MIC Online ===
Most of the filters available for the graphical interface are also available online.

=== ZArt ===
ZArt is a graphical interface for real-time manipulation of webcam images.

=== libgmic ===
Libgmic is a C++ library that can be linked to third-party applications. It is integrated in Flowblade and Veejay.

Huber loss

In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in data than the squared error loss. A variant for classification is also sometimes used.

== Definition ==
The Huber loss function describes the penalty incurred by an estimation procedure $f$. Huber (1964) defines the loss function piecewise by

$$L_{\delta}(a)=\begin{cases}\frac{1}{2}a^{2} & \text{for } |a|\leq \delta,\\[4pt] \delta\cdot\left(|a|-\frac{1}{2}\delta\right), & \text{otherwise.}\end{cases}$$

This function is quadratic for small values of $a$ and linear for large values, with equal values and slopes of the different sections at the two points where $|a|=\delta$. The variable $a$ often refers to the residuals, that is, to the difference between the observed and predicted values, $a=y-f(x)$, so the former can be expanded to

$$L_{\delta}(y,f(x))=\begin{cases}\frac{1}{2}\left(y-f(x)\right)^{2} & \text{for } \left|y-f(x)\right|\leq \delta,\\[4pt] \delta\cdot\left(\left|y-f(x)\right|-\frac{1}{2}\delta\right), & \text{otherwise.}\end{cases}$$

The Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated. Thus it "smoothens out" the former's corner at the origin.

== Motivation ==
Two very commonly used loss functions are the squared loss, $L(a)=a^{2}$, and the absolute loss, $L(a)=|a|$. The squared loss function results in an arithmetic mean-unbiased estimator, and the absolute-value loss function results in a median-unbiased estimator (in the one-dimensional case, and a geometric median-unbiased estimator for the multi-dimensional case).
The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of $a$'s (as in $\sum_{i=1}^{n}L(a_{i})$), the sample mean is influenced too much by a few particularly large $a$-values when the distribution is heavy tailed. In terms of estimation theory, the asymptotic relative efficiency of the mean is poor for heavy-tailed distributions.

As defined above, the Huber loss function is strongly convex in a uniform neighborhood of its minimum $a=0$; at the boundary of this uniform neighborhood, the Huber loss function has a differentiable extension to an affine function at points $a=-\delta$ and $a=\delta$. These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the quadratic loss function) and the robustness of the median-unbiased estimator (using the absolute value function).

== Pseudo-Huber loss function ==
The Pseudo-Huber loss function can be used as a smooth approximation of the Huber loss function. It combines the best properties of L2 squared loss and L1 absolute loss by being strongly convex when close to the target/minimum and less steep for extreme values. The scale at which the Pseudo-Huber loss function transitions from L2 loss for values close to the minimum to L1 loss for extreme values, and the steepness at extreme values, can be controlled by the $\delta$ value. The Pseudo-Huber loss function ensures that derivatives are continuous for all degrees. It is defined as

$$L_{\delta}(a)=\delta^{2}\left(\sqrt{1+(a/\delta)^{2}}-1\right).$$

As such, this function approximates $a^{2}/2$ for small values of $a$, and approximates a straight line with slope $\delta$ for large values of $a$. While the above is the most common form, other smooth approximations of the Huber loss function also exist.

== Variant for classification ==
For classification purposes, a variant of the Huber loss called modified Huber is sometimes used. Given a prediction $f(x)$ (a real-valued classifier score) and a true binary class label $y\in\{+1,-1\}$, the modified Huber loss is defined as

$$L(y,f(x))=\begin{cases}\max(0,1-y\,f(x))^{2} & \text{for } y\,f(x)>-1,\\[4pt] -4y\,f(x) & \text{otherwise.}\end{cases}$$

The term $\max(0,1-y\,f(x))$ is the hinge loss used by support vector machines; the quadratically smoothed hinge loss is a generalization of $L$.

== Applications ==
The Huber loss function is used in robust statistics, M-estimation and additive modelling.
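The three loss definitions given in this article can be written down almost verbatim. The following is a minimal sketch in plain Python; the function names and the default `delta=1.0` are illustrative assumptions, not part of any particular library.

```python
import math


def huber(a: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for |a| <= delta, linear beyond that."""
    if abs(a) <= delta:
        return 0.5 * a * a
    return delta * (abs(a) - 0.5 * delta)


def pseudo_huber(a: float, delta: float = 1.0) -> float:
    """Smooth approximation: about a**2 / 2 near 0, slope about delta far away."""
    return delta ** 2 * (math.sqrt(1.0 + (a / delta) ** 2) - 1.0)


def modified_huber(y: int, score: float) -> float:
    """Classification variant for a label y in {+1, -1} and a real-valued score."""
    margin = y * score
    if margin > -1.0:
        return max(0.0, 1.0 - margin) ** 2
    return -4.0 * margin


# Residuals of increasing size: the two regression losses agree closely near 0
# and both grow roughly linearly for large residuals.
for residual in (0.1, 1.0, 10.0):
    print(residual, huber(residual), pseudo_huber(residual))
```

Both branches of `huber` give the same value at `|a| = delta`, which is the continuity property stated in the definition; the same holds for `modified_huber` at a margin of -1.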

T-distributed stochastic neighbor embedding

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is based on Stochastic Neighbor Embedding, originally developed by Geoffrey Hinton and Sam Roweis; Laurens van der Maaten and Hinton later proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate. A Riemannian variant is UMAP.

t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biomedical signal processing.

For a data set with $n$ elements, t-SNE runs in $O(n^{2})$ time and requires $O(n^{2})$ space.
== Details ==
Given a set of $N$ high-dimensional objects $\mathbf{x}_{1},\dots,\mathbf{x}_{N}$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$, as follows. For $i\neq j$, define

$$p_{j\mid i}=\frac{\exp(-\lVert\mathbf{x}_{i}-\mathbf{x}_{j}\rVert^{2}/2\sigma_{i}^{2})}{\sum_{k\neq i}\exp(-\lVert\mathbf{x}_{i}-\mathbf{x}_{k}\rVert^{2}/2\sigma_{i}^{2})}$$

and set $p_{i\mid i}=0$. Note that the above denominator ensures $\sum_{j}p_{j\mid i}=1$ for all $i$.

As van der Maaten and Hinton explained: "The similarity of datapoint $x_{j}$ to datapoint $x_{i}$ is the conditional probability, $p_{j|i}$, that $x_{i}$ would pick $x_{j}$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_{i}$."

Now define

$$p_{ij}=\frac{p_{j\mid i}+p_{i\mid j}}{2N}$$

This is motivated as follows: $p_{i}$ and $p_{j}$ from the $N$ samples are estimated as $1/N$, so the conditional probabilities can be written as $p_{i\mid j}=Np_{ij}$ and $p_{j\mid i}=Np_{ji}$. Since $p_{ij}=p_{ji}$, adding the two expressions and solving for $p_{ij}$ yields the previous formula. Also note that $p_{ii}=0$ and $\sum_{i,j}p_{ij}=1$.
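The two definitions above translate directly into code. The following is a minimal plain-Python sketch that assumes the bandwidths $\sigma_{i}$ are already given (here all set to 1 for illustration); the function names and the toy data are hypothetical, and a practical implementation would use vectorized arithmetic.

```python
import math


def conditional_probabilities(points, sigmas):
    """Return cond[i][j] = p_{j|i} for Gaussian kernels with bandwidth sigmas[i]."""
    n = len(points)
    cond = []
    for i in range(n):
        weights = [0.0] * n  # p_{i|i} stays 0
        for j in range(n):
            if j != i:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[j] = math.exp(-sq_dist / (2.0 * sigmas[i] ** 2))
        total = sum(weights)  # the denominator makes each row sum to 1
        cond.append([w / total for w in weights])
    return cond


def joint_probabilities(cond):
    """Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]


# Toy data: three nearby points and one distant point (illustrative only).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
sigmas = [1.0] * len(points)
cond = conditional_probabilities(points, sigmas)
joint = joint_probabilities(cond)
print(sum(map(sum, joint)))  # close to 1, up to rounding
```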
The bandwidth of the Gaussian kernels $\sigma_{i}$ is set, using the bisection method, in such a way that the entropy of the conditional distribution equals a predefined entropy. As a result, the bandwidth is adapted to the density of the data: smaller values of $\sigma_{i}$ are used in denser parts of the data space. The entropy increases with the perplexity of this distribution $P_{i}$; the relation is

$$\operatorname{Perp}(P_{i})=2^{H(P_{i})}$$

where $H(P_{i})$ is the Shannon entropy

$$H(P_{i})=-\sum_{j}p_{j|i}\log_{2}p_{j|i}.$$

The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50."

Since the Gaussian kernel uses the Euclidean distance $\lVert x_{i}-x_{j}\rVert$, it is affected by the curse of dimensionality, and in high-dimensional data, when distances lose the ability to discriminate, the $p_{ij}$ become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.

t-SNE aims to learn a $d$-dimensional map $\mathbf{y}_{1},\dots,\mathbf{y}_{N}$ (with $\mathbf{y}_{i}\in\mathbb{R}^{d}$ and $d$ typically chosen as 2 or 3) that reflects the similarities $p_{ij}$ as well as possible.
To this end, it measures similarities $q_{ij}$ between two points in the map $\mathbf{y}_{i}$ and $\mathbf{y}_{j}$, using a very similar approach. Specifically, for $i\neq j$, define $q_{ij}$ as

$$q_{ij}=\frac{(1+\lVert\mathbf{y}_{i}-\mathbf{y}_{j}\rVert^{2})^{-1}}{\sum_{k}\sum_{l\neq k}(1+\lVert\mathbf{y}_{k}-\mathbf{y}_{l}\rVert^{2})^{-1}}$$

and set $q_{ii}=0$. Here a heavy-tailed Student t-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.

The locations of the points $\mathbf{y}_{i}$ in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution $P$ from the distribution $Q$, that is:

$$\mathrm{KL}\left(P\parallel Q\right)=\sum_{i\neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}}$$

The minimization of the Kullback–Leibler divergence with respect to the points $\mathbf{y}_{i}$ is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.

== Output ==
While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity), so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to appear even in structured data with no clear clustering, and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters.
Thus, interactive exploration may be needed to choose parameters and validate results. It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering.

== Software ==
* A C++ implementation of Barnes-Hut is available on the GitHub account of one of the original authors.
* The R package Rtsne implements t-SNE in R.
* ELKI contains tSNE, also with Barnes-Hut approximation.
* scikit-learn, a popular machine learning library in Python, implements t-SNE with both exact solutions and the Barnes-Hut approximation.
* TensorBoard, the visualization kit associated with TensorFlow, also implements t-SNE (online version).
* The Julia package TSne implements t-SNE.
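Independently of the packages listed above, the perplexity calibration and the low-dimensional similarities from the Details section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the bisection bracket `[1e-3, 1e3]` and the toy inputs are hypothetical choices, and the gradient-descent step itself is not shown.

```python
import math


def row_perplexity(sq_dists, sigma):
    """Perplexity 2**H of p_{j|i}, given squared distances from point i to the others."""
    shift = min(sq_dists)  # shifting the exponents avoids underflow to an all-zero row
    weights = [math.exp(-(d - shift) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection on sigma; the perplexity does not decrease as sigma grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if row_perplexity(sq_dists, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def student_t_similarities(embedding):
    """q_ij from the Student t kernel (one degree of freedom), with q_ii = 0."""
    n = len(embedding)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq = sum((a - b) ** 2 for a, b in zip(embedding[i], embedding[j]))
                kernel[i][j] = 1.0 / (1.0 + sq)
    total = sum(map(sum, kernel))
    return [[k / total for k in row] for row in kernel]


def kl_divergence(p, q):
    """KL(P || Q) summed over entries with p_ij > 0 (the diagonal is skipped)."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0.0)


sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]      # toy squared distances for one point
sigma = find_sigma(sq_dists, target=3.0)     # bandwidth giving perplexity 3
q = student_t_similarities([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
```

The target perplexity must lie strictly between 1 and the number of other points, otherwise no bandwidth in the bracket can reach it.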

Universal approximation theorem

In the field of machine learning, the universal approximation theorems (UATs) state that neural networks with a certain structure can, in principle, approximate any continuous function to any desired degree of accuracy. These theorems provide a mathematical justification for using neural networks, assuring researchers that a sufficiently large or deep network can model the complex, non-linear relationships often found in real-world data.

The best-known version of the theorem applies to feedforward networks with a single hidden layer. It states that if the layer's activation function is non-polynomial (which is true for common choices like the sigmoid function or ReLU), then the network can act as a "universal approximator." Universality is achieved by increasing the number of neurons in the hidden layer, making the network "wider." Other versions of the theorem show that universality can also be achieved by keeping the network's width fixed but increasing its number of layers, making it "deeper."

These are existence theorems. They guarantee that a network with the right structure exists, but they do not provide a method for finding the network's parameters (training it), nor do they specify exactly how large the network must be for a given function. Finding a suitable network remains a practical challenge that is typically addressed with optimization algorithms like backpropagation.

== Setup ==
Artificial neural networks are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued vectors to real-valued vectors. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces.

Most universal approximation theorems are in one of two classes.
The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("arbitrary width" case) and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("arbitrary depth" case). In addition to these two classes, there are also universal approximation theorems for neural networks with a bounded number of hidden layers and a limited number of neurons in each layer ("bounded depth and bounded width" case).

== History ==
=== Arbitrary width ===
The first results concerned the arbitrary width case. Ken-ichi Funahashi (May 1989) showed that Rumelhart–Hinton–Williams type backpropagation networks possess universal approximation capability with a class of sigmoidal activation functions, extending the result to multi-output mappings as well. Kurt Hornik, Maxwell Stinchcombe, and Halbert White (July 1989) showed that multilayer feed-forward networks with as few as one hidden layer are universal approximators, provided that the activation function satisfies certain conditions. George Cybenko (December 1989) independently established a related result for sigmoid activation functions using functional-analytic methods. Hornik also showed in 1991 that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno et al. in 1993 and later Allan Pinkus in 1999 showed that the universal approximation property is equivalent to having a nonpolynomial activation function.

=== Arbitrary depth ===
The arbitrary depth case was also studied by a number of authors, such as Gustaf Gripenberg in 2003; Dmitry Yarotsky; Zhou Lu et al. in 2017; and Boris Hanin and Mark Sellke in 2018, who focused on neural networks with the ReLU activation function.
In 2020, Patrick Kidger and Terry Lyons extended those results to neural networks with general activation functions such as tanh or GeLU.

One special case of arbitrary depth is that each composition component comes from a finite set of mappings. In 2024, Cai constructed a finite set of mappings, named a vocabulary, such that any continuous function can be approximated by composing a sequence from the vocabulary. This is similar to the concept of compositionality in linguistics, which is the idea that a finite vocabulary of basic elements can be combined via grammar to express an infinite range of meanings.

=== Bounded depth and bounded width ===
The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999. They showed that there exists an analytic sigmoidal activation function such that neural networks with two hidden layers and a bounded number of units in the hidden layers are universal approximators. In 2018, Guliyev and Ismailov constructed a smooth sigmoidal activation function providing the universal approximation property for feedforward neural networks with two hidden layers and fewer units in the hidden layers. In 2018, they also constructed single-hidden-layer networks with bounded width that are still universal approximators for univariate functions. However, this does not apply to multivariable functions. In 2022, Shen et al. obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.

=== Quantitative bounds ===
The question of the minimal possible width for universality was first studied in 2021, when Park et al. obtained the minimum width required for the universal approximation of $L^{p}$ functions using feed-forward neural networks with ReLU as activation functions. Similar results that can be directly applied to residual neural networks were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using control-theoretic arguments.
In 2023, Cai obtained the optimal minimum width bound for the universal approximation. For the arbitrary depth case, Leonie Papon and Anastasis Kratsios derived explicit depth estimates depending on the regularity of the target function and of the activation function.

=== Kolmogorov network ===
The Kolmogorov–Arnold representation theorem is similar in spirit. Indeed, certain neural network families can directly apply the Kolmogorov–Arnold theorem to yield a universal approximation theorem. Robert Hecht-Nielsen showed that a three-layer neural network can approximate any continuous multivariate function. This was extended to the discontinuous case by Vugar Ismailov. In 2024, Ziming Liu and co-authors showed a practical application.

=== Reservoir computing and quantum reservoir computing ===
In reservoir computing, a sparse recurrent neural network with fixed weights, equipped with fading memory and the echo state property, is followed by a trainable output layer. Its universality has been demonstrated separately for networks of rate neurons and for networks of spiking neurons. In 2024, the framework was generalized and extended to quantum reservoirs, where the reservoir is based on qubits defined over Hilbert spaces.

=== Variants ===
Variants include discontinuous activation functions, noncompact domains, certifiable networks, random neural networks, and alternative network architectures and topologies.

The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. For input dimension $d_{x}$ and output dimension $d_{y}$, the minimum width required for the universal approximation of the $L^{p}$ functions is exactly $\max\{d_{x}+1,d_{y}\}$ (for a ReLU network). More generally, this also holds if both ReLU and a threshold activation function are used.
Universal function approximation on graphs (or rather on graph isomorphism classes) by popular graph convolutional neural networks (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test. In 2020, a universal approximation theorem result was established by Brüel-Gabrielsson, showing that graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying $\mathcal{O}(|V|\cdot|E|)$-runtime method that performed at state of the art on a collection of benchmarks (where $V$ and $E$ are the sets of nodes and edges of the graph respectively).

There are also a variety of results between non-Euclidean spaces and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the convolutional neural network (CNN) architecture, radial basis functions, or neural networks with specific properties.

== Arbitrary-width case ==
A universal approximation theorem formally states that a family of neural network funct

Image-based modeling and rendering

In computer graphics and computer vision, image-based modeling and rendering (IBMR) methods rely on a set of two-dimensional images of a scene to generate a three-dimensional model and then render some novel views of this scene.

The traditional approach of computer graphics has been to create a geometric model in 3D and reproject it onto a two-dimensional image. Computer vision, conversely, is mostly focused on detecting, grouping, and extracting features (edges, faces, etc.) present in a given picture and then trying to interpret them as three-dimensional clues. Image-based modeling and rendering allows the use of multiple two-dimensional images to directly generate novel two-dimensional images, skipping the manual modeling stage.

== Light modeling ==
Instead of considering only the physical model of a solid, IBMR methods usually focus more on light modeling. The fundamental concept behind IBMR is the plenoptic illumination function, which is a parametrisation of the light field. The plenoptic function describes the light rays contained in a given volume. It can be represented with seven dimensions: a ray is defined by its position $(x,y,z)$, its orientation $(\theta,\phi)$, its wavelength $(\lambda)$ and its time $(t)$: $P(x,y,z,\theta,\phi,\lambda,t)$. IBMR methods try to approximate the plenoptic function to render a novel set of two-dimensional images from another. Given the high dimensionality of this function, practical methods place constraints on the parameters in order to reduce the number of dimensions (typically to between 2 and 4).
== IBMR methods and algorithms ==
* View morphing generates a transition between images.
* Panoramic imaging renders panoramas using image mosaics of individual still images.
* Lumigraph relies on a dense sampling of a scene.
* Space carving generates a 3D model based on a photo-consistency check.

State–action–reward–state–action

State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note with the name "Modified Connectionist Q-Learning" (MCQ-L). The alternative name SARSA, proposed by Rich Sutton, was only mentioned as a footnote.

This name reflects the fact that the main function for updating the Q-value depends on the current state of the agent "S1", the action the agent chooses "A1", the reward "R2" the agent gets for choosing this action, the state "S2" that the agent enters after taking that action, and finally the next action "A2" the agent chooses in its new state. The acronym for the quintuple $(S_{t},A_{t},R_{t+1},S_{t+1},A_{t+1})$ is SARSA. Some authors use a slightly different convention and write the quintuple $(S_{t},A_{t},R_{t},S_{t+1},A_{t+1})$, depending on which time step the reward is formally assigned. The rest of the article uses the former convention.

== Algorithm ==
$$Q^{\text{new}}(S_{t},A_{t})\leftarrow(1-\alpha)\,Q(S_{t},A_{t})+\alpha\,[R_{t+1}+\gamma\,Q(S_{t+1},A_{t+1})]$$

A SARSA agent interacts with the environment and updates the policy based on actions taken, hence this is known as an on-policy learning algorithm. The Q value for a state-action is updated by an error, adjusted by the learning rate $\alpha$. Q values represent the possible reward received in the next time step for taking action $a$ in state $s$, plus the discounted future reward received from the next state-action observation.

Watkins's Q-learning updates an estimate of the optimal state-action value function $Q^{*}$ based on the maximum reward of available actions.
While SARSA learns the Q values associated with the policy it follows itself, Watkins's Q-learning learns the Q values associated with the optimal policy while following an exploration/exploitation policy.

Some optimizations of Watkins's Q-learning may be applied to SARSA.

== Hyperparameters ==
=== Learning rate (alpha) ===
The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information.

=== Discount factor (gamma) ===
The discount factor determines the importance of future rewards. A discount factor of 0 makes the agent "opportunistic", or "myopic", i.e., it considers only current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the $Q$ values may diverge.

=== Initial conditions (Q(S0, A0)) ===
Since SARSA is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. A high (infinite) initial value, also known as "optimistic initial conditions", can encourage exploration: no matter what action takes place, the update rule causes it to have a lower value than the untried alternatives, thus increasing their choice probability. In 2013 it was suggested that the first reward $r$ could be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of $Q$. This allows immediate learning in case of fixed deterministic rewards. This resetting-of-initial-conditions (RIC) approach seems to be consistent with human behavior in repeated binary choice experiments.
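The update rule from the Algorithm section can be exercised on a toy problem. The sketch below is illustrative only: the corridor environment, the epsilon-greedy policy with random tie-breaking, and all names and parameter values are assumptions made for this example, not part of the original proposal.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Explore with probability epsilon; otherwise act greedily, ties broken at random."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q[(state, a)] for a in actions)
    return rng.choice([a for a in actions if q[(state, a)] == best])


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma Q(s',a')]."""
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] = (1.0 - alpha) * q[(s, a)] + alpha * target


def run_episode(q, rng, alpha=0.5, gamma=0.9, epsilon=0.1, length=5):
    """Hypothetical corridor: states 0..length, reward 1 on reaching the right end."""
    actions = (-1, +1)
    s = 0
    a = epsilon_greedy(q, s, actions, epsilon, rng)
    while s != length:
        s_next = max(0, s + a)                # the wall at 0 blocks further left moves
        r = 1.0 if s_next == length else 0.0
        # On-policy: the next action is chosen by the same policy that is being learned.
        a_next = epsilon_greedy(q, s_next, actions, epsilon, rng)
        sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma)
        s, a = s_next, a_next


rng = random.Random(0)
q = defaultdict(float)                        # initial condition: all Q values are 0
for _ in range(200):
    run_episode(q, rng)
print(q[(4, +1)], q[(4, -1)])
```

Because the terminal state is never updated, its Q values stay at 0, so the last transition of each episode bootstraps from the reward alone.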