Dissociated press is a parody generator (a computer program that generates nonsensical text). The generated text is based on another text using the Markov chain technique. The name is a play on "Associated Press" and the psychological term dissociation (although word salad is more typical of conditions like aphasia and schizophrenia – which is, however, frequently confused with dissociative identity disorder by laypeople). An implementation of the algorithm is available in Emacs. Another implementation is available as a Perl module in CPAN, Games::Dissociate. == The algorithm == The algorithm starts by printing a number of consecutive words (or letters) from the source text. Then it searches the source text for an occurrence of the few last words or letters printed out so far. If multiple occurrences are found, it picks a random one, and proceeds with printing the text following the chosen occurrence. After a predetermined length of text is printed out, the search procedure is repeated for the newly printed ending. Considering that words and phrases tend to appear in specific grammatical contexts, the resulting text usually seems correct grammatically, and if the source text is uniform in style, the result appears to be of similar style and subject, and takes some effort on the reader's side to recognize as not genuine. Still, the randomness of the assembly process deprives it of any logical flow - the loosely related parts are connected in a nonsensical way, creating a humorously abstract, random result. == Examples == Here is a short example of word-based Dissociated Press applied to the Jargon File: wart: n. A small, crocky feature that sticks out of an array (C has no checks for this). This is relatively benign and easy to spot if the phrase is bent so as to be not worth paying attention to the medium in question. Here is a short example of letter-based Dissociated Press applied to the same source: window sysIWYG: n. A bit was named aften /bee´t@/ prefer to use the other guy's re, especially in every cast a chuckle on neithout getting into useful informash speech makes removing a featuring a move or usage actual abstractionsidered interj. Indeed spectace logic or problem! == History == The dissociated press algorithm is described in HAKMEM (1972) Item #176. The name "dissociated press" is first known to have been associated with the Emacs implementation. Brian Hayes discussed a Travesty algorithm in Scientific American in November 1983. The article provided a garbled William Faulkner passage: When he got on the table, he come in. He never come out of my own pocket as a measure of protecting the company against riot and bloodshed. And when he said. "You tell me a bus ticket, let alone write out no case histories. Then the law come back with a knife!" Hugh Kenner and Joseph O'Rourke of Johns Hopkins University discussed their frequency table-based Travesty generator for microcomputers in BYTE in November 1984. The article included the Turbo Pascal source for two versions of the generator, one using Hayes' algorithm and another using Claude Shannon's Hellbat algorithm. Murray Lesser offered a compiled BASIC version in the magazine in July 1985, in September 1985 Peter Wayner offered a version that used tree data structures instead of frequency tables, and in December 1985 Neil J. Rubenking offered a version written in Turbo Pascal that stored frequency information in a B-tree.
Prequel (mobile application)
Prequel, Inc. is an American technology company and mobile app developer known for developing the Prequel mobile application, which enables editing photos and videos with filters and effects generated using artificial intelligence. Prequel was founded in 2018 by Serge Aliseenko and Timur Khabirov, who currently serves as the company's CEO. It is headquartered in New York City. As of August 2022, it had been downloaded more than 100 million times. == History == In 2016, entrepreneur Timur Khabirov and investor Serge Aliseenko registered a US corporation named AIAR Labs Inc, which was developing AR solutions as an outsourced contractor. Of several proprietary products, Prequel was selected for beta-testing as a product focused on editing photos and videos. In 2018, Prequel was released on the Apple App Store. The launch cost $3 million USD, financed with the founders’ personal funds. The first release included approximately 10 filters for photos and the same amount of effects that augmented images with rose petals, rain and snow, VHS and film reel simulations, glitch, grain, sun puddles, and lomography. By June 2020, the app had also been released for Android. In 2021, Prequel founders Timur Khabirov and Serge Aliseenko launched a venture studio for startups working with artificial, computer vision, and AR-based visual art. In December 2022, Prequel reached the number 14 slot on the global rankings for Apple App Store’s Top Charts and the number 5 slot on the App Store’s U.S. charts. In March 2023, Prequel launched a new app called Artique, which is an AI-powered image editing app for businesses. Artique provides advertising and marketing graphic design using ready-made templates that users can customize, while giving suggestions and visual cues through artificial intelligence. Prequel was also one of the companies participating in discussions about artificial intelligence at SXSW 2023. == Features == Prequel describes its app as an "Aesthetic Pic Editor. The app uses artificial intelligence to create and edit content. Prequel can be used to touch up faces on images and videos and can also tie various decorative elements to certain points on the human body and face. Prequel filters include the "Cartoon" filter, which converts selfies into cartoon-style pictures. Other filters include Kidcore, Dust, Grain, Fisheye, Retro Style, Miami, Disco, and VHS-style filters, as well as the ability to create Renaissance-style pictures. Prequel also gives users the ability to apply color correction tools and to make moving images with 3D effects out of 2D images. Prequel allows users to take photos and videos directly through the app and apply filters and effects in real time. The app also comes with manual editing options for photos, such as adjusting the brightness and/or exposure and cropping photos, as well as an option to automatically apply adjustments. The Prequel app uses the Core ML, MNN, and TFLight frameworks to work with its neural networks. Some AI solutions are launched server-side, and some on the user's mobile device. A resulting photo or video edited with the app is called "a prequel." The app daily generates over 2 million such prequels, which are published by users in Instagram, TikTok, and other social media. As of 2022, the app has more than 800 filters and effects, along with video templates and support for GIFs and stickers. Prequel is free-to-use, but has a premium version that gives users access to more effects, filters, and beauty tools. Since its launch in 2018, Prequel has been downloaded more than 100 million times.
Mathematics of neural networks in machine learning
An artificial neural network (ANN) or neural network combines biological principles with advanced statistics to solve problems in domains such as pattern recognition and game-play. ANNs adopt the basic model of neuron analogues connected to each other in a variety of ways. == Structure == === Neuron === A neuron with label j {\displaystyle j} receiving an input p j ( t ) {\displaystyle p_{j}(t)} from predecessor neurons consists of the following components: an activation a j ( t ) {\displaystyle a_{j}(t)} , the neuron's state, depending on a discrete time parameter, an optional threshold θ j {\displaystyle \theta _{j}} , which stays fixed unless changed by learning, an activation function f {\displaystyle f} that computes the new activation at a given time t + 1 {\displaystyle t+1} from a j ( t ) {\displaystyle a_{j}(t)} , θ j {\displaystyle \theta _{j}} and the net input p j ( t ) {\displaystyle p_{j}(t)} giving rise to the relation a j ( t + 1 ) = f ( a j ( t ) , p j ( t ) , θ j ) , {\displaystyle a_{j}(t+1)=f(a_{j}(t),p_{j}(t),\theta _{j}),} and an output function f out {\displaystyle f_{\text{out}}} computing the output from the activation o j ( t ) = f out ( a j ( t ) ) . {\displaystyle o_{j}(t)=f_{\text{out}}(a_{j}(t)).} Often the output function is simply the identity function. An input neuron has no predecessor but serves as input interface for the whole network. Similarly an output neuron has no successor and thus serves as output interface of the whole network. === Propagation function === The propagation function computes the input p j ( t ) {\displaystyle p_{j}(t)} to the neuron j {\displaystyle j} from the outputs o i ( t ) {\displaystyle o_{i}(t)} and typically has the form p j ( t ) = ∑ i o i ( t ) w i j . {\displaystyle p_{j}(t)=\sum _{i}o_{i}(t)w_{ij}.} === Bias === A bias term can be added, changing the form to the following: p j ( t ) = ∑ i o i ( t ) w i j + w 0 j , {\displaystyle p_{j}(t)=\sum _{i}o_{i}(t)w_{ij}+w_{0j},} where w 0 j {\displaystyle w_{0j}} is a bias. == Neural networks as functions == Neural network models can be viewed as defining a function that takes an input (observation) and produces an output (decision) f : X → Y {\displaystyle \textstyle f:X\rightarrow Y} or a distribution over X {\displaystyle \textstyle X} or both X {\displaystyle \textstyle X} and Y {\displaystyle \textstyle Y} . Sometimes models are intimately associated with a particular learning rule. A common use of the phrase "ANN model" is really the definition of a class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons, number of layers or their connectivity). Mathematically, a neuron's network function f ( x ) {\displaystyle \textstyle f(x)} is defined as a composition of other functions g i ( x ) {\displaystyle \textstyle g_{i}(x)} , that can further be decomposed into other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between functions. A widely used type of composition is the nonlinear weighted sum, where f ( x ) = K ( ∑ i w i g i ( x ) ) {\displaystyle \textstyle f(x)=K\left(\sum _{i}w_{i}g_{i}(x)\right)} , where K {\displaystyle \textstyle K} (commonly referred to as the activation function) is some predefined function, such as the hyperbolic tangent, sigmoid function, softmax function, or rectifier function. The important characteristic of the activation function is that it provides a smooth transition as input values change, i.e. a small change in input produces a small change in output. The following refers to a collection of functions g i {\displaystyle \textstyle g_{i}} as a vector g = ( g 1 , g 2 , … , g n ) {\displaystyle \textstyle g=(g_{1},g_{2},\ldots ,g_{n})} . This figure depicts such a decomposition of f {\displaystyle \textstyle f} , with dependencies between variables indicated by arrows. These can be interpreted in two ways. The first view is the functional view: the input x {\displaystyle \textstyle x} is transformed into a 3-dimensional vector h {\displaystyle \textstyle h} , which is then transformed into a 2-dimensional vector g {\displaystyle \textstyle g} , which is finally transformed into f {\displaystyle \textstyle f} . This view is most commonly encountered in the context of optimization. The second view is the probabilistic view: the random variable F = f ( G ) {\displaystyle \textstyle F=f(G)} depends upon the random variable G = g ( H ) {\displaystyle \textstyle G=g(H)} , which depends upon H = h ( X ) {\displaystyle \textstyle H=h(X)} , which depends upon the random variable X {\displaystyle \textstyle X} . This view is most commonly encountered in the context of graphical models. The two views are largely equivalent. In either case, for this particular architecture, the components of individual layers are independent of each other (e.g., the components of g {\displaystyle \textstyle g} are independent of each other given their input h {\displaystyle \textstyle h} ). This naturally enables a degree of parallelism in the implementation. Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted in the manner shown at the top of the figure, where f {\displaystyle \textstyle f} is shown as dependent upon itself. However, an implied temporal dependence is not shown. == Backpropagation == Backpropagation training algorithms fall into three categories: steepest descent (with variable learning rate and momentum, resilient backpropagation); quasi-Newton (Broyden–Fletcher–Goldfarb–Shanno, one step secant); Levenberg–Marquardt and conjugate gradient (Fletcher–Reeves update, Polak–Ribiére update, Powell–Beale restart, scaled conjugate gradient). === Algorithm === Let N {\displaystyle N} be a network with e {\displaystyle e} connections, m {\displaystyle m} inputs and n {\displaystyle n} outputs. Below, x 1 , x 2 , … {\displaystyle x_{1},x_{2},\dots } denote vectors in R m {\displaystyle \mathbb {R} ^{m}} , y 1 , y 2 , … {\displaystyle y_{1},y_{2},\dots } vectors in R n {\displaystyle \mathbb {R} ^{n}} , and w 0 , w 1 , w 2 , … {\displaystyle w_{0},w_{1},w_{2},\ldots } vectors in R e {\displaystyle \mathbb {R} ^{e}} . These are called inputs, outputs and weights, respectively. The network corresponds to a function y = f N ( w , x ) {\displaystyle y=f_{N}(w,x)} which, given a weight w {\displaystyle w} , maps an input x {\displaystyle x} to an output y {\displaystyle y} . In supervised learning, a sequence of training examples ( x 1 , y 1 ) , … , ( x p , y p ) {\displaystyle (x_{1},y_{1}),\dots ,(x_{p},y_{p})} produces a sequence of weights w 0 , w 1 , … , w p {\displaystyle w_{0},w_{1},\dots ,w_{p}} starting from some initial weight w 0 {\displaystyle w_{0}} , usually chosen at random. These weights are computed in turn: first compute w i {\displaystyle w_{i}} using only ( x i , y i , w i − 1 ) {\displaystyle (x_{i},y_{i},w_{i-1})} for i = 1 , … , p {\displaystyle i=1,\dots ,p} . The output of the algorithm is then w p {\displaystyle w_{p}} , giving a new function x ↦ f N ( w p , x ) {\displaystyle x\mapsto f_{N}(w_{p},x)} . The computation is the same in each step, hence only the case i = 1 {\displaystyle i=1} is described. w 1 {\displaystyle w_{1}} is calculated from ( x 1 , y 1 , w 0 ) {\displaystyle (x_{1},y_{1},w_{0})} by considering a variable weight w {\displaystyle w} and applying gradient descent to the function w ↦ E ( f N ( w , x 1 ) , y 1 ) {\displaystyle w\mapsto E(f_{N}(w,x_{1}),y_{1})} to find a local minimum, starting at w = w 0 {\displaystyle w=w_{0}} . This makes w 1 {\displaystyle w_{1}} the minimizing weight found by gradient descent. == Learning pseudocode == To implement the algorithm above, explicit formulas are required for the gradient of the function w ↦ E ( f N ( w , x ) , y ) {\displaystyle w\mapsto E(f_{N}(w,x),y)} where the function is E ( y , y ′ ) = | y − y ′ | 2 {\displaystyle E(y,y')=|y-y'|^{2}} . The learning algorithm can be divided into two phases: propagation and weight update. === Propagation === Propagation involves the following steps: Propagation forward through the network to generate the output value(s) Calculation of the cost (error term) Propagation of the output activations back through the network using the training pattern target to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons. === Weight update === For each weight: Multiply the weight's output delta and input activation to find the gradient of the weight. Subtract the ratio (percentage) of the weight's gradient from the weight. The learning rate is the ratio (percentage) that influences the speed and quality of learning. The greater the ratio, the faster the neuron trains, but the lower the ratio, the more accurat
Diffusion model
In machine learning, diffusion models, also known as diffusion-based generative models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of two major components: the forward diffusion process, and the reverse sampling process. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly as the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality. There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. They are typically trained using variational inference. The model responsible for denoising is typically called its "backbone". The backbone may be of any kind, but they are typically U-nets or transformers. As of 2024, diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise. The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image. Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation. Other than computer vision, diffusion models have also found applications in natural language processing such as text generation and summarization, sound generation, and reinforcement learning. == Denoising diffusion model == === Non-equilibrium thermodynamics === Diffusion models were introduced in 2015 as a method to train a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion. Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution. The equilibrium distribution is the Gaussian distribution N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} , with pdf ρ ( x ) ∝ e − 1 2 ‖ x ‖ 2 {\displaystyle \rho (x)\propto e^{-{\frac {1}{2}}\|x\|^{2}}} . This is just the Maxwell–Boltzmann distribution of particles in a potential well V ( x ) = 1 2 ‖ x ‖ 2 {\displaystyle V(x)={\frac {1}{2}}\|x\|^{2}} at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they will all fall to the origin, collapsing the distribution. === Denoising Diffusion Probabilistic Model (DDPM) === The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference. ==== Forward diffusion ==== To present the model, some notation is required. β 1 , . . . , β T ∈ ( 0 , 1 ) {\displaystyle \beta _{1},...,\beta _{T}\in (0,1)} are fixed constants. α t := 1 − β t {\displaystyle \alpha _{t}:=1-\beta _{t}} α ¯ t := α 1 ⋯ α t {\displaystyle {\bar {\alpha }}_{t}:=\alpha _{1}\cdots \alpha _{t}} σ t := 1 − α ¯ t {\displaystyle \sigma _{t}:={\sqrt {1-{\bar {\alpha }}_{t}}}} σ ~ t := σ t − 1 σ t β t {\displaystyle {\tilde {\sigma }}_{t}:={\frac {\sigma _{t-1}}{\sigma _{t}}}{\sqrt {\beta _{t}}}} μ ~ t ( x t , x 0 ) := α t ( 1 − α ¯ t − 1 ) x t + α ¯ t − 1 ( 1 − α t ) x 0 σ t 2 {\displaystyle {\tilde {\mu }}_{t}(x_{t},x_{0}):={\frac {{\sqrt {\alpha _{t}}}(1-{\bar {\alpha }}_{t-1})x_{t}+{\sqrt {{\bar {\alpha }}_{t-1}}}(1-\alpha _{t})x_{0}}{\sigma _{t}^{2}}}} N ( μ , Σ ) {\displaystyle {\mathcal {N}}(\mu ,\Sigma )} is the normal distribution with mean μ {\displaystyle \mu } and variance Σ {\displaystyle \Sigma } , and N ( x | μ , Σ ) {\displaystyle {\mathcal {N}}(x|\mu ,\Sigma )} is the probability density at x {\displaystyle x} . A vertical bar denotes conditioning. A forward diffusion process starts at some starting point x 0 ∼ q {\displaystyle x_{0}\sim q} , where q {\displaystyle q} is the probability distribution to be learned, then repeatedly adds noise to it by x t = 1 − β t x t − 1 + β t z t {\displaystyle x_{t}={\sqrt {1-\beta _{t}}}x_{t-1}+{\sqrt {\beta _{t}}}z_{t}} where z 1 , . . . , z T {\displaystyle z_{1},...,z_{T}} are IID (Independent and identically distributed random variables) samples from N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . The coefficients 1 − β t {\displaystyle {\sqrt {1-\beta _{t}}}} and β t {\displaystyle {\sqrt {\beta _{t}}}} ensure that Var ( X t ) = I {\displaystyle {\mbox{Var}}(X_{t})=I} assuming that Var ( X 0 ) = I {\displaystyle {\mbox{Var}}(X_{0})=I} . The values of β t {\displaystyle \beta _{t}} are chosen such that for any starting distribution of x 0 {\displaystyle x_{0}} , if it has finite second moment, then lim t → ∞ x t | x 0 {\displaystyle \lim _{t\to \infty }x_{t}|x_{0}} converges to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . The entire diffusion process then satisfies q ( x 0 : T ) = q ( x 0 ) q ( x 1 | x 0 ) ⋯ q ( x T | x T − 1 ) = q ( x 0 ) N ( x 1 | α 1 x 0 , β 1 I ) ⋯ N ( x T | α T x T − 1 , β T I ) {\displaystyle q(x_{0:T})=q(x_{0})q(x_{1}|x_{0})\cdots q(x_{T}|x_{T-1})=q(x_{0}){\mathcal {N}}(x_{1}|{\sqrt {\alpha _{1}}}x_{0},\beta _{1}I)\cdots {\mathcal {N}}(x_{T}|{\sqrt {\alpha _{T}}}x_{T-1},\beta _{T}I)} or ln q ( x 0 : T ) = ln q ( x 0 ) − ∑ t = 1 T 1 2 β t ‖ x t − 1 − β t x t − 1 ‖ 2 + C {\displaystyle \ln q(x_{0:T})=\ln q(x_{0})-\sum _{t=1}^{T}{\frac {1}{2\beta _{t}}}\|x_{t}-{\sqrt {1-\beta _{t}}}x_{t-1}\|^{2}+C} where C {\displaystyle C} is a normalization constant and often omitted. In particular, we note that x 1 : T | x 0 {\displaystyle x_{1:T}|x_{0}} is a Gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation with Gaussian process, x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} x t − 1 | x t , x 0 ∼ N ( μ ~ t ( x t , x 0 ) , σ ~ t 2 I ) {\displaystyle x_{t-1}|x_{t},x_{0}\sim {\mathcal {N}}({\tilde {\mu }}_{t}(x_{t},x_{0}),{\tilde {\sigma }}_{t}^{2}I)} In particular, notice that for large t {\displaystyle t} , the variable x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} converges to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . That is, after a long enough diffusion process, we end up with some x T {\displaystyle x_{T}} that is very close to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} , with all traces of the original x 0 ∼ q {\displaystyle x_{0}\sim q} gone. For example, since x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} we can sample x t | x 0 {\displaystyle x_{t}|x_{0}} directly "in one step", instead of going through all the intermediate steps x 1 , x 2 , . . . , x t − 1 {\displaystyle x_{1},x_{2},...,x_{t-1}} . ==== Backward diffusion ==== The key idea of DDPM is to use a neural network parametrized by θ {\displaystyle \theta } . The network takes in two arguments x t , t {\displaystyle x_{t},t} , and outputs a vector μ θ ( x t , t ) {\displaystyle \mu _{\theta }(x_{t},t)} and a matrix Σ θ ( x t , t ) {\displaystyle \Sigma _{\theta }(x_{t},t)} , such that each step in the forward diffusion process can be approximately undone by x t − 1 ∼ N ( μ θ ( x t , t ) , Σ θ ( x t , t ) ) {\displaystyle x_{t-1}\sim {\mathcal {N}}(\mu _{\theta }(x_{t},t),\Sigma _{\theta }(x_{t},t))} . This then gives us a backward diffusion process p θ {\displaystyle p_{\theta }} defined by p θ ( x T ) = N ( x T | 0 , I ) {\displaystyle p_{\theta }(x
Error tolerance (PAC learning)
In PAC learning, error tolerance refers to the ability of an algorithm to learn when the examples received have been corrupted in some way. In fact, this is a very common and important issue since in many applications it is not possible to access noise-free data. Noise can interfere with the learning process at different levels: the algorithm may receive data that have been occasionally mislabeled, or the inputs may have some false information, or the classification of the examples may have been maliciously adulterated. == Notation and the Valiant learning model == In the following, let X {\displaystyle X} be our n {\displaystyle n} -dimensional input space. Let H {\displaystyle {\mathcal {H}}} be a class of functions that we wish to use in order to learn a { 0 , 1 } {\displaystyle \{0,1\}} -valued target function f {\displaystyle f} defined over X {\displaystyle X} . Let D {\displaystyle {\mathcal {D}}} be the distribution of the inputs over X {\displaystyle X} . The goal of a learning algorithm A {\displaystyle {\mathcal {A}}} is to choose the best function h ∈ H {\displaystyle h\in {\mathcal {H}}} such that it minimizes e r r o r ( h ) = P x ∼ D ( h ( x ) ≠ f ( x ) ) {\displaystyle error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))} . Let us suppose we have a function s i z e ( f ) {\displaystyle size(f)} that can measure the complexity of f {\displaystyle f} . Let Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} be an oracle that, whenever called, returns an example x {\displaystyle x} and its correct label f ( x ) {\displaystyle f(x)} . When no noise corrupts the data, we can define learning in the Valiant setting: Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the Valiant setting if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} and a polynomial p ( ⋅ , ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot ,\cdot )} such that for any 0 < ε ≤ 1 {\displaystyle 0<\varepsilon \leq 1} and 0 < δ ≤ 1 {\displaystyle 0<\delta \leq 1} it outputs, in a number of calls to the oracle bounded by p ( 1 ε , 1 δ , n , size ( f ) ) {\displaystyle p\left({\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,{\text{size}}(f)\right)} , a function h ∈ H {\displaystyle h\in {\mathcal {H}}} that satisfies with probability at least 1 − δ {\displaystyle 1-\delta } the condition error ( h ) ≤ ε {\displaystyle {\text{error}}(h)\leq \varepsilon } . In the following we will define learnability of f {\displaystyle f} when data have suffered some modification. == Classification noise == In the classification noise model a noise rate 0 ≤ η < 1 2 {\displaystyle 0\leq \eta <{\frac {1}{2}}} is introduced. Then, instead of Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} that returns always the correct label of example x {\displaystyle x} , algorithm A {\displaystyle {\mathcal {A}}} can only call a faulty oracle Oracle ( x , η ) {\displaystyle {\text{Oracle}}(x,\eta )} that will flip the label of x {\displaystyle x} with probability η {\displaystyle \eta } . As in the Valiant case, the goal of a learning algorithm A {\displaystyle {\mathcal {A}}} is to choose the best function h ∈ H {\displaystyle h\in {\mathcal {H}}} such that it minimizes e r r o r ( h ) = P x ∼ D ( h ( x ) ≠ f ( x ) ) {\displaystyle error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))} . In applications it is difficult to have access to the real value of η {\displaystyle \eta } , but we assume we have access to its upperbound η B {\displaystyle \eta _{B}} . Note that if we allow the noise rate to be 1 / 2 {\displaystyle 1/2} , then learning becomes impossible in any amount of computation time, because every label conveys no information about the target function. Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the classification noise model if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , η ) {\displaystyle {\text{Oracle}}(x,\eta )} and a polynomial p ( ⋅ , ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot ,\cdot )} such that for any 0 ≤ η ≤ 1 2 {\displaystyle 0\leq \eta \leq {\frac {1}{2}}} , 0 ≤ ε ≤ 1 {\displaystyle 0\leq \varepsilon \leq 1} and 0 ≤ δ ≤ 1 {\displaystyle 0\leq \delta \leq 1} it outputs, in a number of calls to the oracle bounded by p ( 1 1 − 2 η B , 1 ε , 1 δ , n , s i z e ( f ) ) {\displaystyle p\left({\frac {1}{1-2\eta _{B}}},{\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,size(f)\right)} , a function h ∈ H {\displaystyle h\in {\mathcal {H}}} that satisfies with probability at least 1 − δ {\displaystyle 1-\delta } the condition e r r o r ( h ) ≤ ε {\displaystyle error(h)\leq \varepsilon } . == Statistical query learning == Statistical Query Learning is a kind of active learning problem in which the learning algorithm A {\displaystyle {\mathcal {A}}} can decide if to request information about the likelihood P f ( x ) {\displaystyle P_{f(x)}} that a function f {\displaystyle f} correctly labels example x {\displaystyle x} , and receives an answer accurate within a tolerance α {\displaystyle \alpha } . Formally, whenever the learning algorithm A {\displaystyle {\mathcal {A}}} calls the oracle Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} , it receives as feedback probability Q f ( x ) {\displaystyle Q_{f(x)}} , such that Q f ( x ) − α ≤ P f ( x ) ≤ Q f ( x ) + α {\displaystyle Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha } . Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the statistical query learning model if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} and polynomials p ( ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot )} , q ( ⋅ , ⋅ , ⋅ ) {\displaystyle q(\cdot ,\cdot ,\cdot )} , and r ( ⋅ , ⋅ , ⋅ ) {\displaystyle r(\cdot ,\cdot ,\cdot )} such that for any 0 < ε ≤ 1 {\displaystyle 0<\varepsilon \leq 1} the following hold: Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} can evaluate P f ( x ) {\displaystyle P_{f(x)}} in time q ( 1 ε , n , s i z e ( f ) ) {\displaystyle q\left({\frac {1}{\varepsilon }},n,size(f)\right)} ; 1 α {\displaystyle {\frac {1}{\alpha }}} is bounded by r ( 1 ε , n , s i z e ( f ) ) {\displaystyle r\left({\frac {1}{\varepsilon }},n,size(f)\right)} A {\displaystyle {\mathcal {A}}} outputs a model h {\displaystyle h} such that e r r ( h ) < ε {\displaystyle err(h)<\varepsilon } , in a number of calls to the oracle bounded by p ( 1 ε , n , s i z e ( f ) ) {\displaystyle p\left({\frac {1}{\varepsilon }},n,size(f)\right)} . Note that the confidence parameter δ {\displaystyle \delta } does not appear in the definition of learning. This is because the main purpose of δ {\displaystyle \delta } is to allow the learning algorithm a small probability of failure due to an unrepresentative sample. Since now Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} always guarantees to meet the approximation criterion Q f ( x ) − α ≤ P f ( x ) ≤ Q f ( x ) + α {\displaystyle Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha } , the failure probability is no longer needed. The statistical query model is strictly weaker than the PAC model: any efficiently SQ-learnable class is efficiently PAC learnable in the presence of classification noise, but there exist efficient PAC-learnable problems such as parity that are not efficiently SQ-learnable. == Malicious classification == In the malicious classification model an adversary generates errors to foil the learning algorithm. This setting describes situations of error burst, which may occur when for a limited time transmission equipment malfunctions repeatedly. Formally, algorithm A {\displaystyle {\mathcal {A}}} calls an oracle Oracle ( x , β ) {\displaystyle {\text{Oracle}}(x,\beta )} that returns a correctly labeled example x {\displaystyle x} drawn, as usual, from distribution D {\displaystyle {\mathcal {D}}} over the input space with probability 1 − β {\displaystyle 1-\beta } , but it returns with probability β {\displaystyle \beta } an example drawn from a distribution that is not related to D {\displaystyle {\mathcal {D}}} . Moreover, this maliciously chosen example may strategically selected by an adversary who has knowledge of f {\displaystyle f} , β {\displaystyle \beta } , D {\displaystyle {\mathcal {D}}} , or the current progress of the learning algorithm. Definition: Given a bound β B < 1 2 {\displaystyle \beta _{B}<{\frac {1}{2}}} for 0 ≤ β < 1 2 {\displaystyle 0\leq \beta <{\frac {1}{2}}} , we say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the malicious classification model, if there exist a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , β ) {\displaystyle {\text{Oracle}}(x,\beta )}
Semantic folding
Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex. == Theory == Semantic folding theory draws inspiration from Douglas R. Hofstadter's Analogy as the Core of Cognition which suggests that the brain makes sense of the world by identifying and applying analogies. The theory hypothesises that semantic data must therefore be introduced to the neocortex in such a form as to allow the application of a similarity measure and offers, as a solution, the sparse binary vector employing a two-dimensional topographic semantic space as a distributional reference frame. The theory builds on the computational theory of the human cortex known as hierarchical temporal memory (HTM), and positions itself as a complementary theory for the representation of language semantics. A particular strength claimed by this approach is that the resulting binary representation enables complex semantic operations to be performed simply and efficiently at the most basic computational level. == Two-dimensional semantic space == Analogous to the structure of the neocortex, Semantic Folding theory posits the implementation of a semantic space as a two-dimensional grid. This grid is populated by context-vectors in such a way as to place similar context-vectors closer to each other, for instance, by using competitive learning principles. This vector space model is presented in the theory as an equivalence to the well known word space model described in the information retrieval literature. Given a semantic space (implemented as described above) a word-vector can be obtained for any given word Y by employing the following algorithm: For each position X in the semantic map (where X represents cartesian coordinates) if the word Y is contained in the context-vector at position X then add 1 to the corresponding position in the word-vector for Y else add 0 to the corresponding position in the word-vector for Y The result of this process will be a word-vector containing all the contexts in which the word Y appears and will therefore be representative of the semantics of that word in the semantic space. It can be seen that the resulting word-vector is also in a sparse distributed representation (SDR) format [Schütze, 1993] & [Sahlgreen, 2006]. Some properties of word-SDRs that are of particular interest with respect to computational semantics are: high noise resistance: As a result of similar contexts being placed closer together in the underlying map, word-SDRs are highly tolerant of false or shifted "bits". boolean logic: It is possible to manipulate word-SDRs in a meaningful way using boolean (OR, AND, exclusive-OR) and/or arithmetical (SUBtract) functions . sub-sampling: Word-SDRs can be sub-sampled to a high degree without any appreciable loss of semantic information. topological two-dimensional representation: The SDR representation maintains the topological distribution of the underlying map therefore words with similar meanings will have similar word-vectors. This suggests that a variety of measures can be applied to the calculation of semantic similarity, from a simple overlap of vector elements, to a range of distance measures such as: Euclidean distance, Hamming distance, Jaccard distance, cosine similarity, Levenshtein distance, Sørensen-Dice index, etc. == Semantic spaces == Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: Vocabulary mismatch (the fact that the same meaning can be expressed in many ways) and ambiguity of natural language (the fact that the same term can have several meanings). The application of semantic spaces in natural language processing (NLP) aims at overcoming limitations of rule-based or model-based approaches operating on the keyword level. The main drawback with these approaches is their brittleness, and the large manual effort required to create either rule-based NLP systems or training corpora for model learning. Rule-based and machine learning-based models are fixed on the keyword level and break down if the vocabulary differs from that defined in the rules or from the training material used for the statistical models. Research in semantic spaces dates back more than 20 years. In 1996, two papers were published that raised a lot of attention around the general idea of creating semantic spaces: latent semantic analysis from Microsoft and Hyperspace Analogue to Language from the University of California. However, their adoption was limited by the large computational effort required to construct and use those semantic spaces. A breakthrough with regard to the accuracy of modelling associative relations between words (e.g. "spider-web", "lighter-cigarette", as opposed to synonymous relations such as "whale-dolphin", "astronaut-driver") was achieved by explicit semantic analysis (ESA) in 2007. ESA was a novel (non-machine learning) based approach that represented words in the form of vectors with 100,000 dimensions (where each dimension represents an Article in Wikipedia). However practical applications of the approach are limited due to the large number of required dimensions in the vectors. More recently, advances in neural networking techniques in combination with other new approaches (tensors) led to a host of new recent developments: Word2vec from Google and GloVe from Stanford University. Semantic folding represents a novel, biologically inspired approach to semantic spaces where each word is represented as a sparse binary vector with 16,000 dimensions (a semantic fingerprint) in a 2D semantic map (the semantic universe). Sparse binary representation are advantageous in terms of computational efficiency, and allow for the storage of very large numbers of possible patterns. == Visualization == The topological distribution over a two-dimensional grid (outlined above) lends itself to a bitmap type visualization of the semantics of any word or text, where each active semantic feature can be displayed as e.g. a pixel. As can be seen in the images shown here, this representation allows for a direct visual comparison of the semantics of two (or more) linguistic items. Image 1 clearly demonstrates that the two disparate terms "dog" and "car" have, as expected, very obviously different semantics. Image 2 shows that only one of the meaning contexts of "jaguar", that of "Jaguar" the car, overlaps with the meaning of Porsche (indicating partial similarity). Other meaning contexts of "jaguar" e.g. "jaguar" the animal clearly have different non-overlapping contexts. The visualization of semantic similarity using Semantic Folding bears a strong resemblance to the fMRI images produced in a research study conducted by A.G. Huth et al., where it is claimed that words are grouped in the brain by meaning. voxels, little volume segments of the brain, were found to follow a pattern were semantic information is represented along the boundary of the visual cortex with visual and linguistic categories represented on posterior and anterior side respectively.
Evolutionary algorithm
Evolutionary algorithms (EA) reproduce essential elements of biological evolution in a computer algorithm in order to solve "difficult" problems, at least approximately, for which no exact or satisfactory solution methods are known. They are metaheuristics and population-based bio-inspired algorithms and evolutionary computation, which itself are part of the field of computational intelligence. The mechanisms of biological evolution that an EA mainly imitates are reproduction, mutation, recombination and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). Evolution of the population then takes place after the repeated application of the above operators. Evolutionary algorithms often perform well approximating solutions to all types of problems because they ideally do not make any assumption about the underlying fitness landscape. Techniques from evolutionary algorithms applied to the modeling of biological evolution are generally limited to explorations of microevolution (microevolutionary processes) and planning models based upon cellular processes. In most real applications of EAs, computational complexity is a prohibiting factor. In fact, this computational complexity is due to fitness function evaluation. Fitness approximation is one of the solutions to overcome this difficulty. However, seemingly simple EA can solve often complex problems; therefore, there may be no direct link between algorithm complexity and problem complexity. == Generic definition == The following is an example of a generic evolutionary algorithm: Randomly generate the initial population of individuals, the first generation. Evaluate the fitness of each individual in the population. Check, if the goal is reached and the algorithm can be terminated. Select individuals as parents, preferably of higher fitness. Produce offspring with optional crossover (mimicking reproduction). Apply mutation operations on the offspring. Select individuals preferably of lower fitness for replacement with new individuals (mimicking natural selection). Return to 2 == Types == Similar techniques differ in genetic representation and other implementation details, and the nature of the particular applied problem. Genetic algorithm – This is the most popular type of EA. One seeks the solution of a problem in the form of strings of numbers (traditionally binary, although the best representations are usually those that reflect something about the problem being solved), by applying operators such as recombination and mutation (sometimes one, sometimes both). This type of EA is often used in optimization problems. Genetic programming – Here the solutions are in the form of computer programs, and their fitness is determined by their ability to solve a computational problem. There are many variants of Genetic Programming: Cartesian genetic programming Gene expression programming Grammatical evolution Linear genetic programming Multi expression programming Evolutionary programming – Similar to evolution strategy, but with a deterministic selection of all parents. Evolution strategy (ES) – Works with vectors of real numbers as representations of solutions, and typically uses self-adaptive mutation rates. The method is mainly used for numerical optimization, although there are also variants for combinatorial tasks. CMA-ES Natural evolution strategy Differential evolution – Based on vector differences and is therefore primarily suited for numerical optimization problems. Coevolutionary algorithm – Similar to genetic algorithms and evolution strategies, but the created solutions are compared on the basis of their outcomes from interactions with other solutions. Solutions can either compete or cooperate during the search process. Coevolutionary algorithms are often used in scenarios where the fitness landscape is dynamic, complex, or involves competitive interactions. Neuroevolution – Similar to genetic programming but the genomes represent artificial neural networks by describing structure and connection weights. The genome encoding can be direct or indirect. Learning classifier system – Here the solution is a set of classifiers (rules or conditions). A Michigan-LCS evolves at the level of individual classifiers whereas a Pittsburgh-LCS uses populations of classifier-sets. Initially, classifiers were only binary, but now include real, neural net, or S-expression types. Fitness is typically determined with either a strength or accuracy based reinforcement learning or supervised learning approach. Quality–Diversity algorithms – QD algorithms simultaneously aim for high-quality and diverse solutions. Unlike traditional optimization algorithms that solely focus on finding the best solution to a problem, QD algorithms explore a wide variety of solutions across a problem space and keep those that are not just high performing, but also diverse and unique. == Theoretical background == The following theoretical principles apply to all or almost all EAs. === No free lunch theorem === The no free lunch theorem of optimization states that all optimization strategies are equally effective when the set of all optimization problems is considered. Under the same condition, no evolutionary algorithm is fundamentally better than another. This can only be the case if the set of all problems is restricted. This is exactly what is inevitably done in practice. Therefore, to improve an EA, it must exploit problem knowledge in some form (e.g. by choosing a certain mutation strength or a problem-adapted coding). Thus, if two EAs are compared, this constraint is implied. In addition, an EA can use problem specific knowledge by, for example, not randomly generating the entire start population, but creating some individuals through heuristics or other procedures. Another possibility to tailor an EA to a given problem domain is to involve suitable heuristics, local search procedures or other problem-related procedures in the process of generating the offspring. This form of extension of an EA is also known as a memetic algorithm. Both extensions play a major role in practical applications, as they can speed up the search process and make it more robust. === Convergence === For EAs in which, in addition to the offspring, at least the best individual of the parent generation is used to form the subsequent generation (so-called elitist EAs), there is a general proof of convergence under the condition that an optimum exists. Without loss of generality, a maximum search is assumed for the proof: From the property of elitist offspring acceptance and the existence of the optimum it follows that per generation k {\displaystyle k} an improvement of the fitness F {\displaystyle F} of the respective best individual x ′ {\displaystyle x'} will occur with a probability P > 0 {\displaystyle P>0} . Thus: F ( x 1 ′ ) ≤ F ( x 2 ′ ) ≤ F ( x 3 ′ ) ≤ ⋯ ≤ F ( x k ′ ) ≤ ⋯ {\displaystyle F(x'_{1})\leq F(x'_{2})\leq F(x'_{3})\leq \cdots \leq F(x'_{k})\leq \cdots } I.e., the fitness values represent a monotonically non-decreasing sequence, which is bounded due to the existence of the optimum. From this follows the convergence of the sequence against the optimum. Since the proof makes no statement about the speed of convergence, it is of little help in practical applications of EAs. But it does justify the recommendation to use elitist EAs. However, when using the usual panmictic population model, elitist EAs tend to converge prematurely more than non-elitist ones. In a panmictic population model, mate selection (see step 4 of the generic definition) is such that every individual in the entire population is eligible as a mate. In non-panmictic populations, selection is suitably restricted, so that the dispersal speed of better individuals is reduced compared to panmictic ones. Thus, the general risk of premature convergence of elitist EAs can be significantly reduced by suitable population models that restrict mate selection. === Virtual alphabets === With the theory of virtual alphabets, David E. Goldberg showed in 1990 that by using a representation with real numbers, an EA that uses classical recombination operators (e.g. uniform or n-point crossover) cannot reach certain areas of the search space, in contrast to a coding with binary numbers. This results in the recommendation for EAs with real representation to use arithmetic operators for recombination (e.g. arithmetic mean or intermediate recombination). With suitable operators, real-valued representations are more effective than binary ones, contrary to earlier opinion. == Comparison to other concepts == === Biological processes === A possible limitation of many evolutionary algorithms is their lack of a clear genotype–phenotype distinction. In nature, the fertilized egg cell undergoes a complex process known as embryogenesis to become a mature p