Snap rounding is a method of approximating line segment locations by creating a grid and placing each point in the centre of a cell (pixel) of the grid. The method preserves certain topological properties of the arrangement of line segments. Drawbacks include the potential interpolation of additional vertices in line segments (lines become polylines), the arbitrary closeness of a point to a non-incident edge, and arbitrary numbers of intersections between input line-segments. The 3 dimensional case is worse, with a polyhedral subdivision of complexity n becoming complexity O(n4). There are more refined algorithms to cope with some of these issues, for example iterated snap rounding guarantees a "large" separation between points and non-incident edges. == Algorithm == ... (please edit). See, and https://www.cgal.org/ () == Properties == Canonicity: Efficiency; A number of efficient implementations exist. Conversely there are undesirable properties: Non-idempotence: Repeated applications can cause arbitrary drift of points. Exception on "Stable snap rounding" algorithms, see https://doi.org/10.1016/j.comgeo.2012.02.011
Google Clips
Google Clips is a discontinued miniature clip-on camera device developed by Google. == History == It was announced on October 4, 2017 and went on sale on January 27, 2018. Google Clips automatically captured video clips (without audio) at moments its machine learning algorithms determined to be interesting or relevant. An indicator flashed when the camera was looking for scenes to capture. Google Clips' artificial intelligence (AI) could learn the faces of people to take photographs with certain people, and could automatically set lighting and framing. It had 16 GB of storage built-in storage and could record clips for up to 3 hours. This camera was originally priced at US$249 in the United States. It was withdrawn from sale on October 15, 2019, but supported until the end of December 2021. == Reception == The Independent wrote that Google Clips is "an impressive little device, but one that also has the potential to feel very creepy." According to The Verge's generally negative review, "it didn't capture anything special" over two weeks of testing.
Diffusion model
In machine learning, diffusion models, also known as diffusion-based generative models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of two major components: the forward diffusion process, and the reverse sampling process. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly as the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality. There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. They are typically trained using variational inference. The model responsible for denoising is typically called its "backbone". The backbone may be of any kind, but they are typically U-nets or transformers. As of 2024, diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise. The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image. Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation. Other than computer vision, diffusion models have also found applications in natural language processing such as text generation and summarization, sound generation, and reinforcement learning. == Denoising diffusion model == === Non-equilibrium thermodynamics === Diffusion models were introduced in 2015 as a method to train a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion. Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution. The equilibrium distribution is the Gaussian distribution N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} , with pdf ρ ( x ) ∝ e − 1 2 ‖ x ‖ 2 {\displaystyle \rho (x)\propto e^{-{\frac {1}{2}}\|x\|^{2}}} . This is just the Maxwell–Boltzmann distribution of particles in a potential well V ( x ) = 1 2 ‖ x ‖ 2 {\displaystyle V(x)={\frac {1}{2}}\|x\|^{2}} at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they will all fall to the origin, collapsing the distribution. === Denoising Diffusion Probabilistic Model (DDPM) === The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference. ==== Forward diffusion ==== To present the model, some notation is required. β 1 , . . . , β T ∈ ( 0 , 1 ) {\displaystyle \beta _{1},...,\beta _{T}\in (0,1)} are fixed constants. α t := 1 − β t {\displaystyle \alpha _{t}:=1-\beta _{t}} α ¯ t := α 1 ⋯ α t {\displaystyle {\bar {\alpha }}_{t}:=\alpha _{1}\cdots \alpha _{t}} σ t := 1 − α ¯ t {\displaystyle \sigma _{t}:={\sqrt {1-{\bar {\alpha }}_{t}}}} σ ~ t := σ t − 1 σ t β t {\displaystyle {\tilde {\sigma }}_{t}:={\frac {\sigma _{t-1}}{\sigma _{t}}}{\sqrt {\beta _{t}}}} μ ~ t ( x t , x 0 ) := α t ( 1 − α ¯ t − 1 ) x t + α ¯ t − 1 ( 1 − α t ) x 0 σ t 2 {\displaystyle {\tilde {\mu }}_{t}(x_{t},x_{0}):={\frac {{\sqrt {\alpha _{t}}}(1-{\bar {\alpha }}_{t-1})x_{t}+{\sqrt {{\bar {\alpha }}_{t-1}}}(1-\alpha _{t})x_{0}}{\sigma _{t}^{2}}}} N ( μ , Σ ) {\displaystyle {\mathcal {N}}(\mu ,\Sigma )} is the normal distribution with mean μ {\displaystyle \mu } and variance Σ {\displaystyle \Sigma } , and N ( x | μ , Σ ) {\displaystyle {\mathcal {N}}(x|\mu ,\Sigma )} is the probability density at x {\displaystyle x} . A vertical bar denotes conditioning. A forward diffusion process starts at some starting point x 0 ∼ q {\displaystyle x_{0}\sim q} , where q {\displaystyle q} is the probability distribution to be learned, then repeatedly adds noise to it by x t = 1 − β t x t − 1 + β t z t {\displaystyle x_{t}={\sqrt {1-\beta _{t}}}x_{t-1}+{\sqrt {\beta _{t}}}z_{t}} where z 1 , . . . , z T {\displaystyle z_{1},...,z_{T}} are IID (Independent and identically distributed random variables) samples from N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . The coefficients 1 − β t {\displaystyle {\sqrt {1-\beta _{t}}}} and β t {\displaystyle {\sqrt {\beta _{t}}}} ensure that Var ( X t ) = I {\displaystyle {\mbox{Var}}(X_{t})=I} assuming that Var ( X 0 ) = I {\displaystyle {\mbox{Var}}(X_{0})=I} . The values of β t {\displaystyle \beta _{t}} are chosen such that for any starting distribution of x 0 {\displaystyle x_{0}} , if it has finite second moment, then lim t → ∞ x t | x 0 {\displaystyle \lim _{t\to \infty }x_{t}|x_{0}} converges to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . The entire diffusion process then satisfies q ( x 0 : T ) = q ( x 0 ) q ( x 1 | x 0 ) ⋯ q ( x T | x T − 1 ) = q ( x 0 ) N ( x 1 | α 1 x 0 , β 1 I ) ⋯ N ( x T | α T x T − 1 , β T I ) {\displaystyle q(x_{0:T})=q(x_{0})q(x_{1}|x_{0})\cdots q(x_{T}|x_{T-1})=q(x_{0}){\mathcal {N}}(x_{1}|{\sqrt {\alpha _{1}}}x_{0},\beta _{1}I)\cdots {\mathcal {N}}(x_{T}|{\sqrt {\alpha _{T}}}x_{T-1},\beta _{T}I)} or ln q ( x 0 : T ) = ln q ( x 0 ) − ∑ t = 1 T 1 2 β t ‖ x t − 1 − β t x t − 1 ‖ 2 + C {\displaystyle \ln q(x_{0:T})=\ln q(x_{0})-\sum _{t=1}^{T}{\frac {1}{2\beta _{t}}}\|x_{t}-{\sqrt {1-\beta _{t}}}x_{t-1}\|^{2}+C} where C {\displaystyle C} is a normalization constant and often omitted. In particular, we note that x 1 : T | x 0 {\displaystyle x_{1:T}|x_{0}} is a Gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation with Gaussian process, x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} x t − 1 | x t , x 0 ∼ N ( μ ~ t ( x t , x 0 ) , σ ~ t 2 I ) {\displaystyle x_{t-1}|x_{t},x_{0}\sim {\mathcal {N}}({\tilde {\mu }}_{t}(x_{t},x_{0}),{\tilde {\sigma }}_{t}^{2}I)} In particular, notice that for large t {\displaystyle t} , the variable x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} converges to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . That is, after a long enough diffusion process, we end up with some x T {\displaystyle x_{T}} that is very close to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} , with all traces of the original x 0 ∼ q {\displaystyle x_{0}\sim q} gone. For example, since x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} we can sample x t | x 0 {\displaystyle x_{t}|x_{0}} directly "in one step", instead of going through all the intermediate steps x 1 , x 2 , . . . , x t − 1 {\displaystyle x_{1},x_{2},...,x_{t-1}} . ==== Backward diffusion ==== The key idea of DDPM is to use a neural network parametrized by θ {\displaystyle \theta } . The network takes in two arguments x t , t {\displaystyle x_{t},t} , and outputs a vector μ θ ( x t , t ) {\displaystyle \mu _{\theta }(x_{t},t)} and a matrix Σ θ ( x t , t ) {\displaystyle \Sigma _{\theta }(x_{t},t)} , such that each step in the forward diffusion process can be approximately undone by x t − 1 ∼ N ( μ θ ( x t , t ) , Σ θ ( x t , t ) ) {\displaystyle x_{t-1}\sim {\mathcal {N}}(\mu _{\theta }(x_{t},t),\Sigma _{\theta }(x_{t},t))} . This then gives us a backward diffusion process p θ {\displaystyle p_{\theta }} defined by p θ ( x T ) = N ( x T | 0 , I ) {\displaystyle p_{\theta }(x
Conference on Computer Vision and Pattern Recognition
The Conference on Computer Vision and Pattern Recognition is an annual conference on computer vision and pattern recognition. == Affiliations == The conference was first held in 1983 in Washington, DC, organized by Takeo Kanade and Dana H. Ballard. From 1985 to 2010 it was sponsored by the IEEE Computer Society. In 2011 it was also co-sponsored by University of Colorado Colorado Springs. Since 2012 it has been co-sponsored by the IEEE Computer Society and the Computer Vision Foundation, which provides open access to the conference papers. == Scope == The conference considers a wide range of topics related to computer vision and pattern recognition—basically any topic that is extracting structures or answers from images or video or applying mathematical methods to data to extract or recognize patterns. Common topics include object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning. The conference generally has less than 30% acceptance rates for all papers and less than 5% for oral presentations. It is managed by a rotating group of volunteers who are chosen in a public election at the Pattern Analysis and Machine Intelligence-Technical Community (PAMI-TC) meeting four years before the meeting. The conference uses a multi-tier double-blind peer review process. The program chairs, who cannot submit papers, select area chairs who manage the reviewers for their subset of submissions. == Location and time == The conference is usually held in June in North America. == Awards == === Best Paper Award === These awards are picked by committees delegated by the program chairs of the conference. === Longuet-Higgins Prize === The Longuet-Higgins Prize recognizes papers from ten years ago that have made a significant impact on computer vision research. === PAMI Young Researcher Award === The Pattern Analysis and Machine Intelligence Young Researcher Award is an award given by the Technical Committee on Pattern Analysis and Machine Intelligence of the IEEE Computer Society to a researcher within 7 years of completing their Ph.D. for outstanding early career research contributions. Candidates are nominated by the computer vision community, with winners selected by a committee of senior researchers in the field. This award was originally instituted in 2012 by the journal Image and Vision Computing, also presented at the conference, and the journal continues to sponsor the award. === PAMI Thomas S. Huang Memorial Prize === The Thomas Huang Memorial Prize was established at the 2020 conference and is awarded annually starting from 2021 to honor researchers who are recognized as examples in research, teaching/mentoring, and service to the computer vision community.
Variational autoencoder
In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling in 2013. It is part of the families of probabilistic graphical models and variational Bayesian methods. In addition to being seen as an autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of variational Bayesian methods, connecting a neural encoder network to its decoder through a probabilistic latent space (for example, as a multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution. Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together with the usage of the reparameterization trick, although the variance of the noise model can be learned separately. Although this type of model was initially designed for unsupervised learning, its effectiveness has been proven for semi-supervised learning and supervised learning. == Overview of architecture and operation == A variational autoencoder is a generative model with a prior and noise distribution respectively. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually computationally intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. In that way, the same parameters are reused for multiple data points, which can result in massive memory savings. The first neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder. The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance, however this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent. To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as p-distribution. For example, a standard VAE task such as IMAGENET is typically assumed to have a gaussianly distributed noise; however, tasks such as binarized MNIST require a Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value. More recent approaches replace Kullback–Leibler divergence (KL-D) with various statistical distances, see "Statistical distance VAE variants" below. == Formulation == From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data x {\displaystyle x} by their chosen parameterized probability distribution p θ ( x ) = p ( x | θ ) {\displaystyle p_{\theta }(x)=p(x|\theta )} . This distribution is usually chosen to be a Gaussian N ( x | μ , σ ) {\displaystyle N(x|\mu ,\sigma )} which is parameterized by μ {\displaystyle \mu } and σ {\displaystyle \sigma } respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize, however distributions where a prior is assumed over the latents z {\displaystyle z} results in intractable integrals. Let us find p θ ( x ) {\displaystyle p_{\theta }(x)} via marginalizing over z {\displaystyle z} . p θ ( x ) = ∫ z p θ ( x , z ) d z , {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }({x,z})\,dz,} where p θ ( x , z ) {\displaystyle p_{\theta }({x,z})} represents the joint distribution under p θ {\displaystyle p_{\theta }} of the observable data x {\displaystyle x} and its latent representation or encoding z {\displaystyle z} . According to the chain rule, the equation can be rewritten as p θ ( x ) = ∫ z p θ ( x | z ) p θ ( z ) d z {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }({x|z})p_{\theta }(z)\,dz} In the vanilla variational autoencoder, z {\displaystyle z} is usually taken to be a finite-dimensional vector of real numbers, and p θ ( x | z ) {\displaystyle p_{\theta }({x|z})} to be a Gaussian distribution. Then p θ ( x ) {\displaystyle p_{\theta }(x)} is a mixture of Gaussian distributions. It is now possible to define the set of the relationships between the input data and its latent representation as Prior p θ ( z ) {\displaystyle p_{\theta }(z)} Likelihood p θ ( x | z ) {\displaystyle p_{\theta }(x|z)} Posterior p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} Unfortunately, the computation of p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} is expensive and in most cases intractable. To speed up the calculus to make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as q ϕ ( z | x ) ≈ p θ ( z | x ) {\displaystyle q_{\phi }({z|x})\approx p_{\theta }({z|x})} with ϕ {\displaystyle \phi } defined as the set of real values that parametrize q {\displaystyle q} . This is sometimes called amortized inference, since by "investing" in finding a good q ϕ {\displaystyle q_{\phi }} , one can later infer z {\displaystyle z} from x {\displaystyle x} quickly without doing any integrals. In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution p θ ( x | z ) {\displaystyle p_{\theta }(x|z)} is computed by the probabilistic decoder, and the approximated posterior distribution q ϕ ( z | x ) {\displaystyle q_{\phi }(z|x)} is computed by the probabilistic encoder. Parametrize the encoder as E ϕ {\displaystyle E_{\phi }} , and the decoder as D θ {\displaystyle D_{\theta }} . == Evidence lower bound (ELBO) == Like many deep learning approaches that use gradient-based optimization, VAEs require a differentiable loss function to update the network weights through backpropagation. For variational autoencoders, the idea is to jointly optimize the generative model parameters θ {\displaystyle \theta } to reduce the reconstruction error between the input and the output, and ϕ {\displaystyle \phi } to make q ϕ ( z | x ) {\displaystyle q_{\phi }({z|x})} as close as possible to p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} . As reconstruction loss, mean squared error and cross entropy are often used. The Kullback–Leibler divergence D K L ( q ϕ ( z | x ) ∥ p θ ( z | x ) ) {\displaystyle D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))} can be used as a loss function to squeeze q ϕ ( z | x ) {\displaystyle q_{\phi }({z|x})} under p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} . This divergence loss expands to D K L ( q ϕ ( z | x ) ∥ p θ ( z | x ) ) = E z ∼ q ϕ ( ⋅ | x ) [ ln q ϕ ( z | x ) p θ ( z | x ) ] = E z ∼ q ϕ ( ⋅ | x ) [ ln q ϕ ( z | x ) p θ ( x ) p θ ( x , z ) ] = ln p θ ( x ) + E z ∼ q ϕ ( ⋅ | x ) [ ln q ϕ ( z | x ) p θ ( x , z ) ] . {\displaystyle {\begin{aligned}D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))&=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }(z|x)}{p_{\theta }(z|x)}}\right]\\&=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }({z|x})p_{\theta }(x)}{p_{\theta }(x,z)}}\right]\\&=\ln p_{\theta }(x)+\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }({z|x})}{p_{\theta }(x,z)}}\right].\end{aligned}}} Now, define the evidence lower bound (ELBO): L θ , ϕ ( x ) := E z ∼ q ϕ ( ⋅ | x ) [ ln p θ ( x , z ) q ϕ ( z | x ) ] = ln p θ ( x ) − D K L ( q ϕ ( ⋅ | x ) ∥ p θ ( ⋅ | x ) ) {\displaystyle L_{\theta ,\phi }(x):=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]=\ln p_{\theta }(x)-D_{KL}(q_{\phi }({\cdot |x})\parallel p_{\theta }({\cdot |x}))} Maximizing the ELBO θ ∗ , ϕ ∗ = argmax θ , ϕ L θ , ϕ ( x ) {\dis
Hedgeable
Hedgeable, Inc. was a U.S. based financial services company and digital wealth management platform headquartered in New York City. Hedgeable was known for not following set allocations, and instead actively managing accounts in response to market movements. On August 9, 2018, Hedgeable closed its doors to new investors, with existing investors required to transfer out of the company. The company claimed that it was not shutting down but simply removing its SEC registration. == History == Hedgeable was founded in 2009 by twin brothers Michael and Matthew Kane, who previously worked at high-net worth investment managers such as Bridgewater Associates and Spruce Private Investors. Both Michael and Matthew graduated from Penn State University with degrees in finance. Hedgeable is a Registered Investment Advisor with the U.S. Securities and Exchange Commission. The company has received funding from SixThirty and Route 66 Ventures as well as various other angel investors. On August 9, 2018, Hedgeable closed its doors to new investors. == Investing Strategies == Hedgeable did not follow a buy-and-hold approach, but instead actively manages accounts in response to market movements focusing on downside protection in bear markets. Their strategy was different from other robo-advisors, which use Modern Portfolio Theory. Hedgeable offered investment options including Exchange Traded Funds (ETFs) to individual stocks, master limited partnerships, private equity and bitcoin. Mutual funds were not used in portfolios. Although the firm's focus was to provide a direct-to-consumer service, Hedgeable's investment strategies were available to financial advisors and institutions as well through a variety of platforms. == Product Features == When it was open to external clients, Hedgeable aimed to gamify their personal finance experience. Clients could open a new account or transfer an existing account. Hedgeable accepted retirement accounts, taxable accounts, business accounts and various other account types. Hedgeable offered the following features: Downside protection Account aggregation Alternative investments Alpha rewards API Mobile app It was awarded 4/5 for client transparency by Paladin Research. Hedgeable was the winner of the Finovate Fall 2015 Best of Show Award and the GREAT 2015 Tech Award (FinTech Category). In 2016, Hedgeable launched its first iOS mobile app in order to expand their product offerings.
One-class classification
In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, is an approach to the training of binary classifiers in which only examples of one of the two classes are used. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or assessing the operational status of a nuclear plant as 'normal': In such scenarios, there are few, if any, examples of the catastrophic system states – rare outliers – that comprise the second class. Alternatively, the class that is being focused on may cover a small, coherent subset of the data and the training may rely on an information bottleneck approach. In practice, counter-examples from the second class may be used in later rounds of training to further refine the algorithm. == Overview == The term one-class classification (OCC) was coined by Moya & Hush (1996) and many applications can be found in scientific literature, for example outlier detection, anomaly detection, novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling is not strictly required for non-target classes. == Introduction == SVM based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius r, and center c) consisting of all the data points. This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form, min r , c r 2 subject to, | | Φ ( x i ) − c | | 2 ≤ r 2 ∀ i = 1 , 2 , . . . , n {\displaystyle \min _{r,c}r^{2}{\text{ subject to, }}||\Phi (x_{i})-c||^{2}\leq r^{2}\;\;\forall i=1,2,...,n} However, the above formulation is highly restrictive, and is sensitive to the presence of outliers. Therefore, a flexible formulation, that allow for the presence of outliers is formulated as shown below, min r , c , ζ r 2 + 1 ν n ∑ i = 1 n ζ i {\displaystyle \min _{r,c,\zeta }r^{2}+{\frac {1}{\nu n}}\sum _{i=1}^{n}\zeta _{i}} subject to, | | Φ ( x i ) − c | | 2 ≤ r 2 + ζ i ∀ i = 1 , 2 , . . . , n {\displaystyle {\text{subject to, }}||\Phi (x_{i})-c||^{2}\leq r^{2}+\zeta _{i}\;\;\forall i=1,2,...,n} From the Karush–Kuhn–Tucker conditions for optimality, we get c = ∑ i = 1 n α i Φ ( x i ) , {\displaystyle c=\sum _{i=1}^{n}\alpha _{i}\Phi (x_{i}),} where the α i {\displaystyle \alpha _{i}} 's are the solution to the following optimization problem: max α ∑ i = 1 n α i κ ( x i , x i ) − ∑ i , j = 1 n α i α j κ ( x i , x j ) {\displaystyle \max _{\alpha }\sum _{i=1}^{n}\alpha _{i}\kappa (x_{i},x_{i})-\sum _{i,j=1}^{n}\alpha _{i}\alpha _{j}\kappa (x_{i},x_{j})} subject to, ∑ i = 1 n α i = 1 and 0 ≤ α i ≤ 1 ν n for all i = 1 , 2 , . . . , n . {\displaystyle \sum _{i=1}^{n}\alpha _{i}=1{\text{ and }}0\leq \alpha _{i}\leq {\frac {1}{\nu n}}{\text{for all }}i=1,2,...,n.} The introduction of kernel function provide additional flexibility to the One-class SVM (OSVM) algorithm. === PU (Positive Unlabeled) learning === A similar problem is PU learning, in which a binary classifier is constructed by semi-supervised learning from only positive and unlabeled sample points. In PU learning, two sets of examples are assumed to be available for training: the positive set P {\displaystyle P} and a mixed set U {\displaystyle U} , which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semisupervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm. PU learning has been successfully applied to text, time series, bioinformatics tasks, and remote sensing data. == Approaches == Several approaches have been proposed to solve one-class classification (OCC). The approaches can be distinguished into three main categories, density estimation, boundary methods, and reconstruction methods. === Density estimation methods === Density estimation methods rely on estimating the density of the data points, and set the threshold. These methods rely on assuming distributions, such as Gaussian, or a Poisson distribution. Following which discordancy tests can be used to test the new objects. These methods are robust to scale variance. Gaussian model is one of the simplest methods to create one-class classifiers. Due to Central Limit Theorem (CLT), these methods work best when large number of samples are present, and they are perturbed by small independent error values. The probability distribution for a d-dimensional object is given by: p N ( z ; μ ; Σ ) = 1 ( 2 π ) d 2 | Σ | 1 2 exp { − 1 2 ( z − μ ) T Σ − 1 ( z − μ ) } {\displaystyle p_{\mathcal {N}}(z;\mu ;\Sigma )={\frac {1}{(2\pi )^{\frac {d}{2}}|\Sigma |^{\frac {1}{2}}}}\exp \left\{-{\frac {1}{2}}(z-\mu )^{T}\Sigma ^{-1}(z-\mu )\right\}} Where, μ {\displaystyle \mu } is the mean and Σ {\displaystyle \Sigma } is the covariance matrix. Computing the inverse of covariance matrix ( Σ − 1 {\displaystyle \Sigma ^{-1}} ) is the costliest operation, and in the cases where the data is not scaled properly, or data has singular directions pseudo-inverse Σ + {\displaystyle \Sigma ^{+}} is used to approximate the inverse, and is calculated as Σ T ( Σ Σ T ) − 1 {\displaystyle \Sigma ^{T}(\Sigma \Sigma ^{T})^{-1}} . === Boundary methods === Boundary methods focus on setting boundaries around a few set of points, called target points. These methods attempt to optimize the volume. Boundary methods rely on distances, and hence are not robust to scale variance. K-centers method, NN-d, and SVDD are some of the key examples. K-centers In K-center algorithm, k {\displaystyle k} small balls with equal radius are placed to minimize the maximum distance of all minimum distances between training objects and the centers. Formally, the following error is minimized, ε k − c e n t e r = max i ( min k | | x i − μ k | | 2 ) {\displaystyle \varepsilon _{k-center}=\max _{i}(\min _{k}||x_{i}-\mu _{k}||^{2})} The algorithm uses forward search method with random initialization, where the radius is determined by the maximum distance of the object, any given ball should capture. After the centers are determined, for any given test object z {\displaystyle z} the distance can be calculated as, d k − c e n t r ( z ) = min k | | z − μ k | | 2 {\displaystyle d_{k-centr}(z)=\min _{k}||z-\mu _{k}||^{2}} === Reconstruction methods === Reconstruction methods use prior knowledge and generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are, k-means clustering, learning vector quantization, self-organizing maps, etc. == Applications == === Document classification === The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples, however studies have shown there are many valid reasons for using only positive examples. When the SVM algorithm is modified to only use positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful to the SVM paradigm is in trying to identify a web browser's sites of interest based only off of the user's browsing history. === Biomedical studies === One-class classification can be particularly useful in biomedical studies where often data from other classes can be difficult or impossible to obtain. In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal). The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters. To apply typicality to one-class classification for biomedical studies, each new observation, y 0 {\displaystyle y_{0}} , is compared to the target class, C {\displaystyle C} , and identified as an outlier or a member of the target class. === Unsupervised Concept Drift Detection === One-class classification has similarities with unsupervised concept drift detection, where both aim to identify whether the unseen data share similar characteristics to the initial data. A concept is referred to as the fixed probability distribution which data is drawn from. In unsupervised concept drift detection, the goal is to detect if the data distribution changes without utilizing class labels. In one-class classification, the flow of data is not important. Unseen data is classified as typical or outlier depending on its characteristics, whether it is from the initi