Kruti

Kruti

Kruti is a multilingual AI agent and chatbot developed by the Indian company Ola Krutrim. It is designed to perform real-world tasks for users, such as booking taxis and ordering food, by integrating directly with various online services. It is notable for its ability to understand and respond in multiple Indian languages. Developed by a team founded by Bhavish Aggarwal, Kruti functions as an "agentic" AI, meaning it can reason, plan, and execute multi-step tasks to fulfill a user's request. The backend technology combines several open-source large language models with Ola's proprietary Krutrim V2 model. The system was developed to work primarily on smartphones, addressing the Indian market's specific needs, including language diversity and potential bandwidth constraints. Kruti was officially released in June 2025, replacing an earlier chatbot from the company that was also named Krutrim. Initially supporting 13 languages, the company plans to expand its capabilities to 22 Indian languages. == Background == Kruti is an improved version of Ola's Krutrim chatbot, which was first launched in 2023 and was intended to be replaced by Kruti. It was officially released on 12 June 2025 as an upgrade to passive chatbots, with support for text and voice in 13 Indian languages. As an agentic AI, it can execute tasks with customization and reasoning, providing adaptive answers based on user preferences and past interactions. Kruti is optimized for smartphone usage and designed to accommodate bandwidth constraints and usage patterns in India. To ensure scalability and cost-effective performance, it combines various open-source large language models with Ola's own Krutrim V2, which has 12 billion parameters. Its speech recognition is built to identify regional Indian languages, dialects, and accents. Due to its integration with numerous apps and services, Kruti is context-aware and can proactively complete tasks. Initially connected only with Ola ecosystem services, Krutrim intends to expand and incorporate various Indian services into Kruti, with the goal of adding services from Blinkit, Swiggy, and Uber with respective voice command support. On 20 June 2025, Krutrim acquired the AI platform BharatSah‘AI’yak to increase its involvement in government, education, and agriculture projects. This acquisition will allow Kruti to assist in broadening the scope of BharatSah'AI'yak's work on India-centric, vernacular retrieval-augmented generation AI bots. == Development == Kruti is designed to perform tasks with minimal user input, accepting documents, images, and text, without requiring users to switch between applications. Its agentic framework breaks queries into sub-tasks executed by multiple agents working sequentially or concurrently, with reported accuracy exceeding 90%. Kruti connects to company databases and APIs via the Model Context Protocol and presents responses as summaries, tables, or narratives adapted to user behaviour. The system supports payments via credit/debit cards and UPI. The underlying stack, which includes foundation models and AI training and inference systems, is intended to support adaptation across sectors such as healthcare, education, and finance. Ola Cabs and the Open Network for Digital Commerce have begun integrating Kruti into their platforms pending broader reliability testing.

Serverless computing

Serverless computing is "a cloud service category where the customer can use different cloud capability types without the customer having to provision, deploy and manage either hardware or software resources, other than providing customer application code or providing customer data. Serverless computing represents a form of virtualized computing", according to ISO/IEC 22123-2. Serverless computing is a broad ecosystem that includes the cloud provider, function as a service (FaaS), managed services, tools, frameworks, engineers, stakeholders, and other interconnected elements. == Overview == Serverless is a misnomer in the sense that servers are still used by cloud service providers to execute code for developers. The definition of serverless computing has evolved over time, leading to varied interpretations. According to Ben Kehoe, serverless represents a spectrum rather than a rigid definition. Emphasis should shift from strict definitions and specific technologies to adopting a serverless mindset, focusing on leveraging serverless solutions to address business challenges. Serverless computing does not eliminate complexity but shifts much of it from the operations team to the development team. However, this shift is not absolute, as operations teams continue to manage aspects such as identity and access management (IAM), networking, security policies, and cost optimization. Additionally, while breaking down applications into finer-grained components can increase management complexity, the relationship between granularity and management difficulty is not strictly linear. There is often an optimal level of modularization where the benefits outweigh the added management overhead. According to Yan Cui, serverless techniques should be adopted only when they help to deliver customer value faster. And while adopting, organizations should take small steps and de-risk along the way. == Challenges == Serverless applications are prone to fallacies of distributed computing. In addition, they are prone to the following fallacies: Versioning is simple Compensating transactions always work Observability is optional === Monitoring and debugging === Monitoring and debugging serverless applications can present unique challenges due to their distributed, event-driven nature and proprietary environments. Traditional tools may fall short, making it difficult to track execution flows across services. However, modern solutions such as distributed tracing tools (e.g., AWS X-Ray, Datadog), centralized logging, and cloud-agnostic observability platforms are mitigating these challenges. Emerging technologies like OpenTelemetry, AI-powered anomaly detection, and serverless-specific frameworks are further improving visibility and root cause analysis. While challenges persist, advancements in monitoring and debugging tools are steadily addressing these limitations. === Security === According to OWASP, serverless applications are vulnerable to variations of traditional attacks, insecure code, and some serverless-specific attacks (like denial of wallet). So, the risks have changed and attack prevention requires a shift in mindset. === Vendor lock-in === Serverless computing is provided as a third-party service. Applications and software that run in the serverless environment are by default locked to a specific cloud vendor. This issue is exacerbated in serverless computing, as with its increased level of abstraction, public vendors only allow customers to upload code to a FaaS platform without the authority to configure underlying environments. More importantly, when considering a more complex workflow that includes backend-as-a-service (BaaS), a BaaS offering can typically only natively trigger a FaaS offering from the same provider. This makes the workload migration in serverless computing virtually impossible. Therefore, considering how to design and deploy serverless workflows from a multi-cloud perspective could mitigate this. == High-performance computing == Serverless computing may not be ideal for certain high-performance computing (HPC) workloads due to resource limits often imposed by cloud providers, including maximum memory, CPU, and runtime restrictions. For workloads requiring sustained or predictable resource usage, bulk-provisioned servers can sometimes be more cost-effective than the pay-per-use model typical of serverless platforms. However, serverless computing is increasingly capable of supporting specific HPC workloads, particularly those that are highly parallelizable and event-driven, by leveraging its scalability and elasticity. The suitability of serverless computing for HPC continues to evolve with advancements in cloud technologies. == Anti-patterns == The grain of sand anti-pattern refers to the creation of excessively small components (e.g., functions) within a system, often resulting in increased complexity, operational overhead, and performance inefficiencies. Lambda pinball is a related anti-pattern that can occur in serverless architectures when functions (e.g., AWS Lambda, Azure functions) excessively invoke each other in fragmented chains, leading to latency, debugging and testing challenges, and reduced observability. These anti-patterns are associated with the formation of a distributed monolith. These anti-patterns are often addressed through the application of clear domain boundaries, which distinguish between public and published interfaces. Public interfaces are technically accessible interfaces, such as methods, classes, API endpoints, or triggers, but they do not come with formal stability guarantees. In contrast, published interfaces involve an explicit stability contract, including formal versioning, thorough documentation, a defined deprecation policy, and often support for backward compatibility. Published interfaces may also require maintaining multiple versions simultaneously and adhering to formal deprecation processes when breaking changes are introduced. Fragmented chains of function calls are often observed in systems where serverless components (functions) interact with other resources in complex patterns, sometimes described as spaghetti architecture or a distributed monolith. In contrast, systems exhibiting clearer boundaries typically organize serverless components into cohesive groups, where internal public interfaces manage inter-component communication, and published interfaces define communication across group boundaries. This distinction highlights differences in stability guarantees and maintenance commitments, contributing to reduced dependency complexity. Additionally, patterns associated with excessive serverless function chaining are sometimes addressed through architectural strategies that emphasize native service integrations instead of individual functions, a concept referred to as the functionless mindset. However, this approach is noted to involve a steeper learning curve, and integration limitations may vary even within the same cloud vendor ecosystem. Reporting on serverless databases presents challenges, as retrieving data for a reporting service can either break the bounded contexts, reduce the timeliness of the data, or do both. This applies regardless of whether data is pulled directly from databases, retrieved via HTTP, or collected in batches. Mark Richards refers to this as the reach-in reporting anti-pattern. A possible alternative to this approach is for databases to asynchronously push the necessary data to the reporting service instead of the reporting service pulling it. While this method requires a separate contract between services and the reporting service and can be complex to implement, it helps preserve bounded contexts while maintaining a high level of data timeliness. == Principles == Adopting DevSecOps practices can help improve the use and security of serverless technologies. In serverless applications, the distinction between infrastructure and business logic is often blurred, with applications typically distributed across multiple services. To maximize the effectiveness of testing, integration testing is emphasized for serverless applications. Additionally, to facilitate debugging and implementation, orchestration is used within the bounded context, while choreography is employed between different bounded contexts. Ephemeral resources are typically kept together to maintain high cohesion. However, shared resources with long spin-up times, such as AWS RDS clusters and landing zones, are often managed in separate repositories, deployment pipeline, and stacks.

Word2vec

Word2vec is a technique in natural language processing for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov, Kai Chen, Greg Corrado, Ilya Sutskever and Jeff Dean at Google, and published in 2013. Word2vec represents a word as a high-dimension vector of numbers which capture relationships between words. In particular, words which appear in similar contexts are mapped to vectors which are nearby as measured by cosine similarity. This indicates the level of semantic similarity between the words, so for example the vectors for walk and ran are nearby, as are those for "but" and "however", and "Berlin" and "Germany". == Approach == Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a mapping of the set of words to a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a vector in the space. Word2vec can use either of two model architectures to produce these distributed representations of words: continuous bag of words (CBOW) or continuously sliding skip-gram. In both architectures, word2vec considers both individual words and a sliding context window as it iterates over the corpus. The CBOW can be viewed as a 'fill in the blank' task, where the word embedding represents the way the word influences the relative probabilities of other words in the context window. Words which are semantically similar should influence these probabilities in similar ways, because semantically similar words should be used in similar contexts. The order of context words does not influence prediction (bag of words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram does a better job for infrequent words. After the model is trained, the learned word embeddings are positioned in the vector space such that words that share common contexts in the corpus — that is, words that are semantically and syntactically similar — are located close to one another in the space. More dissimilar words are located farther from one another in the space. == Mathematical details == This section is based on expositions. A corpus is a sequence of words. Both CBOW and skip-gram are methods to learn one vector per word appearing in the corpus. Let V {\displaystyle V} ("vocabulary") be the set of all words appearing in the corpus C {\displaystyle C} . Our goal is to learn one vector v w ∈ R d {\displaystyle v_{w}\in \mathbb {R} ^{d}} for each word w ∈ V {\displaystyle w\in V} . The idea of skip-gram is that the vector of a word should be close to the vector of each of its neighbors. The idea of CBOW is that the vector-sum of a word's neighbors should be close to the vector of the word. === Continuous bag-of-words (CBOW) === The idea of CBOW is to represent each word with a vector, such that it is possible to predict a word using the sum of the vectors of its neighbors. Specifically, for each word w i {\displaystyle w_{i}} in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of w i {\displaystyle w_{i}} . The objective of training is to maximize ∑ i ln ⁡ Pr ( w i ∣ w i + j : j ∈ N ) {\displaystyle \sum _{i}\ln \Pr(w_{i}\mid w_{i+j}\colon j\in N)} where N {\displaystyle N} is a set of (non-zero) indices representing the relative locations of nearby words considered to be in w i {\displaystyle w_{i}} 's neighborhood. For example, if we want each word in the corpus to be predicted by every other word in a small span of 4 words. The set of relative indexes of neighbor words will be: N = { − 2 , − 1 , + 1 , + 2 } {\displaystyle N=\{-2,-1,+1,+2\}} , and the objective is to maximize ∑ i ln ⁡ Pr ( w i ∣ w i − 2 , w i − 1 , w i + 1 , w i + 2 ) {\displaystyle \sum _{i}\ln \Pr(w_{i}\mid w_{i-2},w_{i-1},w_{i+1},w_{i+2})} . In standard bag-of-words, a word's context is represented by a word-count (aka a word histogram) of its neighboring words. For example, the "sat" in "the cat sat on the mat" is represented as {"the": 2, "cat": 1, "on": 1}. Note that the last word "mat" is not used to represent "sat", because it is outside the neighborhood N = { − 2 , − 1 , + 1 , + 2 } {\displaystyle N=\{-2,-1,+1,+2\}} . In continuous bag-of-words, the histogram is multiplied by a matrix V {\displaystyle V} to obtain a continuous representation of the word's context. The matrix V {\displaystyle V} is also called a dictionary. Its columns are the word vectors. It has D {\displaystyle D} columns, where D {\displaystyle D} is the size of the dictionary. Let d {\displaystyle d} be the length of each word vector. We have V ∈ R d × D {\displaystyle V\in \mathbb {R} ^{d\times D}} . For example, multiplying the word histogram {"the": 2, "cat": 1, "on": 1} with V {\displaystyle V} , we obtain 2 v the + v cat + v on {\displaystyle 2v_{\text{the}}+v_{\text{cat}}+v_{\text{on}}} . This is then multiplied with another matrix V ′ {\displaystyle V'} of shape R D × d {\displaystyle \mathbb {R} ^{D\times d}} . Each row of it is a word vector v ′ {\displaystyle v'} . This results in a vector of length D {\displaystyle D} , one entry per dictionary entry. Then, apply the softmax to obtain a probability distribution over the dictionary. This system can be visualized as a neural network, similar in spirit to an autoencoder, of architecture linear-linear-softmax, as depicted in the diagram. The system is trained by gradient descent to minimize the cross-entropy loss. In full formula, the cross-entropy loss is: − ∑ i ln ⁡ e v w i ′ ⋅ ( ∑ j ∈ N v w j + i ) ∑ w ′ e v w ′ ′ ⋅ ( ∑ j ∈ N v w j + i ) {\displaystyle -\sum _{i}\ln {\frac {e^{v_{w_{i}}'\cdot (\sum _{j\in N}v_{w_{j+i}})}}{\sum _{w'}e^{v_{w'}'\cdot (\sum _{j\in N}v_{w_{j+i}})}}}} where the outer summation ∑ i {\displaystyle \sum _{i}} is over the words in a corpus, the quantity ∑ j ∈ N v w j + i {\displaystyle \sum _{j\in N}v_{w_{j+i}}} is the sum of a word's neighbors' vectors, etc. Once such a system is trained, we have two trained matrices V , V ′ {\displaystyle V,V'} . Either the column vectors of V {\displaystyle V} or the row vectors of V ′ {\displaystyle V'} can serve as the dictionary. For example, the word "sat" can be represented as either the "sat"-th column of V {\displaystyle V} or the "sat"-th row of V ′ {\displaystyle V'} . It is also possible to simply define V ′ = V ⊤ {\displaystyle V'=V^{\top }} , in which case there would no longer be a choice. === Skip-gram === The idea of skip-gram is to represent each word with a vector, such that it is possible to predict the vectors of its neighbors using the vector of a word. The architecture is still linear-linear-softmax, the same as CBOW, but the input and the output are switched. Specifically, for each word w i {\displaystyle w_{i}} in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of w i {\displaystyle w_{i}} . The objective of training is to maximize ∑ i ∑ j ∈ N ln ⁡ Pr ( w j + i ∣ w i ) {\displaystyle \sum _{i}\sum _{j\in N}\ln \Pr(w_{j+i}\mid w_{i})} . In full formula, the loss function is − ∑ i ∑ j ∈ N ln ⁡ e v w j + i ′ ⋅ v w i ∑ w ′ e v w ′ ′ ⋅ v w i {\displaystyle -\sum _{i}\sum _{j\in N}\ln {\frac {e^{v_{w_{j+i}}'\cdot v_{w_{i}}}}{\sum _{w'}e^{v_{w'}'\cdot v_{w_{i}}}}}} Same as CBOW, once such a system is trained, we have two trained matrices V , V ′ {\displaystyle V,V'} . Either the column vectors of V {\displaystyle V} or the row vectors of V ′ {\displaystyle V'} can serve as the dictionary. It is also possible to simply define V ′ = V ⊤ {\displaystyle V'=V^{\top }} , in which case there would no longer be a choice. Essentially, skip-gram and CBOW are exactly the same in architecture. They only differ in the objective function during training. == History == During the 1980s, there were some early attempts at using neural networks to represent words and concepts as vectors. In 2010, Tomáš Mikolov (then at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden

Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms. Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive autoencoders), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which can be used as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data. == Mathematical principles == === Definition === An autoencoder is defined by the following components: Two sets: the space of encoded messages Z {\displaystyle {\mathcal {Z}}} ; the space of decoded messages X {\displaystyle {\mathcal {X}}} . Typically X {\displaystyle {\mathcal {X}}} and Z {\displaystyle {\mathcal {Z}}} are Euclidean spaces, that is, X = R m , Z = R n {\displaystyle {\mathcal {X}}=\mathbb {R} ^{m},{\mathcal {Z}}=\mathbb {R} ^{n}} with m > n . {\displaystyle m>n.} Two parametrized families of functions: the encoder family E ϕ : X → Z {\displaystyle E_{\phi }:{\mathcal {X}}\rightarrow {\mathcal {Z}}} , parametrized by ϕ {\displaystyle \phi } ; the decoder family D θ : Z → X {\displaystyle D_{\theta }:{\mathcal {Z}}\rightarrow {\mathcal {X}}} , parametrized by θ {\displaystyle \theta } .For any x ∈ X {\displaystyle x\in {\mathcal {X}}} , we usually write z = E ϕ ( x ) {\displaystyle z=E_{\phi }(x)} , and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any z ∈ Z {\displaystyle z\in {\mathcal {Z}}} , we usually write x ′ = D θ ( z ) {\displaystyle x'=D_{\theta }(z)} , and refer to it as the (decoded) message. Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder E ϕ {\displaystyle E_{\phi }} is: E ϕ ( x ) = σ ( W x + b ) {\displaystyle E_{\phi }(\mathbf {x} )=\sigma (Wx+b)} where σ {\displaystyle \sigma } is an element-wise activation function, W {\displaystyle W} is a "weight" matrix, and b {\displaystyle b} is a "bias" vector. === Training an autoencoder === An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution μ r e f {\displaystyle \mu _{ref}} over X {\displaystyle {\mathcal {X}}} , and a "reconstruction quality" function d : X × X → [ 0 , ∞ ] {\displaystyle d:{\mathcal {X}}\times {\mathcal {X}}\to [0,\infty ]} , such that d ( x , x ′ ) {\displaystyle d(x,x')} measures how much x ′ {\displaystyle x'} differs from x {\displaystyle x} . With those, we can define the loss function for the autoencoder as L ( θ , ϕ ) := E x ∼ μ r e f [ d ( x , D θ ( E ϕ ( x ) ) ) ] {\displaystyle L(\theta ,\phi ):=\mathbb {\mathbb {E} } _{x\sim \mu _{ref}}[d(x,D_{\theta }(E_{\phi }(x)))]} The optimal autoencoder for the given task ( μ r e f , d ) {\displaystyle (\mu _{ref},d)} is then arg ⁡ min θ , ϕ L ( θ , ϕ ) {\displaystyle \arg \min _{\theta ,\phi }L(\theta ,\phi )} . The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder". In most situations, the reference distribution is just the empirical distribution given by a dataset { x 1 , . . . , x N } ⊂ X {\displaystyle \{x_{1},...,x_{N}\}\subset {\mathcal {X}}} , so that μ r e f = 1 N ∑ i = 1 N δ x i {\displaystyle \mu _{ref}={\frac {1}{N}}\sum _{i=1}^{N}\delta _{x_{i}}} where δ x i {\displaystyle \delta _{x_{i}}} is the Dirac measure, the quality function is just L 2 {\displaystyle L^{2}} loss: d ( x , x ′ ) = ‖ x − x ′ ‖ 2 2 {\displaystyle d(x,x')=\|x-x'\|_{2}^{2}} , and ‖ ⋅ ‖ 2 {\displaystyle \|\cdot \|_{2}} is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization: min θ , ϕ L ( θ , ϕ ) , where L ( θ , ϕ ) = 1 N ∑ i = 1 N ‖ x i − D θ ( E ϕ ( x i ) ) ‖ 2 2 {\displaystyle \min _{\theta ,\phi }L(\theta ,\phi ),\qquad {\text{where }}L(\theta ,\phi )={\frac {1}{N}}\sum _{i=1}^{N}\|x_{i}-D_{\theta }(E_{\phi }(x_{i}))\|_{2}^{2}} === Interpretation === An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function d {\displaystyle d} . The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space Z {\displaystyle {\mathcal {Z}}} usually has fewer dimensions than the message space X {\displaystyle {\mathcal {X}}} . Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality. At the limit of an ideal undercomplete autoencoder, every possible code z {\displaystyle z} in the code space is used to encode a message x {\displaystyle x} that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} , and the decoder is also perfect: D θ ( E ϕ ( x ) ) = x {\displaystyle D_{\theta }(E_{\phi }(x))=x} . This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code z {\displaystyle z} and obtaining D θ ( z ) {\displaystyle D_{\theta }(z)} , which is a message that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} . If the code space Z {\displaystyle {\mathcal {Z}}} has dimension larger than (overcomplete), or equal to, the message space X {\displaystyle {\mathcal {X}}} , or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below. == Variations == === Variational autoencoder (VAE) === Variational autoencoders (VAEs) belong to the families of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors. Given an input dataset x {\displaystyle x} characterized by an unknown probability function P ( x ) {\displaystyle P(x)} and a multivariate latent encoding vector z {\displaystyle z} , the objective is to model the data as a distribution p θ ( x ) {\displaystyle p_{\theta }(x)} , with θ {\displaystyle \theta } defined as the set of the network parameters so that p θ ( x ) = ∫ z p θ ( x , z ) d z {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }(x,z)dz} . === Sparse autoencoder (SAE) === Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders (SAE) are variants of autoencoders, such that the codes E ϕ ( x ) {\displaystyle E_{\phi }(x)} for messages tend to be sparse codes, that is, E ϕ ( x ) {\displaystyle E_{\phi }(x)} is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks. There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder. The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder: f k ( x 1 , . . . , x n ) = ( x 1 b 1 , . . . , x n b n ) {\displaystyle f_{k}(x_{1},...,x_{n})=(x_{1}b_{1},...,x_{n}b_{n})} where b i = 1 {\displaystyle b_{i}=1} if | x i | {\displaystyle |x_{i}|} ranks in the top k, and 0 otherwise. Backpropagating through f k {\displaystyle f_{k}} is simple: set gradient to 0 for b i = 0 {\displaystyle b_{i}=0} entries, and keep gradient for b i = 1 {\displaystyle b_{i}=1} entries. This is essentially a generalized ReLU function. The other way is a relaxed version of the k-

One-class classification

In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, is an approach to the training of binary classifiers in which only examples of one of the two classes are used. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or assessing the operational status of a nuclear plant as 'normal': In such scenarios, there are few, if any, examples of the catastrophic system states – rare outliers – that comprise the second class. Alternatively, the class that is being focused on may cover a small, coherent subset of the data and the training may rely on an information bottleneck approach. In practice, counter-examples from the second class may be used in later rounds of training to further refine the algorithm. == Overview == The term one-class classification (OCC) was coined by Moya & Hush (1996) and many applications can be found in scientific literature, for example outlier detection, anomaly detection, novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling is not strictly required for non-target classes. == Introduction == SVM based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius r, and center c) consisting of all the data points. This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form, min r , c r 2 subject to, | | Φ ( x i ) − c | | 2 ≤ r 2 ∀ i = 1 , 2 , . . . , n {\displaystyle \min _{r,c}r^{2}{\text{ subject to, }}||\Phi (x_{i})-c||^{2}\leq r^{2}\;\;\forall i=1,2,...,n} However, the above formulation is highly restrictive, and is sensitive to the presence of outliers. Therefore, a flexible formulation, that allow for the presence of outliers is formulated as shown below, min r , c , ζ r 2 + 1 ν n ∑ i = 1 n ζ i {\displaystyle \min _{r,c,\zeta }r^{2}+{\frac {1}{\nu n}}\sum _{i=1}^{n}\zeta _{i}} subject to, | | Φ ( x i ) − c | | 2 ≤ r 2 + ζ i ∀ i = 1 , 2 , . . . , n {\displaystyle {\text{subject to, }}||\Phi (x_{i})-c||^{2}\leq r^{2}+\zeta _{i}\;\;\forall i=1,2,...,n} From the Karush–Kuhn–Tucker conditions for optimality, we get c = ∑ i = 1 n α i Φ ( x i ) , {\displaystyle c=\sum _{i=1}^{n}\alpha _{i}\Phi (x_{i}),} where the α i {\displaystyle \alpha _{i}} 's are the solution to the following optimization problem: max α ∑ i = 1 n α i κ ( x i , x i ) − ∑ i , j = 1 n α i α j κ ( x i , x j ) {\displaystyle \max _{\alpha }\sum _{i=1}^{n}\alpha _{i}\kappa (x_{i},x_{i})-\sum _{i,j=1}^{n}\alpha _{i}\alpha _{j}\kappa (x_{i},x_{j})} subject to, ∑ i = 1 n α i = 1 and 0 ≤ α i ≤ 1 ν n for all i = 1 , 2 , . . . , n . {\displaystyle \sum _{i=1}^{n}\alpha _{i}=1{\text{ and }}0\leq \alpha _{i}\leq {\frac {1}{\nu n}}{\text{for all }}i=1,2,...,n.} The introduction of kernel function provide additional flexibility to the One-class SVM (OSVM) algorithm. === PU (Positive Unlabeled) learning === A similar problem is PU learning, in which a binary classifier is constructed by semi-supervised learning from only positive and unlabeled sample points. In PU learning, two sets of examples are assumed to be available for training: the positive set P {\displaystyle P} and a mixed set U {\displaystyle U} , which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semisupervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm. PU learning has been successfully applied to text, time series, bioinformatics tasks, and remote sensing data. == Approaches == Several approaches have been proposed to solve one-class classification (OCC). The approaches can be distinguished into three main categories, density estimation, boundary methods, and reconstruction methods. === Density estimation methods === Density estimation methods rely on estimating the density of the data points, and set the threshold. These methods rely on assuming distributions, such as Gaussian, or a Poisson distribution. Following which discordancy tests can be used to test the new objects. These methods are robust to scale variance. Gaussian model is one of the simplest methods to create one-class classifiers. Due to Central Limit Theorem (CLT), these methods work best when large number of samples are present, and they are perturbed by small independent error values. The probability distribution for a d-dimensional object is given by: p N ( z ; μ ; Σ ) = 1 ( 2 π ) d 2 | Σ | 1 2 exp ⁡ { − 1 2 ( z − μ ) T Σ − 1 ( z − μ ) } {\displaystyle p_{\mathcal {N}}(z;\mu ;\Sigma )={\frac {1}{(2\pi )^{\frac {d}{2}}|\Sigma |^{\frac {1}{2}}}}\exp \left\{-{\frac {1}{2}}(z-\mu )^{T}\Sigma ^{-1}(z-\mu )\right\}} Where, μ {\displaystyle \mu } is the mean and Σ {\displaystyle \Sigma } is the covariance matrix. Computing the inverse of covariance matrix ( Σ − 1 {\displaystyle \Sigma ^{-1}} ) is the costliest operation, and in the cases where the data is not scaled properly, or data has singular directions pseudo-inverse Σ + {\displaystyle \Sigma ^{+}} is used to approximate the inverse, and is calculated as Σ T ( Σ Σ T ) − 1 {\displaystyle \Sigma ^{T}(\Sigma \Sigma ^{T})^{-1}} . === Boundary methods === Boundary methods focus on setting boundaries around a few set of points, called target points. These methods attempt to optimize the volume. Boundary methods rely on distances, and hence are not robust to scale variance. K-centers method, NN-d, and SVDD are some of the key examples. K-centers In K-center algorithm, k {\displaystyle k} small balls with equal radius are placed to minimize the maximum distance of all minimum distances between training objects and the centers. Formally, the following error is minimized, ε k − c e n t e r = max i ( min k | | x i − μ k | | 2 ) {\displaystyle \varepsilon _{k-center}=\max _{i}(\min _{k}||x_{i}-\mu _{k}||^{2})} The algorithm uses forward search method with random initialization, where the radius is determined by the maximum distance of the object, any given ball should capture. After the centers are determined, for any given test object z {\displaystyle z} the distance can be calculated as, d k − c e n t r ( z ) = min k | | z − μ k | | 2 {\displaystyle d_{k-centr}(z)=\min _{k}||z-\mu _{k}||^{2}} === Reconstruction methods === Reconstruction methods use prior knowledge and generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are, k-means clustering, learning vector quantization, self-organizing maps, etc. == Applications == === Document classification === The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples, however studies have shown there are many valid reasons for using only positive examples. When the SVM algorithm is modified to only use positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful to the SVM paradigm is in trying to identify a web browser's sites of interest based only off of the user's browsing history. === Biomedical studies === One-class classification can be particularly useful in biomedical studies where often data from other classes can be difficult or impossible to obtain. In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal). The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters. To apply typicality to one-class classification for biomedical studies, each new observation, y 0 {\displaystyle y_{0}} , is compared to the target class, C {\displaystyle C} , and identified as an outlier or a member of the target class. === Unsupervised Concept Drift Detection === One-class classification has similarities with unsupervised concept drift detection, where both aim to identify whether the unseen data share similar characteristics to the initial data. A concept is referred to as the fixed probability distribution which data is drawn from. In unsupervised concept drift detection, the goal is to detect if the data distribution changes without utilizing class labels. In one-class classification, the flow of data is not important. Unseen data is classified as typical or outlier depending on its characteristics, whether it is from the initi

Feature hashing

In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly (after a modulo operation), rather than looking the indices up in an associative array. In addition to its use for encoding non-numeric values, feature hashing can also be used for dimensionality reduction. This trick is often attributed to Weinberger et al. (2009), but there exists a much earlier description of this method published by John Moody in 1989. == Motivation == === Motivating example === In a typical document classification task, the input to the machine learning algorithm (both during learning and classification) is free text. From this, a bag of words (BOW) representation is constructed: the individual tokens are extracted and counted, and each distinct token in the training set defines a feature (independent variable) of each of the documents in both the training and test sets. Machine learning algorithms, however, are typically defined in terms of numerical vectors. Therefore, the bags of words for a set of documents is regarded as a term-document matrix where each row is a single document, and each column is a single feature/word; the entry i, j in such a matrix captures the frequency (or weight) of the j'th term of the vocabulary in document i. (An alternative convention swaps the rows and columns of the matrix, but this difference is immaterial.) Typically, these vectors are extremely sparse—according to Zipf's law. The common approach is to construct, at learning time or prior to that, a dictionary representation of the vocabulary of the training set, and use that to map words to indices. Hash tables and tries are common candidates for dictionary implementation. E.g., the three documents John likes to watch movies. Mary likes movies too. John also likes football. can be converted, using the dictionary to the term-document matrix ( John likes to watch movies Mary too also football 1 1 1 1 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 1 0 0 0 0 0 1 1 ) {\displaystyle {\begin{pmatrix}{\textrm {John}}&{\textrm {likes}}&{\textrm {to}}&{\textrm {watch}}&{\textrm {movies}}&{\textrm {Mary}}&{\textrm {too}}&{\textrm {also}}&{\textrm {football}}\\1&1&1&1&1&0&0&0&0\\0&1&0&0&1&1&1&0&0\\1&1&0&0&0&0&0&1&1\end{pmatrix}}} (Punctuation was removed, as is usual in document classification and clustering.) The problem with this process is that such dictionaries take up a large amount of storage space and grow in size as the training set grows. On the contrary, if the vocabulary is kept fixed and not increased with a growing training set, an adversary may try to invent new words or misspellings that are not in the stored vocabulary so as to circumvent a machine learned filter. To address this challenge, Yahoo! Research attempted to use feature hashing for their spam filters. Note that the hashing trick isn't limited to text classification and similar tasks at the document level, but can be applied to any problem that involves large (perhaps unbounded) numbers of features. === Mathematical motivation === Mathematically, a token is an element t {\displaystyle t} in a finite (or countably infinite) set T {\displaystyle T} . Suppose we only need to process a finite corpus, then we can put all tokens appearing in the corpus into T {\displaystyle T} , meaning that T {\displaystyle T} is finite. However, suppose we want to process all possible words made of the English letters, then T {\displaystyle T} is countably infinite. Most neural networks can only operate on real vector inputs, so we must construct a "dictionary" function ϕ : T → R n {\displaystyle \phi :T\to \mathbb {R} ^{n}} . When T {\displaystyle T} is finite, of size | T | = m ≤ n {\displaystyle |T|=m\leq n} , then we can use one-hot encoding to map it into R n {\displaystyle \mathbb {R} ^{n}} . First, arbitrarily enumerate T = { t 1 , t 2 , . . , t m } {\displaystyle T=\{t_{1},t_{2},..,t_{m}\}} , then define ϕ ( t i ) = e i {\displaystyle \phi (t_{i})=e_{i}} . In other words, we assign a unique index i {\displaystyle i} to each token, then map the token with index i {\displaystyle i} to the unit basis vector e i {\displaystyle e_{i}} . One-hot encoding is easy to interpret, but it requires one to maintain the arbitrary enumeration of T {\displaystyle T} . Given a token t ∈ T {\displaystyle t\in T} , to compute ϕ ( t ) {\displaystyle \phi (t)} , we must find out the index i {\displaystyle i} of the token t {\displaystyle t} . Thus, to implement ϕ {\displaystyle \phi } efficiently, we need a fast-to-compute bijection h : T → { 1 , . . . , m } {\displaystyle h:T\to \{1,...,m\}} , then we have ϕ ( t ) = e h ( t ) {\displaystyle \phi (t)=e_{h(t)}} . In fact, we can relax the requirement slightly: It suffices to have a fast-to-compute injection h : T → { 1 , . . . , n } {\displaystyle h:T\to \{1,...,n\}} , then use ϕ ( t ) = e h ( t ) {\displaystyle \phi (t)=e_{h(t)}} . In practice, there is no simple way to construct an efficient injection h : T → { 1 , . . . , n } {\displaystyle h:T\to \{1,...,n\}} . However, we do not need a strict injection, but only an approximate injection. That is, when t ≠ t ′ {\displaystyle t\neq t'} , we should probably have h ( t ) ≠ h ( t ′ ) {\displaystyle h(t)\neq h(t')} , so that probably ϕ ( t ) ≠ ϕ ( t ′ ) {\displaystyle \phi (t)\neq \phi (t')} . At this point, we have just specified that h {\displaystyle h} should be a hashing function. Thus we reach the idea of feature hashing. == Algorithms == === Feature hashing (Weinberger et al. 2009) === The basic feature hashing algorithm presented in (Weinberger et al. 2009) is defined as follows. First, one specifies two hash functions: the kernel hash h : T → { 1 , 2 , . . . , n } {\displaystyle h:T\to \{1,2,...,n\}} , and the sign hash ζ : T → { − 1 , + 1 } {\displaystyle \zeta :T\to \{-1,+1\}} . Next, one defines the feature hashing function: ϕ : T → R n , ϕ ( t ) = ζ ( t ) e h ( t ) {\displaystyle \phi :T\to \mathbb {R} ^{n},\quad \phi (t)=\zeta (t)e_{h(t)}} Finally, extend this feature hashing function to strings of tokens by ϕ : T ∗ → R n , ϕ ( t 1 , . . . , t k ) = ∑ j = 1 k ϕ ( t j ) {\displaystyle \phi :T^{}\to \mathbb {R} ^{n},\quad \phi (t_{1},...,t_{k})=\sum _{j=1}^{k}\phi (t_{j})} where T ∗ {\displaystyle T^{}} is the set of all finite strings consisting of tokens in T {\displaystyle T} . Equivalently, ϕ ( t 1 , . . . , t k ) = ∑ j = 1 k ζ ( t j ) e h ( t j ) = ∑ i = 1 n ( ∑ j : h ( t j ) = i ζ ( t j ) ) e i {\displaystyle \phi (t_{1},...,t_{k})=\sum _{j=1}^{k}\zeta (t_{j})e_{h(t_{j})}=\sum _{i=1}^{n}\left(\sum _{j:h(t_{j})=i}\zeta (t_{j})\right)e_{i}} ==== Geometric properties ==== We want to say something about the geometric property of ϕ {\displaystyle \phi } , but T {\displaystyle T} , by itself, is just a set of tokens, we cannot impose a geometric structure on it except the discrete topology, which is generated by the discrete metric. To make it nicer, we lift it to T → R T {\displaystyle T\to \mathbb {R} ^{T}} , and lift ϕ {\displaystyle \phi } from ϕ : T → R n {\displaystyle \phi :T\to \mathbb {R} ^{n}} to ϕ : R T → R n {\displaystyle \phi :\mathbb {R} ^{T}\to \mathbb {R} ^{n}} by linear extension: ϕ ( ( x t ) t ∈ T ) = ∑ t ∈ T x t ζ ( t ) e h ( t ) = ∑ i = 1 n ( ∑ t : h ( t ) = i x t ζ ( t ) ) e i {\displaystyle \phi ((x_{t})_{t\in T})=\sum _{t\in T}x_{t}\zeta (t)e_{h(t)}=\sum _{i=1}^{n}\left(\sum _{t:h(t)=i}x_{t}\zeta (t)\right)e_{i}} There is an infinite sum there, which must be handled at once. There are essentially only two ways to handle infinities. One may impose a metric, then take its completion, to allow well-behaved infinite sums, or one may demand that nothing is actually infinite, only potentially so. Here, we go for the potential-infinity way, by restricting R T {\displaystyle \mathbb {R} ^{T}} to contain only vectors with finite support: ∀ ( x t ) t ∈ T ∈ R T {\displaystyle \forall (x_{t})_{t\in T}\in \mathbb {R} ^{T}} , only finitely many entries of ( x t ) t ∈ T {\displaystyle (x_{t})_{t\in T}} are nonzero. Define an inner product on R T {\displaystyle \mathbb {R} ^{T}} in the obvious way: ⟨ e t , e t ′ ⟩ = { 1 , if t = t ′ , 0 , else. ⟨ x , x ′ ⟩ = ∑ t , t ′ ∈ T x t x t ′ ⟨ e t , e t ′ ⟩ {\displaystyle \langle e_{t},e_{t'}\rangle ={\begin{cases}1,{\text{ if }}t=t',\\0,{\text{ else.}}\end{cases}}\quad \langle x,x'\rangle =\sum _{t,t'\in T}x_{t}x_{t'}\langle e_{t},e_{t'}\rangle } As a side note, if T {\displaystyle T} is infinite, then the inner product space R T {\displaystyle \mathbb {R} ^{T}} is not complete. Taking its completion would get us to a Hilbert space, which allows well-behaved infinite sums. Now we have an inner product space, with enough structure to describe the geometry of the feature hashing function ϕ : R T → R n {\displaystyle \phi :\ma

Variable kernel density estimation

In statistics, adaptive or "variable-bandwidth" kernel density estimation is a form of kernel density estimation in which the size of the kernels used in the estimate are varied depending upon either the location of the samples or the location of the test point. It is a particularly effective technique when the sample space is multi-dimensional. == Rationale == Given a set of samples, { x → i } {\displaystyle \lbrace {\vec {x}}_{i}\rbrace } , we wish to estimate the density, P ( x → ) {\displaystyle P({\vec {x}})} , at a test point, x → {\displaystyle {\vec {x}}} : P ( x → ) ≈ W n h D {\displaystyle P({\vec {x}})\approx {\frac {W}{nh^{D}}}} W = ∑ i = 1 n w i {\displaystyle W=\sum _{i=1}^{n}w_{i}} w i = K ( x → − x → i h ) {\displaystyle w_{i}=K\left({\frac {{\vec {x}}-{\vec {x}}_{i}}{h}}\right)} where n is the number of samples, K is the "kernel", h is its width and D is the number of dimensions in x → {\displaystyle {\vec {x}}} . The kernel can be thought of as a simple, linear filter. Using a fixed filter width may mean that in regions of low density, all samples will fall in the tails of the filter with very low weighting, while regions of high density will find an excessive number of samples in the central region with weighting close to unity. To fix this problem, we vary the width of the kernel in different regions of the sample space. There are two methods of doing this: balloon and pointwise estimation. In a balloon estimator, the kernel width is varied depending on the location of the test point. In a pointwise estimator, the kernel width is varied depending on the location of the sample. For multivariate estimators, the parameter, h, can be generalized to vary not just the size, but also the shape of the kernel. This more complicated approach will not be covered here. == Balloon estimators == A common method of varying the kernel width is to make it inversely proportional to the density at the test point: h = k [ n P ( x → ) ] 1 / D {\displaystyle h={\frac {k}{\left[nP({\vec {x}})\right]^{1/D}}}} where k is a constant. If we back-substitute the estimated PDF, and assuming a Gaussian kernel function, we can show that W is a constant: W = k D ( 2 π ) D / 2 {\displaystyle W=k^{D}(2\pi )^{D/2}} A similar derivation holds for any kernel whose normalising function is of the order hD, although with a different constant factor in place of the (2 π)D/2 term. This produces a generalization of the k-nearest neighbour algorithm. That is, a uniform kernel function will return the KNN technique. There are two components to the error: a variance term and a bias term. The variance term is given as: e 1 = P ∫ K 2 n h D {\displaystyle e_{1}={\frac {P\int K^{2}}{nh^{D}}}} . The bias term is found by evaluating the approximated function in the limit as the kernel width becomes much larger than the sample spacing. By using a Taylor expansion for the real function, the bias term drops out: e 2 = h 2 n ∇ 2 P {\displaystyle e_{2}={\frac {h^{2}}{n}}\nabla ^{2}P} An optimal kernel width that minimizes the error of each estimate can thus be derived. == Use for statistical classification == The method is particularly effective when applied to statistical classification. There are two ways we can proceed: the first is to compute the PDFs of each class separately, using different bandwidth parameters, and then compare them as in Taylor. Alternatively, we can divide up the sum based on the class of each sample: P ( j , x → ) ≈ 1 n ∑ i = 1 , c i = j n w i {\displaystyle P(j,{\vec {x}})\approx {\frac {1}{n}}\sum _{i=1,c_{i}=j}^{n}w_{i}} where ci is the class of the ith sample. The class of the test point may be estimated through maximum likelihood.