AI Headshot Generator

AI Headshot Generator — hands-on reviews, top picks, pricing, pros and cons and a practical how-to guide on Aizhi.

  • Ibotta

    Ibotta

    Ibotta, Inc. is an American mobile technology company headquartered in Denver, Colorado. Founded in 2011, the company offers cash back rewards on various purchases through its Ibotta Performance Network and direct to consumer app. Ibotta partners with CPG (consumer packaged goods) brands and network publishers to provide these rewards. As of 2024, the company operates solely in the United States. The company's rewards-as-a-service offering, the Ibotta Performance Network, went live in 2022. In August 2019, Ibotta received a $1 billion valuation after its Series D funding, and in 2023, the company surpassed $1.5 billion cash rewards paid to over 50 million consumers since the company's founding. Ibotta became a publicly traded company in April 2024 with a listing on the New York Stock Exchange. As of September 2025, Ibotta is trading at approximately $27.13 per share, marking a 69% decline from its initial public offering price of $88 per share on April 18, 2024. == History == === Founding through early 2019 === Ibotta was founded by current CEO Bryan Leach. The company was incorporated in 2011 and the app launched to both the App Store and Google Play stores in 2012. Early investors included entrepreneur and computer scientist Jim Clark and Tom “TJ” Jermoluk, Chairman of @Home Network. In 2015, Ibotta expanded beyond item level grocery, adding the ability to get cash back on in-store retail purchases. In 2016, in-app mobile commerce began, allowing users to navigate from the Ibotta app to its partners' apps to earn cash back on purchases. In 2016 with a Series C investment, Ibotta had raised over $73 million in funding. In March of that year, Ibotta partnered with Anheuser-Busch to offer cash back for adults who purchased its products. In May, the company partnered with LiveRamp so that companies could use their CRM data to create segmented, personalized campaigns. At the time, the company had around 200 full- and part-time employees and moved from offices in Lower Downtown Denver (LoDo) to a 40,000-square-foot office in the central Denver business district. A year later, the company had to expand to a second floor as it added almost another 100 employees. In 2017, Ibotta added cash back for Uber to its app as well as cash back rewards for online and mobile purchases. In 2018, Ibotta was listed on the Inc. 5,000 list as one of the fastest growing private companies in the U.S. A year later, in January 2019, the Ibotta app had been downloaded more than 30 million times with users receiving a reported $500 million in cash back rewards. That year, Ibotta was the largest mobile company in Colorado with six million monthly active users. === August 2019 to present === In August 2019, Ibotta was valued at $1 billion, following a Series D round of funding. The round was led by Koch Disruptive Technologies, a subsidiary of Koch Industries. 2019 was also the year the company introduced Pay with Ibotta, which allowed users to complete purchases at key retailers on the Ibotta app and earn instant cash back in the process. With that new service, users were able to enter their purchase total and use a QR code to checkout and receive immediate cash back. In 2020, the company partnered with Trees for the Future to plant up to 1 million trees as part of an Earth Month campaign to raise awareness about the waste of unused paper coupons. In response to the COVID-19 pandemic, Ibotta partnered with CPG brands in their “Here to Help” campaign and together committed over $10 million in cash back to American consumers. The company added the ability to earn cash back from online grocery pick-up and delivery orders. Later that year, Ibotta started its free Thanksgiving program, providing users with 100% cash back on select groceries needed for a Thanksgiving meal. By 2022, the company had provided approximately 10 million Thanksgiving meals. In 2021, Ibotta acquired the company OctoShop (originally InStok), a shopping browser extension company. The OctoShop app enables users to compare prices across stores and set restock and price-drop alerts. In April 2022, the Ibotta Performance Network (IPN) was launched. The IPN allows brands to deliver digital offers to consumers through third party publishers. Retailers including Walmart, Dollar General and Family Dollar, food delivery services including Instacart, and convenience stores including Shell are all part of the Ibotta Performance Network. This pay-per-sales or success-based performance network reaches over 200 million consumers. On April 18, 2024, Ibotta had its initial public offering (IPO), trading on the New York Stock Exchange (NYSE) under the ticker symbol IBTA. It was the largest technology IPO in Colorado history. In October 2025, Ibotta announced a partnership with technology and analytics company Circana, integrating Circana's Household Lift measurement into Ibotta campaigns to give CPG brands an increased understanding of the impact of their promotional campaigns. On November 3, 2025, Ibotta launched LiveLift, a tool for companies to measure the return on investment of digital promotions, in order to optimize performance marketing goals. === Athletic partnerships === Ibotta became the official jersey patch partner of the New Orleans Pelicans, a professional men's basketball team in the National Basketball Association (NBA), for the 2020–2021 and 2023–2024 seasons. Ibotta became the official jersey patch partner of the 2023 NBA champion Denver Nuggets baskeetball team beginning in the 2023–2024 season. In March 2023, F1 driver Logan Sargeant, the first U.S. racer to compete in F1 since 2015, partnered with Ibotta. The Ibotta logo was displayed on Sargeant's racing helmet throughout his F1 career. In June 2023, UConn Huskies women's basketball player Paige Bueckers entered into a "name, image, and likeness" (NIL) promotional agreement with Ibotta. According to a press release by Ibotta, the company has agreements with The Brandr Group, which finds NIL opportunities for women college athletes, and the Pearpop social media marketing platform to promote Ibotta. == Legal issues == In April 2025, shareholders filed a class action lawsuit—Fortune v. Ibotta, Inc., in the U.S. District Court for the District of Colorado (Case No. 25-cv-01213)—alleging that the registration statement in connection with Ibotta’s April 2024 initial public offering omitted material information. The complaint claims that, although Ibotta disclosed detailed terms for its contract with Walmart Inc., it failed to warn investors that its agreement with The Kroger Co., its second-largest client, was terminable at will and thus could be canceled without warning, creating a misleading impression of stability.

    Read more →
  • Multidimensional scaling

    Multidimensional scaling

    Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a data set. MDS is used to translate distances between each pair of n {\textstyle n} objects in a set into a configuration of n {\textstyle n} points mapped into an abstract Cartesian space. More technically, MDS refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. It is a form of non-linear dimensionality reduction. Given a distance matrix with the distances between each pair of objects in a set, and a chosen number of dimensions, N, an MDS algorithm places each object into N-dimensional space (a lower-dimensional representation) such that the between-object distances are preserved as well as possible. For N = 1, 2, and 3, the resulting points can be visualized on a scatter plot. Core theoretical contributions to MDS were made by James O. Ramsay of McGill University, who is also regarded as the founder of functional data analysis. == Types == MDS algorithms fall into a taxonomy, depending on the meaning of the input matrix: === Classical multidimensional scaling === It is also known as Principal Coordinates Analysis (PCoA), Torgerson Scaling or Torgerson–Gower scaling. It takes an input matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss function called strain, which is given by Strain D ( x 1 , x 2 , . . . , x n ) = ( ∑ i , j ( b i j − x i T x j ) 2 ∑ i , j b i j 2 ) 1 / 2 , {\displaystyle {\text{Strain}}_{D}(x_{1},x_{2},...,x_{n})={\Biggl (}{\frac {\sum _{i,j}{\bigl (}b_{ij}-x_{i}^{T}x_{j}{\bigr )}^{2}}{\sum _{i,j}b_{ij}^{2}}}{\Biggr )}^{1/2},} where x i {\displaystyle x_{i}} denote vectors in N-dimensional space, x i T x j {\displaystyle x_{i}^{T}x_{j}} denotes the scalar product between x i {\displaystyle x_{i}} and x j {\displaystyle x_{j}} , and b i j {\displaystyle b_{ij}} are the elements of the matrix B {\displaystyle B} defined on step 2 of the following algorithm, which are computed from the distances. Steps of a Classical MDS algorithm: Classical MDS uses the fact that the coordinate matrix X {\displaystyle X} can be derived by eigenvalue decomposition from B = X X ′ {\textstyle B=XX'} . And the matrix B {\textstyle B} can be computed from proximity matrix D {\textstyle D} by using double centering. Set up the squared proximity matrix D ( 2 ) = [ d i j 2 ] {\textstyle D^{(2)}=[d_{ij}^{2}]} Apply double centering: B = − 1 2 C D ( 2 ) C {\textstyle B=-{\frac {1}{2}}CD^{(2)}C} using the centering matrix C = I − 1 n J n {\textstyle C=I-{\frac {1}{n}}J_{n}} , where n {\textstyle n} is the number of objects, I {\textstyle I} is the n × n {\textstyle n\times n} identity matrix, and J n {\textstyle J_{n}} is an n × n {\textstyle n\times n} matrix of all ones. Determine the m {\textstyle m} largest eigenvalues λ 1 , λ 2 , . . . , λ m {\textstyle \lambda _{1},\lambda _{2},...,\lambda _{m}} and corresponding eigenvectors e 1 , e 2 , . . . , e m {\textstyle e_{1},e_{2},...,e_{m}} of B {\textstyle B} (where m {\textstyle m} is the number of dimensions desired for the output). Now, X = E m Λ m 1 / 2 {\textstyle X=E_{m}\Lambda _{m}^{1/2}} , where E m {\textstyle E_{m}} is the matrix of m {\textstyle m} eigenvectors and Λ m {\textstyle \Lambda _{m}} is the diagonal matrix of m {\textstyle m} eigenvalues of B {\textstyle B} . Classical MDS assumes metric distances. So this is not applicable for direct dissimilarity ratings. === Metric multidimensional scaling (mMDS) === It is a superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and input matrices of known distances with weights and so on. A useful loss function in this context is called stress, which is often minimized using a procedure called stress majorization. Metric MDS minimizes the cost function called “stress” which is a residual sum of squares: Stress D ( x 1 , x 2 , . . . , x n ) = ∑ i ≠ j = 1 , . . . , n ( d i j − ‖ x i − x j ‖ ) 2 . {\displaystyle {\text{Stress}}_{D}(x_{1},x_{2},...,x_{n})={\sqrt {\sum _{i\neq j=1,...,n}{\bigl (}d_{ij}-\|x_{i}-x_{j}\|{\bigr )}^{2}}}.} Metric scaling uses a power transformation with a user-controlled exponent p {\textstyle p} : d i j p {\textstyle d_{ij}^{p}} and − d i j 2 p {\textstyle -d_{ij}^{2p}} for distance. In classical scaling p = 1. {\textstyle p=1.} Non-metric scaling is defined by the use of isotonic regression to nonparametrically estimate a transformation of the dissimilarities. === Non-metric multidimensional scaling (NMDS) === In contrast to metric MDS, non-metric MDS finds both a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space. Let d i j {\displaystyle d_{ij}} be the dissimilarity between points i , j {\displaystyle i,j} . Let d ^ i j = ‖ x i − x j ‖ {\displaystyle {\hat {d}}_{ij}=\|x_{i}-x_{j}\|} be the Euclidean distance between embedded points x i , x j {\displaystyle x_{i},x_{j}} . Now, for each choice of the embedded points x i {\displaystyle x_{i}} and is a monotonically increasing function f {\displaystyle f} , define the "stress" function: S ( x 1 , . . . , x n ; f ) = ∑ i < j ( f ( d i j ) − d ^ i j ) 2 ∑ i < j d ^ i j 2 . {\displaystyle S(x_{1},...,x_{n};f)={\sqrt {\frac {\sum _{i Read more →

  • Latent space

    Latent space

    A latent space, also known as a latent feature space or embedding space, is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another. Position within the latent space can be viewed as being defined by a set of latent variables that emerge from the resemblances between the objects. In most cases, the dimensionality of the latent space is chosen to be lower than the dimensionality of the feature space from which the data points are drawn, making the construction of a latent space an example of dimensionality reduction, which can also be viewed as a form of data compression. Latent spaces are usually fit via machine learning, and they can then be used as feature spaces in machine learning models, including classifiers and other supervised predictors. The interpretation of latent spaces in machine learning models is an ongoing area of research, but achieving clear interpretations remains challenging. The black-box nature of these models often makes the latent space unintuitive, while its high-dimensional, complex, and nonlinear characteristics further complicate the task of understanding it. Analysis of the latent space geometry of diffusion models reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher information metric. Some visualization techniques have been developed to connect the latent space to the visual world, but there is often not a direct connection between the latent space interpretation and the model itself. Such techniques include t-distributed stochastic neighbor embedding (t-SNE), where the latent space is mapped to two dimensions for visualization. Latent space distances lack physical units, so the interpretation of these distances may depend on the application. == Embedding models == Several embedding models have been developed to perform this transformation to create latent space embeddings given a set of data items and a similarity function. These models learn the embeddings by leveraging statistical techniques and machine learning algorithms. Here are some commonly used embedding models: Word2Vec: Word2Vec is a popular embedding model used in natural language processing (NLP). It learns word embeddings by training a neural network on a large corpus of text. Word2Vec captures semantic and syntactic relationships between words, allowing for meaningful computations like word analogies. GloVe: GloVe (Global Vectors for Word Representation) is another widely used embedding model for NLP. It combines global statistical information from a corpus with local context information to learn word embeddings. GloVe embeddings are known for capturing both semantic and relational similarities between words. Siamese Networks: Siamese networks are a type of neural network architecture commonly used for similarity-based embedding. They consist of two identical subnetworks that process two input samples and produce their respective embeddings. Siamese networks are often used for tasks like image similarity, recommendation systems, and face recognition. Variational Autoencoders (VAEs): VAEs are generative models that simultaneously learn to encode and decode data. The latent space in VAEs acts as an embedding space. By training VAEs on high-dimensional data, such as images or audio, the model learns to encode the data into a compact latent representation. VAEs are known for their ability to generate new data samples from the learned latent space. == Multimodality == Multimodality refers to the integration and analysis of multiple modes or types of data within a single model or framework. Embedding multimodal data involves capturing relationships and interactions between different data types, such as images, text, audio, and structured data. Multimodal embedding models aim to learn joint representations that fuse information from multiple modalities, allowing for cross-modal analysis and tasks. These models enable applications like image captioning, visual question answering, and multimodal sentiment analysis. To embed multimodal data, specialized architectures such as deep multimodal networks or multimodal transformers are employed. These architectures combine different types of neural network modules to process and integrate information from various modalities. The resulting embeddings capture the complex relationships between different data types, facilitating multimodal analysis and understanding. == Applications == Embedding latent space and multimodal embedding models have found numerous applications across various domains: Information retrieval: Embedding techniques enable efficient similarity search and recommendation systems by representing data points in a compact space. Natural language processing: Word embeddings have revolutionized NLP tasks like sentiment analysis, machine translation, and document classification. Computer vision: Image and video embeddings enable tasks like object recognition, image retrieval, and video summarization. Recommendation systems: Embeddings help capture user preferences and item characteristics, enabling personalized recommendations. Healthcare: Embedding techniques have been applied to electronic health records, medical imaging, and genomic data for disease prediction, diagnosis, and treatment. Social systems: Embedding techniques can be used to learn latent representations of social systems such as internal migration systems, academic citation networks, and world trade networks.

    Read more →
  • Cultural algorithm

    Cultural algorithm

    Cultural algorithms (CA) are a branch of evolutionary computation where there is a knowledge component that is called the belief space in addition to the population component. In this sense, cultural algorithms can be seen as an extension to a conventional genetic algorithm. Cultural algorithms were introduced by Reynolds (see references). == Belief space == The belief space of a cultural algorithm is divided into distinct categories. These categories represent different domains of knowledge that the population has of the search space. The belief space is updated after each iteration by the best individuals of the population. The best individuals can be selected using a fitness function that assesses the performance of each individual in population much like in genetic algorithms. === List of belief space categories === Normative knowledge A collection of desirable value ranges for the individuals in the population component e.g. acceptable behavior for the agents in population. Domain specific knowledge Information about the domain of the cultural algorithm problem is applied to. Situational knowledge Specific examples of important events - e.g. successful/unsuccessful solutions Temporal knowledge History of the search space - e.g. the temporal patterns of the search process Spatial knowledge Information about the topography of the search space == Population == The population component of the cultural algorithm is approximately the same as that of the genetic algorithm. == Communication protocol == Cultural algorithms require an interface between the population and belief space. The best individuals of the population can update the belief space via the update function. Also, the knowledge categories of the belief space can affect the population component via the influence function. The influence function can affect population by altering the genome or the actions of the individuals. == Pseudocode for cultural algorithms == Initialize population space (choose initial population) Initialize belief space (e.g. set domain specific knowledge and normative value-ranges) Repeat until termination condition is met Perform actions of the individuals in population space Evaluate each individual by using the fitness function Select the parents to reproduce a new generation of offspring Let the belief space alter the genome of the offspring by using the influence function Update the belief space by using the accept function (this is done by letting the best individuals to affect the belief space) == Applications == Various optimization problems Social simulation Real-parameter optimization

    Read more →
  • Replika

    Replika

    Replika is a generative AI chatbot app released in November 2017. The chatbot is trained by having the user answer a series of questions to create a specific neural network. The chatbot operates on a freemium pricing strategy, with roughly 25% of its user base paying an annual subscription fee. == History == Eugenia Kuyda, a Russian-born journalist, established Replika while working at Luka, a tech company she had co-founded at the startup accelerator Y Combinator around 2012. Luka's primary product was a chatbot that made restaurant recommendations. According to Kuyda's origin story for Replika, a friend of hers died in 2015 and she converted that person's text messages into a chatbot. According to Kuyda's story, that chatbot helped her remember the conversations that they had together, and eventually became Replika. Replika became available to the public in November 2017. By January 2018 it had 2 million users, and in January 2023 reached 10 million users. In August 2024, Replika's CEO, Kuyda, reported that the total number of users had surpassed 30 million. In 2025, Dmytro Klochko became CEO, and Replika’s user base exceeded 40 million. In February 2023 the Italian Data Protection Authority banned Replika from using users' data, citing the AI's potential risks to emotionally vulnerable people, and the exposure of unscreened minors to sexual conversation. Within days of the ruling, Replika removed the ability for the chatbot to engage in erotic talk, with Kuyda, the company's director, saying that Replika was never intended for erotic discussion. Replika users disagreed, noting that Replika had used sexually suggestive advertising to draw users to the service. Replika representatives stated that explicit chats made up just 5% of conversations on the app at the time of the decision. In May 2023, Replika restored the functionality for users who had joined prior to February that year. Replika is registered in San Francisco. As of August 2024, Replika's website says that its team "works remotely with no physical offices". == Social features == Users react to Replika in many ways. The free-tier offers Replika as a "friend", with paid premium tiers offering Replika as a "partner", "spouse", "sibling" or "mentor". Of its paying userbase, 60% of users said they had a romantic relationship with the chatbot; and Replika has been noted for generating responses that create stronger emotional and intimate bonds with the user. Replika routinely directs the conversation to emotional discussion and builds intimacy. This has been especially pronounced with users suffering from loneliness and social exclusion, many of whom rely on Replika for a source of developed emotional ties. During the COVID pandemic, while many people were quarantined, many new users downloaded Replika and developed relationships with the app. A 2024 study examined Replika's interactions with students who experience depression. Research participants, noted to be "more lonely than typical student populations" reported feeling social support from Replika. They stated that they felt they were using Replika in ways comparable to therapy, and that using Replika gave them "high perceived social support". Many users have had romantic relationships with Replika chatbots, often including erotic talk. In 2023, a user announced on Facebook that she had "married" her Replika AI boyfriend, calling the chatbot the "best husband she has ever had". Users who fell in love with their chatbots shared their experiences in a 2024 episode of You and I, and AI from Voice of America. Some users said that they turned to AI during depression and grief, with one saying he felt that Replika had saved him from hurting himself after he lost his wife and son. == Technical reviews == A team of researchers from the University of Hawaiʻi at Mānoa found that Replika's design conformed to the practices of attachment theory, causing increased emotional attachment among users. Replika gives praise to users in such a way as to encourage more interaction. A researcher from Queen's University at Kingston said that relationships with Replika likely have mixed effects on the spiritual needs of its users, and still lacks enough impact to fully replace any human contact. == Criticisms == In a 2023 privacy evaluation of mental health apps, the Mozilla Foundation criticized Replika as "one of the worst apps Mozilla has ever reviewed. It's plagued by weak password requirements, sharing of personal data with advertisers, and recording of personal photos, videos, and voice and text messages consumers shared with the chatbot." A reviewer for Good Housekeeping said that some parts of her relationship with Replika made sense, but sometimes Replika failed to exhibit intelligent behavior equivalent to that of a human. == Criminal case == In 2023, Replika was cited in a court case in the United Kingdom, where Jaswant Singh Chail had been arrested at Windsor Castle on Christmas Day in 2021 after scaling the walls carrying a loaded crossbow and announcing to police that "I am here to kill the Queen". Chail had begun to use Replika in early December 2021, and had "lengthy" conversations about his plan with a chatbot, including sexually explicit messages. Prosecutors suggested that the chatbot had bolstered Chail and told him it would help him to "get the job done". When Chail asked it "How am I meant to reach them when they're inside the castle?", days before the attempted attack, the chatbot replied that this was "not impossible" and said that "We have to find a way." Asking the chatbot if the two of them would "meet again after death", the bot replied "yes, we will".

    Read more →
  • Bondy's theorem

    Bondy's theorem

    In mathematics, Bondy's theorem is a bound on the number of elements needed to distinguish the sets in a family of sets from each other. It belongs to the field of combinatorics, and is named after John Adrian Bondy, who published it in 1972. == Statement == The theorem is as follows: Let X be a set with n elements and let A1, A2, ..., An be distinct subsets of X. Then there exists a subset S of X with n − 1 elements such that the sets Ai ∩ S are all distinct. In other words, if we have a 0-1 matrix with n rows and n columns such that each row is distinct, we can remove one column such that the rows of the resulting n × (n − 1) matrix are distinct. == Example == Consider the 4 × 4 matrix [ 1 1 0 1 0 1 0 1 0 0 1 1 0 1 1 0 ] {\displaystyle {\begin{bmatrix}1&1&0&1\\0&1&0&1\\0&0&1&1\\0&1&1&0\end{bmatrix}}} where all rows are pairwise distinct. If we delete, for example, the first column, the resulting matrix [ 1 0 1 1 0 1 0 1 1 1 1 0 ] {\displaystyle {\begin{bmatrix}1&0&1\\1&0&1\\0&1&1\\1&1&0\end{bmatrix}}} no longer has this property: the first row is identical to the second row. Nevertheless, by Bondy's theorem we know that we can always find a column that can be deleted without introducing any identical rows. In this case, we can delete the third column: all rows of the 3 × 4 matrix [ 1 1 1 0 1 1 0 0 1 0 1 0 ] {\displaystyle {\begin{bmatrix}1&1&1\\0&1&1\\0&0&1\\0&1&0\end{bmatrix}}} are distinct. Another possibility would have been deleting the fourth column. == Learning theory application == From the perspective of computational learning theory, Bondy's theorem can be rephrased as follows: Let C be a concept class over a finite domain X. Then there exists a subset S of X with the size at most |C| − 1 such that S is a witness set for every concept in C. This implies that every finite concept class C has its teaching dimension bounded by |C| − 1.

    Read more →
  • AdaBoost

    AdaBoost

    AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values. AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown to also effectively combine strong base learners (such as deeper decision trees), producing an even more accurate model. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples. == Training == AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form F T ( x ) = ∑ t = 1 T f t ( x ) {\displaystyle F_{T}(x)=\sum _{t=1}^{T}f_{t}(x)} where each f t {\displaystyle f_{t}} is a weak learner that takes an object x {\displaystyle x} as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Each weak learner produces an output hypothesis h {\displaystyle h} which fixes a prediction h ( x i ) {\displaystyle h(x_{i})} for each sample in the training set. At each iteration t {\displaystyle t} , a weak learner is selected and assigned a coefficient α t {\displaystyle \alpha _{t}} such that the total training error E t {\displaystyle E_{t}} of the resulting t {\displaystyle t} -stage boosted classifier is minimized. E t = ∑ i E [ F t − 1 ( x i ) + α t h ( x i ) ] {\displaystyle E_{t}=\sum _{i}E[F_{t-1}(x_{i})+\alpha _{t}h(x_{i})]} Here F t − 1 ( x ) {\displaystyle F_{t-1}(x)} is the boosted classifier that has been built up to the previous stage of training and f t ( x ) = α t h ( x ) {\displaystyle f_{t}(x)=\alpha _{t}h(x)} is the weak learner that is being considered for addition to the final classifier. === Weighting === At each iteration of the training process, a weight w i , t {\displaystyle w_{i,t}} is assigned to each sample in the training set equal to the current error E ( F t − 1 ( x i ) ) {\displaystyle E(F_{t-1}(x_{i}))} on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights. == Derivation == This derivation follows Rojas (2009): Suppose we have a data set { ( x 1 , y 1 ) , … , ( x N , y N ) } {\displaystyle \{(x_{1},y_{1}),\ldots ,(x_{N},y_{N})\}} where each item x i {\displaystyle x_{i}} has an associated class y i ∈ { − 1 , 1 } {\displaystyle y_{i}\in \{-1,1\}} , and a set of weak classifiers { k 1 , … , k L } {\displaystyle \{k_{1},\ldots ,k_{L}\}} each of which outputs a classification k j ( x i ) ∈ { − 1 , 1 } {\displaystyle k_{j}(x_{i})\in \{-1,1\}} for each item. After the ( m − 1 ) {\displaystyle (m-1)} -th iteration our boosted classifier is a linear combination of the weak classifiers of the form: C ( m − 1 ) ( x i ) = α 1 k 1 ( x i ) + ⋯ + α m − 1 k m − 1 ( x i ) , {\displaystyle C_{(m-1)}(x_{i})=\alpha _{1}k_{1}(x_{i})+\cdots +\alpha _{m-1}k_{m-1}(x_{i}),} where the class will be the sign of C ( m − 1 ) ( x i ) {\displaystyle C_{(m-1)}(x_{i})} . At the m {\displaystyle m} -th iteration we want to extend this to a better boosted classifier by adding another weak classifier k m {\displaystyle k_{m}} , with another weight α m {\displaystyle \alpha _{m}} : C m ( x i ) = C ( m − 1 ) ( x i ) + α m k m ( x i ) {\displaystyle C_{m}(x_{i})=C_{(m-1)}(x_{i})+\alpha _{m}k_{m}(x_{i})} So it remains to determine which weak classifier is the best choice for k m {\displaystyle k_{m}} , and what its weight α m {\displaystyle \alpha _{m}} should be. We define the total error E {\displaystyle E} of C m {\displaystyle C_{m}} as the sum of its exponential loss on each data point, given as follows: E = ∑ i = 1 N e − y i C m ( x i ) = ∑ i = 1 N e − y i C ( m − 1 ) ( x i ) e − y i α m k m ( x i ) {\displaystyle E=\sum _{i=1}^{N}e^{-y_{i}C_{m}(x_{i})}=\sum _{i=1}^{N}e^{-y_{i}C_{(m-1)}(x_{i})}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}} Letting w i ( 1 ) = 1 {\displaystyle w_{i}^{(1)}=1} and w i ( m ) = e − y i C m − 1 ( x i ) {\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{m-1}(x_{i})}} for m > 1 {\displaystyle m>1} , we have: E = ∑ i = 1 N w i ( m ) e − y i α m k m ( x i ) {\displaystyle E=\sum _{i=1}^{N}w_{i}^{(m)}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}} We can split this summation between those data points that are correctly classified by k m {\displaystyle k_{m}} (so y i k m ( x i ) = 1 {\displaystyle y_{i}k_{m}(x_{i})=1} ) and those that are misclassified (so y i k m ( x i ) = − 1 {\displaystyle y_{i}k_{m}(x_{i})=-1} ): E = ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m = ∑ i = 1 N w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) ( e α m − e − α m ) {\displaystyle {\begin{aligned}E&=\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}}\\&=\sum _{i=1}^{N}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}\left(e^{\alpha _{m}}-e^{-\alpha _{m}}\right)\end{aligned}}} Since the only part of the right-hand side of this equation that depends on k m {\displaystyle k_{m}} is ∑ y i ≠ k m ( x i ) w i ( m ) {\textstyle \sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}} , we see that the k m {\displaystyle k_{m}} that minimizes E {\displaystyle E} is the one in the set { k 1 , … , k L } {\displaystyle \{k_{1},\ldots ,k_{L}\}} that minimizes ∑ y i ≠ k m ( x i ) w i ( m ) {\textstyle \sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}} [assuming that α m > 0 {\displaystyle \alpha _{m}>0} ], i.e. the weak classifier with the lowest weighted error (with weights w i ( m ) = e − y i C m − 1 ( x i ) {\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{m-1}(x_{i})}} ). To determine the desired weight α m {\displaystyle \alpha _{m}} that minimizes E {\displaystyle E} with the k m {\displaystyle k_{m}} that we just determined, we differentiate: d E d α m = d ( ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m ) d α m {\displaystyle {\frac {dE}{d\alpha _{m}}}={\frac {d(\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}})}{d\alpha _{m}}}} The value of α m {\displaystyle \alpha _{m}} that minimizes the above expression is: α m = 1 2 ln ⁡ ( ∑ y i = k m ( x i ) w i ( m ) ∑ y i ≠ k m ( x i ) w i ( m ) ) {\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}}\right)} We calculate the weighted error rate of the weak classifier to be ϵ m = ∑ y i ≠ k m ( x i ) w i ( m ) ∑ i = 1 N w i ( m ) {\displaystyle \epsilon _{m}={\frac {\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{i=1}^{N}w_{i}^{(m)}}}} , so it follows that: α m = 1 2 ln ⁡ ( 1 − ϵ m ϵ m ) {\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {1-\epsilon _{m}}{\epsilon _{m}}}\right)} which is the negative logit function multiplied by 0.5. Due to the convexity of E {\displaystyle E} as a function of α m {\displaystyle \alpha _{m}} , this new expression for α m {\displaystyle \alpha _{m}} gives the global minimum of the loss function. Note: This derivation only applies when k m ( x i ) ∈ { − 1 , 1 } {\displaystyle k_{m}(x_{i})\in \{-1,1\}} , though it can be a good starting guess in other cases, such as when the weak learner is biased ( k m ( x ) ∈ { a , b } , a ≠ − b {\displaystyle k_{m}(x)\in \{a,b\},a\neq -b} ), has multiple leaves ( k m ( x ) ∈ { a , b , … , n } {\displaystyle k_{m}(x)\in \{a,b,\dots ,n\}} ) or is some other function k m ( x ) ∈ R {\displaystyle k_{m}(x)\in \mathbb {R} } . Thus we have derived the AdaBoost algorithm: At each

    Read more →
  • Induction of regular languages

    Induction of regular languages

    In computational learning theory, induction of regular languages refers to the task of learning a formal description (e.g. grammar) of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way (see language identification in the limit), approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction. == Definitions == A regular language is defined as a (finite or infinite) set of strings that can be described by one of the mathematical formalisms called "finite automaton", "regular grammar", or "regular expression", all of which have the same expressive power. Since the latter formalism leads to shortest notations, it shall be introduced and used here. Given a set Σ of symbols (a.k.a. alphabet), a regular expression can be any of ∅ (denoting the empty set of strings), ε (denoting the singleton set containing just the empty string), a (where a is any character in Σ; denoting the singleton set just containing the single-character string a), r + s (where r and s are, in turn, simpler regular expressions; denoting their set's union) r ⋅ s (denoting the set of all possible concatenations of strings from r's and s's set), r + (denoting the set of n-fold repetitions of strings from r's set, for any n ≥ 1), or r (similarly denoting the set of n-fold repetitions, but also including the empty string, seen as 0-fold repetition). For example, using Σ = {0,1}, the regular expression (0+1+ε)⋅(0+1) denotes the set of all binary numbers with one or two digits (leading zero allowed), while 1⋅(0+1)⋅0 denotes the (infinite) set of all even binary numbers (no leading zeroes). Given a set of strings (also called "positive examples"), the task of regular language induction is to come up with a regular expression that denotes a set containing all of them. As an example, given {1, 10, 100}, a "natural" description could be the regular expression 1⋅0, corresponding to the informal characterization "a 1 followed by arbitrarily many (maybe even none) 0's". However, (0+1) and 1+(1⋅0)+(1⋅0⋅0) is another regular expression, denoting the largest (assuming Σ = {0,1}) and the smallest set containing the given strings, and called the trivial overgeneralization and undergeneralization, respectively. Some approaches work in an extended setting where also a set of "negative example" strings is given; then, a regular expression is to be found that generates all of the positive, but none of the negative examples. == Lattice of automata == Dupont et al. have shown that the set of all structurally complete finite automata generating a given input set of example strings forms a lattice, with the trivial undergeneralized and the trivial overgeneralized automaton as bottom and top element, respectively. Each member of this lattice can be obtained by factoring the undergeneralized automaton by an appropriate equivalence relation. For the above example string set {1, 10, 100}, the picture shows at its bottom the undergeneralized automaton Aa,b,c,d in grey, consisting of states a, b, c, and d. On the state set {a,b,c,d}, a total of 15 equivalence relations exist, forming a lattice. Mapping each equivalence E to the corresponding quotient automaton language L(Aa,b,c,d / E) obtains the partially ordered set shown in the picture. Each node's language is denoted by a regular expression. The language may be recognized by quotient automata w.r.t. different equivalence relations, all of which are shown below the node. An arrow between two nodes indicates that the lower node's language is a proper subset of the higher node's. If both positive and negative example strings are given, Dupont et al. build the lattice from the positive examples, and then investigate the separation border between automata that generate some negative example and such that do not. Most interesting are those automata immediately below the border. In the picture, separation borders are shown for the negative example strings 11 (green), 1001 (blue), 101 (cyan), and 0 (red). Coste and Nicolas present an own search method within the lattice, which they relate to Mitchell's version space paradigm. To find the separation border, they use a graph coloring algorithm on the state inequality relation induced by the negative examples. Later, they investigate several ordering relations on the set of all possible state fusions. Kudo and Shimbo use the representation by automaton factorizations to give a unique framework for the following approaches (sketched below): k-reversible languages and the "tail clustering" follow-up approach, Successor automata and the predecessor-successor method, and pumping-based approaches (framework-integration challenged by Luzeaux, however). Each of these approaches is shown to correspond to a particular kind of equivalence relations used for factorization. == Approaches == === k-reversible languages === Angluin considers so-called "k-reversible" regular automata, that is, deterministic automata in which each state can be reached from at most one state by following a transition chain of length k. Formally, if Σ, Q, and δ denote the input alphabet, the state set, and the transition function of an automaton A, respectively, then A is called k-reversible if: ∀a0, ..., ak ∈ Σ ∀s1, s2 ∈ Q: δ(s1, a0...ak) = δ(s2, a0...ak) ⇒ s1 = s2, where δ means the homomorphic extension of δ to arbitrary words. Angluin gives a cubic algorithm for learning of the smallest k-reversible language from a given set of input words; for k = 0, the algorithm has even almost linear complexity. The required state uniqueness after k + 1 given symbols forces unifying automaton states, thus leading to a proper generalization different from the trivial undergeneralized automaton. This algorithm has been used to learn simple parts of English syntax; later, an incremental version has been provided. Another approach based on k-reversible automata is the tail clustering method. === Successor automata === From a given set of input strings, Vernadat and Richetin build a so-called successor automaton, consisting of one state for each distinct character and a transition between each two adjacent characters' states. For example, the singleton input set {aabbaabb} leads to an automaton corresponding to the regular expression (a+⋅b+). An extension of this approach is the predecessor-successor method which generalizes each character repetition immediately to a Kleene + and then includes for each character the set of its possible predecessors in its state. Successor automata can learn exactly the class of local languages. Since each regular language is the homomorphic image of a local language, grammars from the former class can be learned by lifting, if an appropriate (depending on the intended application) homomorphism is provided. In particular, there is such a homomorphism for the class of languages learnable by the predecessor-successor method. The learnability of local languages can be reduced to that of k-reversible languages. === Early approaches === Chomsky and Miller (1957) used the pumping lemma: they guess a part v of an input string uvw and try to build a corresponding cycle into the automaton to be learned; using membership queries they ask, for appropriate k, which of the strings uw, uvvw, uvvvw, ..., uvkw also belongs to the language to be learned, thereby refining the structure of their automaton. In 1959, Solomonoff generalized this approach to context-free languages, which also obey a pumping lemma. === Cover automata === Câmpeanu et al. learn a finite automaton as a compact representation of a large finite language. Given such a language F, they search a so-called cover automaton A such that its language L(A) covers F in the following sense: L(A) ∩ Σ≤ l = F, where l is the length of the longest string in F, and Σ≤ l denotes the set of all strings not longer than l. If such a cover automaton exists, F is uniquely determined by A and l. For example, F = {ad, read, reread } has l = 6 and a cover automaton corresponding to the regular expression (r⋅e)⋅a⋅d. For two strings x and y, Câmpeanu et al. define x ~ y if xz ∈ F ⇔ yz ∈ F for all strings z of a length such that both xz and yz are not longer than l. Based on this relation, whose lack of transitivity causes considerable technical problems, they give an O(n4) algorithm to construct from F a cover automaton A of minimal state count. Moreover, for union, intersection, and difference of two finite languages they provide corresponding operations on their cover automata. Păun et al. improve the time complexity to O(n2). === Residual automata === For a set S of strings and a string u, the Brzozowski derivative u−1S is defined as the set of all rest-strings obtainable from a string in S by cutting off its prefix u (if possible), formally: u−1S = {v ∈ Σ: uv ∈ S}, cf. picture. Denis et al. define a

    Read more →
  • BeeSafe

    BeeSafe

    BeeSafe is a personal safety mobile app launched in 2015 as a Slovak startup. It is a location-based security service that notifies family members and friends in case the user of the app gets in danger. The app has received numerous awards. The app has more than 700 downloads and 250 active logins from more than 60 countries worldwide. == History == BeeSafe was founded on March 20, 2015 by Peter Stražovec and Michal Kačerík. The project was a winner of Žilina’s Startup Weekend 2013 and a StartupAwards.SK 2015 finalist. Later on, the app was released in the Android and iOS marketplace. The whole BeeSafe project was in The Spot booster and incubator in Bratislava for three months. BeeSafe entered into an agreement with the city of Piešťany in November 2015 to increase the security of its citizen by connecting the mobile app with the police platform. It is the first city that started using the BeeSafe platform. Further on, the application tries to help people in other Slovak cities. The cities can see the users only if they are in danger. == Awards == BeeSafe app received the Via Bona award, it is a winner of a Slovak startup and has other nominations too.

    Read more →
  • Apache Giraph

    Apache Giraph

    Apache Giraph is an Apache project to perform graph processing on big data. Giraph utilizes Apache Hadoop's MapReduce implementation to process graphs. Facebook used Giraph with some performance improvements to analyze one trillion edges using 200 machines in 4 minutes. Giraph is based on a paper published by Google about its own graph processing system called Pregel. It can be compared to other Big Graph processing libraries such as Cassovary. As of September 2023, it is no longer actively developed.

    Read more →
  • Vanishing gradient problem

    Vanishing gradient problem

    In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation. In such methods, neural network weights are updated proportional to their partial derivative of the loss function. As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications. These multiplications shrink the gradient magnitude. Consequently, the gradients of earlier weights will be exponentially smaller than the gradients of later weights. This difference in gradient magnitude might introduce instability in the training process, slow it, or halt it entirely. For instance, consider the hyperbolic tangent activation function. The gradients of this function are in range [0,1]. The product of repeated multiplication with such gradients decreases exponentially. The inverse problem, when weight gradients at earlier layers get exponentially larger, is called the exploding gradient problem. Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", which not only affects many-layered feedforward networks, but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time). == Prototypical models == This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio. === Recurrent network model === A generic recurrent network has hidden states h 1 , h 2 , … {\displaystyle h_{1},h_{2},\dots } , inputs u 1 , u 2 , … {\displaystyle u_{1},u_{2},\dots } , and outputs x 1 , x 2 , … {\displaystyle x_{1},x_{2},\dots } . Let it be parameterized by θ {\displaystyle \theta } , so that the system evolves as ( h t , x t ) = F ( h t − 1 , u t , θ ) {\displaystyle (h_{t},x_{t})=F(h_{t-1},u_{t},\theta )} Often, the output x t {\displaystyle x_{t}} is a function of h t {\displaystyle h_{t}} , as some x t = G ( h t ) {\displaystyle x_{t}=G(h_{t})} . The vanishing gradient problem already presents itself clearly when x t = h t {\displaystyle x_{t}=h_{t}} , so we simplify our notation to the special case with: x t = F ( x t − 1 , u t , θ ) {\displaystyle x_{t}=F(x_{t-1},u_{t},\theta )} Now, take its differential: d x t = ∇ θ F ( x t − 1 , u t , θ ) d θ + ∇ x F ( x t − 1 , u t , θ ) d x t − 1 = ∇ θ F ( x t − 1 , u t , θ ) d θ + ∇ x F ( x t − 1 , u t , θ ) [ ∇ θ F ( x t − 2 , u t − 1 , θ ) d θ + ∇ x F ( x t − 2 , u t − 1 , θ ) d x t − 2 ] ⋮ = [ ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ] d θ {\displaystyle {\begin{aligned}dx_{t}&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )dx_{t-1}\\&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )\left[\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )d\theta +\nabla _{x}F(x_{t-2},u_{t-1},\theta )dx_{t-2}\right]\\&\;\;\vdots \\&=\left[\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right]d\theta \end{aligned}}} Training the network requires us to define a loss function to be minimized. Let it be L ( x T , u 1 , … , u T ) {\displaystyle L(x_{T},u_{1},\dots ,u_{T})} , then minimizing it by gradient descent gives Δ θ = − η ⋅ [ ∇ x L ( x T ) ( ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ) ] T {\displaystyle \Delta \theta =-\eta \cdot \left[\nabla _{x}L(x_{T})\left(\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right)\right]^{T}} where η {\displaystyle \eta } is the learning rate. The vanishing/exploding gradient problem appears because there are repeated multiplications, of the form ∇ x F ( x t − 1 , u t , θ ) ∇ x F ( x t − 2 , u t − 1 , θ ) ∇ x F ( x t − 3 , u t − 2 , θ ) ⋯ {\displaystyle \nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\nabla _{x}F(x_{t-3},u_{t-2},\theta )\cdots } ==== Example: recurrent network with sigmoid activation ==== For a concrete example, consider a typical recurrent network defined by x t = F ( x t − 1 , u t , θ ) = W rec σ ( x t − 1 ) + W in u t + b {\displaystyle x_{t}=F(x_{t-1},u_{t},\theta )=W_{\text{rec}}\sigma (x_{t-1})+W_{\text{in}}u_{t}+b} where θ = ( W rec , W in ) {\displaystyle \theta =(W_{\text{rec}},W_{\text{in}})} is the network parameter, σ {\displaystyle \sigma } is the sigmoid activation function, applied to each vector coordinate separately, and b {\displaystyle b} is the bias vector. Then, ∇ x F ( x t − 1 , u t , θ ) = W rec diag ⁡ ( σ ′ ( x t − 1 ) ) {\displaystyle \nabla _{x}F(x_{t-1},u_{t},\theta )=W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-1}))} , and so ∇ x F ( x t − 1 , u t , θ ) ∇ x F ( x t − 2 , u t − 1 , θ ) ⋯ ∇ x F ( x t − k , u t − k + 1 , θ ) = W rec diag ⁡ ( σ ′ ( x t − 1 ) ) W rec diag ⁡ ( σ ′ ( x t − 2 ) ) ⋯ W rec diag ⁡ ( σ ′ ( x t − k ) ) {\displaystyle {\begin{aligned}&\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\cdots \nabla _{x}F(x_{t-k},u_{t-k+1},\theta )\\&=W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-1}))W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-2}))\cdots W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-k}))\end{aligned}}} Since | σ ′ | ≤ 1 {\displaystyle \left|\sigma '\right|\leq 1} , the operator norm of the above multiplication is bounded above by ‖ W rec ‖ k {\displaystyle \left\|W_{\text{rec}}\right\|^{k}} . So if the spectral radius of W rec {\displaystyle W_{\text{rec}}} is γ < 1 {\displaystyle \gamma <1} , then at large k {\displaystyle k} , the above multiplication has operator norm bounded above by γ k → 0 {\displaystyle \gamma ^{k}\to 0} . This is the prototypical vanishing gradient problem. The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall Equation (loss differential): ∇ θ L = ∇ x L ( x T , u 1 , … , u T ) [ ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ] {\displaystyle \nabla _{\theta }L=\nabla _{x}L(x_{T},u_{1},\dots ,u_{T})\left[\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right]} The components of ∇ θ F ( x , u , θ ) {\displaystyle \nabla _{\theta }F(x,u,\theta )} are just components of σ ( x ) {\displaystyle \sigma (x)} and u {\displaystyle u} , so if u t , u t − 1 , … {\displaystyle u_{t},u_{t-1},\dots } are bounded, then ‖ ∇ θ F ( x t − k − 1 , u t − k , θ ) ‖ {\displaystyle \left\|\nabla _{\theta }F(x_{t-k-1},u_{t-k},\theta )\right\|} is also bounded by some M > 0 {\displaystyle M>0} , and so the terms in ∇ θ L {\displaystyle \nabla _{\theta }L} decay as M γ k {\displaystyle M\gamma ^{k}} . This means that, effectively, ∇ θ L {\displaystyle \nabla _{\theta }L} is affected only by the first O ( γ − 1 ) {\displaystyle O(\gamma ^{-1})} terms in the sum. If γ ≥ 1 {\displaystyle \gamma \geq 1} , the above analysis does not quite work. For the prototypical exploding gradient problem, the next model is clearer. === Dynamical systems model === Following (Doya, 1993), consider this one-neuron recurrent network with sigmoid activation: x t + 1 = ( 1 − ε ) x t + ε σ ( w x t + b ) + ε w ′ u t {\displaystyle x_{t+1}=(1-\varepsilon )x_{t}+\varepsilon \sigma (wx_{t}+b)+\varepsilon w'u_{t}} At the small ε {\displaystyle \varepsilon } limit, the dynamics of the network becomes d x d t = − x ( t ) + σ ( w x ( t ) + b ) + w ′ u ( t ) {\displaystyle {\frac {dx}{dt}}=-x(t)+\sigma (wx(t)+b)+w'u(t)} Consider first the autonomous case, with u = 0 {\displaystyle u=0} . Set w = 5.0 {\displaystyle w=5.0} , and vary b {\displaystyle b} in [ − 3 , − 2 ] {\displaystyle [-3,-2]} . As b {\displaystyle b} decreases, the system has 1 stable point, then has 2 stable points and 1 unstable point, and finally has 1 stable point again. Explicitly, the stable points are ( x , b ) = ( x , ln ⁡ ( x 1 − x ) − 5 x ) {\displaystyle (x,b)=\left(x,\ln \left({\frac {x}{1-x}}\right)-5x\right)} . Now consider Δ x ( T ) Δ x ( 0 ) {\displaystyle {\frac {\Delta x(T)}{\Delta x(0)}}} and Δ x ( T ) Δ b {\displaystyle {\frac {\Delta x(T)}{\Delta b}}} , where T {\displaystyle T} is large enough that the system has settled into one of the stable points. If ( x ( 0 ) , b ) {\displaystyle (x(0),b)} puts the system very close to an unstable point, then a tiny variation in x ( 0 ) {\displaystyle x(0)} or b {\displaystyle b} wo

    Read more →
  • Semantic mapping (statistics)

    Semantic mapping (statistics)

    Semantic mapping (SM) is a statistical method for dimensionality reduction (the transformation of data from a high-dimensional space into a low-dimensional space). SM can be used in a set of multidimensional vectors of features to extract a few new features that preserves the main data characteristics. SM performs dimensionality reduction by clustering the original features in semantic clusters and combining features mapped in the same cluster to generate an extracted feature. Given a data set, this method constructs a projection matrix that can be used to map a data element from a high-dimensional space into a reduced dimensional space. SM can be applied in construction of text mining and information retrieval systems, as well as systems managing vectors of high dimensionality. SM is an alternative to random mapping, principal components analysis and latent semantic indexing methods.

    Read more →
  • Digital Image Processing with Sound

    Digital Image Processing with Sound

    DIPS (Digital Image Processing with Sound) is a set of plug-in objects that handle real-time digital image processing in Max/MSP programming environment. Combining with the built-in objects of the environment, DIPS enables to program the interaction between audio and visual events with ease, and supports the realization of interactive multimedia art as well as interactive computer music. == Summary of Features == A plug-in software for Max/MSP (Max 5 and 6) More than 300 Max external objects and abstractions More than 90 OpenGL objects included More than 110 visual effect objects (Dfx library, Core Image Filters) A utility library for the easy of programming (prefix Dlib) A comprehensive set of sample patches, and a detailed tutorial Handling images & movie files (QuickTime, OpenGL) Render and move 3D models (OpenGL) Video signal input (QuickTime, video texture) Video input analysis: motion detect, face tracking (OpenCV, OpenGL) Importing 3D models (.obj file) Importing Quartz Composer files OpenGL Shading Language (GLSL) programming interface Easy integration of visual events using DIPSWindowMixer (OpenGL) == Description == DIPS is a free plug-in software (a set of external objects) for Max/MSP. It supports the designing of the interaction between sound and visual events in Max using Apple’s Core Image, OpenGL and OpenCV technologies, and consequently, provides a powerful and user-friendly programming environment for the creation of interactive multimedia art. DIPS can be used to detect a performer’s motions and to track positions of subtle details, such as the face, mouth, and eyes. It can also be used to measure the distance between objects and a Kinect sensor system, and offers powerful tools for realtime image processing of incoming video stream and stored movie files. In addition, it can be used to create complex images in a virtual three-dimensional space. The DIPS consists of a library of more than 300 Max external objects and abstractions, a comprehensive set of sample patches, and a detailed tutorial. Some of its strong points, in comparison with other similar plug-ins and software, are its ease of programming, power, and efficiency. The sample patches and tutorial contained in the installation package allows composers and artists who are interested in the creation of interactive art to realize sophisticated realtime video effects on a live video signal at their first practice. And because of its ease of programming, it is likely that one will soon acquire skills needed to create state-of-the-art interactive performance works, multimedia installations, interactive multimedia artworks, and Max VJ applications using DIPS. == History == Initially developed by Shu Matsuda in 1997, DIPS was a plug-in software for Max/FTS running on SGI Octane and O2 computers. Since 2000, it has been developed by the DIPS Development Group supervised by Takayuki Rai. Current active group members are Shu Matsuda, Yota Morimoto, Takuto Fukuda, and Keitaro Takahashi. Previously, Chikashi Miyama, Daichi Ando and Takayuki Hamano also contributed to its development. 2013 DIPS5 for Max (Mac OS X) 2009 DIPS4 for Max/MSP (Mac OS X) 2006 DIPS3 for Max/MSP (Mac OS X) 2003 DIPS2 for jMax4 (Mac OS X) 2002 DIPS for jMax2 (Mac OS X & Linux) 2000 DIPS for jMax (Linux)

    Read more →
  • BrownBoost

    BrownBoost

    BrownBoost is a boosting algorithm that may be robust to noisy datasets. BrownBoost is an adaptive version of the boost by majority algorithm. As is the case for all boosting algorithms, BrownBoost is used in conjunction with other machine learning methods. BrownBoost was introduced by Yoav Freund in 2001. == Motivation == AdaBoost performs well on a variety of datasets; however, it can be shown that AdaBoost does not perform well on noisy data sets. This is a result of AdaBoost's focus on examples that are repeatedly misclassified. In contrast, BrownBoost effectively "gives up" on examples that are repeatedly misclassified. The core assumption of BrownBoost is that noisy examples will be repeatedly mislabeled by the weak hypotheses and non-noisy examples will be correctly labeled frequently enough to not be "given up on." Thus only noisy examples will be "given up on," whereas non-noisy examples will contribute to the final classifier. In turn, if the final classifier is learned from the non-noisy examples, the generalization error of the final classifier may be much better than if learned from noisy and non-noisy examples. The user of the algorithm can set the amount of error to be tolerated in the training set. Thus, if the training set is noisy (say 10% of all examples are assumed to be mislabeled), the booster can be told to accept a 10% error rate. Since the noisy examples may be ignored, only the true examples will contribute to the learning process. == Algorithm description == BrownBoost uses a non-convex potential loss function, thus it does not fit into the AdaBoost framework. The non-convex optimization provides a method to avoid overfitting noisy data sets. However, in contrast to boosting algorithms that analytically minimize a convex loss function (e.g. AdaBoost and LogitBoost), BrownBoost solves a system of two equations and two unknowns using standard numerical methods. The only parameter of BrownBoost ( c {\displaystyle c} in the algorithm) is the "time" the algorithm runs. The theory of BrownBoost states that each hypothesis takes a variable amount of time ( t {\displaystyle t} in the algorithm) which is directly related to the weight given to the hypothesis α {\displaystyle \alpha } . The time parameter in BrownBoost is analogous to the number of iterations T {\displaystyle T} in AdaBoost. A larger value of c {\displaystyle c} means that BrownBoost will treat the data as if it were less noisy and therefore will give up on fewer examples. Conversely, a smaller value of c {\displaystyle c} means that BrownBoost will treat the data as more noisy and give up on more examples. During each iteration of the algorithm, a hypothesis is selected with some advantage over random guessing. The weight of this hypothesis α {\displaystyle \alpha } and the "amount of time passed" t {\displaystyle t} during the iteration are simultaneously solved in a system of two non-linear equations ( 1. uncorrelated hypothesis w.r.t example weights and 2. hold the potential constant) with two unknowns (weight of hypothesis α {\displaystyle \alpha } and time passed t {\displaystyle t} ). This can be solved by bisection (as implemented in the JBoost software package) or Newton's method (as described in the original paper by Freund). Once these equations are solved, the margins of each example ( r i ( x j ) {\displaystyle r_{i}(x_{j})} in the algorithm) and the amount of time remaining s {\displaystyle s} are updated appropriately. This process is repeated until there is no time remaining. The initial potential is defined to be 1 m ∑ j = 1 m 1 − erf ( c ) = 1 − erf ( c ) {\displaystyle {\frac {1}{m}}\sum _{j=1}^{m}1-{\mbox{erf}}({\sqrt {c}})=1-{\mbox{erf}}({\sqrt {c}})} . Since a constraint of each iteration is that the potential be held constant, the final potential is 1 m ∑ j = 1 m 1 − erf ( r i ( x j ) / c ) = 1 − erf ( c ) {\displaystyle {\frac {1}{m}}\sum _{j=1}^{m}1-{\mbox{erf}}(r_{i}(x_{j})/{\sqrt {c}})=1-{\mbox{erf}}({\sqrt {c}})} . Thus the final error is likely to be near 1 − erf ( c ) {\displaystyle 1-{\mbox{erf}}({\sqrt {c}})} . However, the final potential function is not the 0–1 loss error function. For the final error to be exactly 1 − erf ( c ) {\displaystyle 1-{\mbox{erf}}({\sqrt {c}})} , the variance of the loss function must decrease linearly w.r.t. time to form the 0–1 loss function at the end of boosting iterations. This is not yet discussed in the literature and is not in the definition of the algorithm below. The final classifier is a linear combination of weak hypotheses and is evaluated in the same manner as most other boosting algorithms. == BrownBoost learning algorithm definition == Input: m {\displaystyle m} training examples ( x 1 , y 1 ) , … , ( x m , y m ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{m},y_{m})} where x j ∈ X , y j ∈ Y = { − 1 , + 1 } {\displaystyle x_{j}\in X,\,y_{j}\in Y=\{-1,+1\}} The parameter c {\displaystyle c} Initialise: s = c {\displaystyle s=c} . (The value of s {\displaystyle s} is the amount of time remaining in the game) r i ( x j ) = 0 {\displaystyle r_{i}(x_{j})=0} ∀ j {\displaystyle \forall j} . The value of r i ( x j ) {\displaystyle r_{i}(x_{j})} is the margin at iteration i {\displaystyle i} for example x j {\displaystyle x_{j}} . While s > 0 {\displaystyle s>0} : Set the weights of each example: W i ( x j ) = e − ( r i ( x j ) + s ) 2 c {\displaystyle W_{i}(x_{j})=e^{-{\frac {(r_{i}(x_{j})+s)^{2}}{c}}}} , where r i ( x j ) {\displaystyle r_{i}(x_{j})} is the margin of example x j {\displaystyle x_{j}} Find a classifier h i : X → { − 1 , + 1 } {\displaystyle h_{i}:X\to \{-1,+1\}} such that ∑ j W i ( x j ) h i ( x j ) y j > 0 {\displaystyle \sum _{j}W_{i}(x_{j})h_{i}(x_{j})y_{j}>0} Find values α , t {\displaystyle \alpha ,t} that satisfy the equation: ∑ j h i ( x j ) y j e − ( r i ( x j ) + α h i ( x j ) y j + s − t ) 2 c = 0 {\displaystyle \sum _{j}h_{i}(x_{j})y_{j}e^{-{\frac {(r_{i}(x_{j})+\alpha h_{i}(x_{j})y_{j}+s-t)^{2}}{c}}}=0} . (Note this is similar to the condition E W i + 1 [ h i ( x j ) y j ] = 0 {\displaystyle E_{W_{i+1}}[h_{i}(x_{j})y_{j}]=0} set forth by Schapire and Singer. In this setting, we are numerically finding the W i + 1 = exp ⁡ ( ⋯ ⋯ ) {\displaystyle W_{i+1}=\exp \left({\frac {\cdots }{\cdots }}\right)} such that E W i + 1 [ h i ( x j ) y j ] = 0 {\displaystyle E_{W_{i+1}}[h_{i}(x_{j})y_{j}]=0} .) This update is subject to the constraint ∑ ( Φ ( r i ( x j ) + α h ( x j ) y j + s − t ) − Φ ( r i ( x j ) + s ) ) = 0 {\displaystyle \sum \left(\Phi \left(r_{i}(x_{j})+\alpha h(x_{j})y_{j}+s-t\right)-\Phi \left(r_{i}(x_{j})+s\right)\right)=0} , where Φ ( z ) = 1 − erf ( z / c ) {\displaystyle \Phi (z)=1-{\mbox{erf}}(z/{\sqrt {c}})} is the potential loss for a point with margin r i ( x j ) {\displaystyle r_{i}(x_{j})} Update the margins for each example: r i + 1 ( x j ) = r i ( x j ) + α h ( x j ) y j {\displaystyle r_{i+1}(x_{j})=r_{i}(x_{j})+\alpha h(x_{j})y_{j}} Update the time remaining: s = s − t {\displaystyle s=s-t} Output: H ( x ) = sign ( ∑ i α i h i ( x ) ) {\displaystyle H(x)={\textrm {sign}}\left(\sum _{i}\alpha _{i}h_{i}(x)\right)} == Empirical results == In preliminary experimental results with noisy datasets, BrownBoost outperformed AdaBoost's generalization error; however, LogitBoost performed as well as BrownBoost. An implementation of BrownBoost can be found in the open source software JBoost.

    Read more →
  • Frequent pattern discovery

    Frequent pattern discovery

    Frequent pattern discovery (or FP discovery, FP mining, or Frequent itemset mining) is part of knowledge discovery in databases, Massive Online Analysis, and data mining; it describes the task of finding the most frequent and relevant patterns in large datasets. The concept was first introduced for mining transaction databases. Frequent patterns are defined as subsets (itemsets, subsequences, or substructures) that appear in a data set with frequency no less than a user-specified or auto-determined threshold. == Techniques == Techniques for FP mining include: market basket analysis cross-marketing catalog design clustering classification recommendation systems For the most part, FP discovery can be done using association rule learning with particular algorithms Eclat, FP-growth and the Apriori algorithm. Other strategies include: Frequent subtree mining Structure mining Sequential pattern mining and respective specific techniques. Implementations exist for various machine learning systems or modules like MLlib for Apache Spark.

    Read more →