AI Assistant Zara

AI Assistant Zara — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Kdb+

    Kdb+

    kdb+ is a column-based relational time series database (TSDB) with in-memory (IMDB) abilities, developed and marketed by KX Systems. The database is commonly used in high-frequency trading (HFT) to store, analyze, process, and retrieve large data sets at high speed. kdb+ can handle billions of records and analyzes data within the database. It is available in 32-bit and 64-bit versions for several operating systems. Financial institutions use kdb+ to analyze time series data such as stock or commodity exchange data. The database has also been used for other time-sensitive data applications, including commodity markets such as energy trading, telecommunications, sensor data, log data, machine and computer network usage monitoring, and real-time analytics in Formula One racing.

    == Overview ==

    kdb+ is a high-performance column-store database designed to process and store large amounts of data. Commonly accessed data is held in random-access memory (RAM), which is faster to access than data in disk storage. Created with financial institutions in mind, the database was developed as a central repository for time series data that supports real-time analysis of billions of records. kdb+ can analyze data over time and responds to queries similar to those of Structured Query Language (SQL). Columnar databases answer some queries more efficiently than row-based database management systems. In kdb+, dictionaries, tables and nanosecond time stamps are native data types and are used to store time series data. At the core of kdb+ is the built-in programming language q, a concise, expressive array query language and a dialect of the language APL. The q language can manipulate streaming, real-time, and historical data. kdb+ uses q to aggregate and analyze data, perform statistical functions, and join data sets, and it supports SQL queries. The vector language q was built for speed and expressiveness and eliminates most need for looping structures. kdb+ includes interfaces in C, C++, Java, C#, and Python.

    == History ==

    In 1998, KX released kdb, a database built on the language K written by Arthur Whitney. In 2003, kdb+ was released as a 64-bit version of kdb. It was created by Arthur Whitney, building on his prior work with array languages. In 2004, the kdb+ tick market database framework was released along with kdb+ taq, a loader for the New York Stock Exchange (NYSE) taq data. In April 2007, KX announced that it was releasing a version of kdb+ for Mac OS X. At that time, kdb+ was also available on Linux, Windows, and Solaris. In September 2012, version 3.0 was released. It was optimized for Intel's upgraded processors and added support for WebSockets and universally unique identifiers (UUIDs, termed globally unique identifiers (GUIDs) in Microsoft software). Intel's Advanced Vector Extensions (AVX) and Streaming SIMD Extensions 4.2 (SSE4.2) on the Sandy Bridge processors of the time allowed for enhanced support of the kdb+ system. In June 2013, version 3.1 was released, with benchmarks up to 8 times faster than older versions. In March 2020, version 4.0 was released. New features included multithreaded primitives, Intel Optane DC persistent memory support and data-at-rest encryption.

  • BookCorpus

    BookCorpus

    BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords. It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models including Google's BERT. The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy. The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors," yet this is factually incorrect. These books were published by self-published ("indie") authors who priced them at free; the books were downloaded without the consent or permission of Smashwords or Smashwords authors and in violation of the Smashwords Terms of Service. The dataset was initially hosted on a University of Toronto webpage. An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created. Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.

  • Geographical cluster

    Geographical cluster

    A geographical cluster is a localized anomaly, usually an excess of something given the distribution or variation of something else. Often it is considered as an incidence rate that is unusual in that there is more of some variable than might be expected. Examples include a local excess disease rate, a crime hot spot, areas of high unemployment, accident blackspots, unusually high positive residuals from a model, high concentrations of flora or fauna, and physical features or events such as earthquake epicenters. Identifying these extreme regions may be useful because there could be implicit geographical associations with other variables that can be identified and would be of interest. Pattern detection via the identification of such geographical clusters is a very simple and generic form of geographical analysis that has many applications in many different contexts. The emphasis is on localized clustering or patterning because this may well contain the most useful information. A geographical cluster is different from a high concentration, as it is generally second order: it factors in the distribution of something else.

    == Geographical cluster detection ==

    Identifying geographical clusters can be an important stage in a geographical analysis. Mapping the locations of unusual concentrations may help identify their causes. Some techniques include the Geographical Analysis Machine and Besag and Newell's cluster detection method.

  • Neural gas

    Neural gas

    Neural gas is an artificial neural network, inspired by the self-organizing map and introduced in 1991 by Thomas Martinetz and Klaus Schulten. The neural gas is a simple algorithm for finding optimal data representations based on feature vectors. The algorithm was named "neural gas" because of the dynamics of the feature vectors during the adaptation process, which distribute themselves like a gas within the data space. It is applied where data compression or vector quantization is an issue, for example speech recognition, image processing or pattern recognition. As a robustly converging alternative to k-means clustering, it is also used for cluster analysis.

    == Algorithm ==

    Suppose we want to model a probability distribution $P(x)$ of data vectors $x$ using a finite number of feature vectors $w_i$, where $i = 1, \cdots, N$. For each time step $t$:

    1. Sample a data vector $x$ from $P(x)$.
    2. Compute the distance between $x$ and each feature vector, and rank the distances. Let $i_0$ be the index of the closest feature vector, $i_1$ the index of the second closest feature vector, and so on.
    3. Update each feature vector by:

    $$w_{i_k}^{t+1} = w_{i_k}^{t} + \varepsilon \cdot e^{-k/\lambda} \cdot (x - w_{i_k}^{t}), \quad k = 0, \cdots, N-1$$

    In the algorithm, $\varepsilon$ can be understood as the learning rate, and $\lambda$ as the neighborhood range. $\varepsilon$ and $\lambda$ are reduced with increasing $t$ so that the algorithm converges after many adaptation steps. The adaptation step of the neural gas can be interpreted as gradient descent on a cost function.
By adapting not only the closest feature vector but all of them, with a step size that decreases with increasing distance rank, the algorithm converges much more robustly than (online) k-means clustering. The neural gas model neither deletes nodes nor creates new ones.

    === Comparison with SOM ===

    Compared to the self-organizing map, the neural gas model does not assume that some vectors are neighbors. If two vectors happen to be close together, they tend to move together, and if two vectors happen to be far apart, they tend not to move together. In contrast, in an SOM, if two vectors are neighbors in the underlying graph, then they will always tend to move together, no matter whether the two vectors happen to be neighbors in the Euclidean space. The name "neural gas" reflects that one can imagine it as what an SOM would be like if there were no underlying graph and all points were free to move without the bonds that bind them together.

    == Variants ==

    A number of variants of the neural gas algorithm exist in the literature to mitigate some of its shortcomings. The most notable is perhaps Bernd Fritzke's growing neural gas, but further elaborations include the Growing When Required network and the incremental growing neural gas. A performance-oriented approach that avoids the risk of overfitting is the Plastic Neural Gas model.

    === Growing neural gas ===

    Fritzke describes the growing neural gas (GNG) as an incremental network model that learns topological relations by using a "Hebb-like learning rule"; unlike the neural gas, it has no parameters that change over time and it is capable of continuous learning, i.e. learning on data streams. GNG has been widely used in several domains, demonstrating its capabilities for clustering data incrementally. The GNG is initialized with two randomly positioned nodes which are initially connected with a zero-age edge and whose errors are set to 0. Since in the GNG input data is presented sequentially, one by one, the following steps are followed at each iteration:

    1. The errors (distances) between the current input and its two closest nodes are calculated.
    2. The error of the winner node (only the closest one) is accumulated accordingly.
    3. The winner node and its topological neighbors (connected by an edge) are moved towards the current input by different fractions of their respective errors.
    4. The ages of all edges connected to the winner node are incremented.
    5. If the winner node and the second winner are connected by an edge, the age of that edge is set to 0. Otherwise, an edge is created between them.
    6. Edges with an age larger than a threshold are removed. Nodes without connections are eliminated.
    7. If the current iteration is an integer multiple of a predefined frequency-creation threshold, a new node is inserted between the node with the largest error (among all) and its topological neighbor presenting the highest error. The link between the former and the latter nodes is eliminated (their errors are decreased by a given factor) and the new node is connected to both of them. The error of the new node is initialized as the updated error of the node which had the largest error (among all).
    8. The accumulated error of all nodes is decreased by a given factor.
    9. If the stopping criterion is not met, the algorithm takes the next input. The criterion might be a given number of epochs, i.e., a pre-set number of times where all data is presented, or reaching a maximum number of nodes.

    === Incremental growing neural gas ===

    Another neural gas variant inspired by the GNG algorithm is the incremental growing neural gas (IGNG). The authors propose the main advantage of this algorithm to be "learning new data (plasticity) without degrading the previously trained network and forgetting the old input data (stability)."

    === Growing when required ===

    Having a network with a growing set of nodes, like the one implemented by the GNG algorithm, was seen as a great advantage; however, the parameter λ was seen to limit learning, because the network could only grow when the iteration count was a multiple of this parameter. The proposal to mitigate this problem was a new algorithm, the Growing When Required network (GWR), which has the network grow more quickly by adding nodes as quickly as possible whenever the network identifies that the existing nodes do not describe the input well enough.

    === Plastic neural gas ===

    The ability to only grow a network may quickly introduce overfitting; on the other hand, removing nodes on the basis of age only, as in the GNG model, does not ensure that the removed nodes are actually useless, because removal depends on a model parameter that should be carefully tuned to the "memory length" of the stream of input data. The "Plastic Neural Gas" model solves this problem by making decisions to add or remove nodes using an unsupervised version of cross-validation, which controls an equivalent notion of "generalization ability" for the unsupervised setting. While growing-only methods cater only for the incremental learning scenario, the ability to grow and shrink is suited to the more general streaming data problem.

    == Implementations ==

    To find the ranking $i_0, i_1, \ldots, i_{N-1}$ of the feature vectors, the neural gas algorithm involves sorting, which is a procedure that does not lend itself easily to parallelization or implementation in analog hardware. However, implementations in both parallel software and analog hardware have been designed.
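
    The basic (non-growing) neural gas update from the Algorithm section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the exponential decay schedule for ε and λ, the default parameter values and the choice to initialize feature vectors from random data points are all assumptions made for the example.

```python
import math
import random

def neural_gas_step(weights, x, eps, lam):
    """Apply one adaptation step in place: rank all feature vectors by
    distance to x, then move each one toward x by eps * exp(-k / lam),
    where k is its rank (0 for the closest)."""
    order = sorted(
        range(len(weights)),
        key=lambda i: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[i])),
    )
    for k, i in enumerate(order):
        step = eps * math.exp(-k / lam)
        weights[i] = [wi + step * (xi - wi) for xi, wi in zip(x, weights[i])]
    return weights

def fit_neural_gas(data, n_units, n_steps, eps_start=0.5, eps_end=0.01,
                   lam_start=None, lam_end=0.1, seed=0):
    """Run n_steps adaptation steps on samples drawn uniformly from data.
    The decay schedule below is an assumed, commonly used choice."""
    rng = random.Random(seed)
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    if lam_start is None:
        lam_start = n_units / 2  # assumed default neighborhood range
    for t in range(n_steps):
        frac = t / n_steps
        eps = eps_start * (eps_end / eps_start) ** frac
        lam = lam_start * (lam_end / lam_start) ** frac
        neural_gas_step(weights, rng.choice(data), eps, lam)
    return weights

if __name__ == "__main__":
    points = [[0.0], [0.1], [-0.1], [10.0], [10.1], [9.9]]
    print(sorted(fit_neural_gas(points, n_units=2, n_steps=2000)))
```

    The call to `sorted` is the ranking step that the Implementations section identifies as the obstacle to parallel or analog implementations.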

  • VideoThang

    VideoThang

    VideoThang was free video editing software for Windows 2000, XP, and Vista. The software has three parts: My Stuff, Edit My Stuff, and My Mix. It accepts the MOV, AVI, MPG, MP4, PNG, WMV, FLV, and MP3 formats. Its official website is no longer available.

    == Reception ==

    Jan Ozer, of PCMag, said that the software "suffers from several unfortunate design and implementation flaws that dramatically limit output quality and overall utility." Jon L. Jacobi, of PC World, said that the software "may not be the most flexible multimedia editor in the world, but the trim/zoom basics are there, it's free, and it's so simple to use that just about anyone in the world should be able figure it out." Amit Agarwal, of Digital Inspiration, said that the software "doesn’t offer loads of features like other video editors but is perfect for making quick video slideshows of your pictures that you can upload on the web or share via email."

  • Latent Dirichlet allocation

    Latent Dirichlet allocation

    In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that explains how a collection of text documents can be described by a set of unobserved "topics." For example, given a set of news articles, LDA might discover that one topic is characterized by words like "president", "government", and "election", while another is characterized by "team", "game", and "score". It is one of the most common topic models. The LDA model was first presented as a graphical model for population genetics by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. The model was subsequently applied to machine learning by David Blei, Andrew Ng, and Michael I. Jordan in 2003. Although its most frequent application is in modeling text corpora, it has also been used for other problems, such as in clinical psychology, social science, and computational musicology. The core assumption of LDA is that documents are represented as a random mixture of latent topics, and each topic is characterized by a probability distribution over words. The model is a generalization of probabilistic latent semantic analysis (pLSA), differing primarily in that LDA treats the topic mixture as a Dirichlet prior, leading to more reasonable mixtures and less susceptibility to overfitting. Learning the latent topics and their associated probabilities from a corpus is typically done using Bayesian inference, often with methods like Gibbs sampling or variational Bayes. == History == In the context of population genetics, LDA was proposed by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. LDA was applied in machine learning by David Blei, Andrew Ng and Michael I. Jordan in 2003. == Overview == === Population genetics === In population genetics, the model is used to detect the presence of structured genetic variation in a group of individuals. The model assumes that alleles carried by individuals under study have origin in various extant or past populations. 
The model and various inference algorithms allow scientists to estimate the allele frequencies in those source populations and the origin of alleles carried by individuals under study. The source populations can be interpreted ex-post in terms of various evolutionary scenarios. In association studies, detecting the presence of genetic structure is considered a necessary preliminary step to avoid confounding. === Clinical psychology, mental health, and social science === In clinical psychology research, LDA has been used to identify common themes of self-images experienced by young people in social situations. Other social scientists have used LDA to examine large sets of topical data from discussions on social media (e.g., tweets about prescription drugs). Additionally, supervised Latent Dirichlet Allocation with covariates (SLDAX) has been specifically developed to combine latent topics identified in texts with other manifest variables. This approach allows for the integration of text data as predictors in statistical regression analyses, improving the accuracy of mental health predictions. One of the main advantages of SLDAX over traditional two-stage approaches is its ability to avoid biased estimates and incorrect standard errors, allowing for a more accurate analysis of psychological texts. In the field of social sciences, LDA has proven to be useful for analyzing large datasets, such as social media discussions. For instance, researchers have used LDA to investigate tweets discussing socially relevant topics, like the use of prescription drugs and cultural differences in China. By analyzing these large text corpora, it is possible to uncover patterns and themes that might otherwise go unnoticed, offering valuable insights into public discourse and perception in real time. === Musicology === In the context of computational musicology, LDA has been used to discover tonal structures in different corpora. 
=== Machine learning === One application of LDA in machine learning – specifically, topic discovery, a subproblem in natural language processing – is to discover topics in a collection of documents, and then automatically classify any individual document within the collection in terms of how "relevant" it is to each of the discovered topics. A topic is considered to be a set of terms (i.e., individual words or phrases) that, taken together, suggest a shared theme. For example, in a document collection related to pet animals, the terms dog, spaniel, beagle, golden retriever, puppy, bark, and woof would suggest a DOG_related theme, while the terms cat, siamese, Maine coon, tabby, manx, meow, purr, and kitten would suggest a CAT_related theme. There may be many more topics in the collection – e.g., related to diet, grooming, healthcare, behavior, etc. that we do not discuss for simplicity's sake. (Very common, so called stop words in a language – e.g., "the", "an", "that", "are", "is", etc., – would not discriminate between topics and are usually filtered out by pre-processing before LDA is performed. Pre-processing also converts terms to their "root" lexical forms – e.g., "barks", "barking", and "barked" would be converted to "bark".) If the document collection is sufficiently large, LDA will discover such sets of terms (i.e., topics) based upon the co-occurrence of individual terms, though the task of assigning a meaningful label to an individual topic (i.e., that all the terms are DOG_related) is up to the user, and often requires specialized knowledge (e.g., for collection of technical documents). The LDA approach assumes that: The semantic content of a document is composed by combining one or more terms from one or more topics. Certain terms are ambiguous, belonging to more than one topic, with different probability. 
(For example, the term training can apply to both dogs and cats, but is more likely to refer to dogs, which are used as work animals or participate in obedience or skill competitions.) However, in a document, the accompanying presence of specific neighboring terms (which belong to only one topic) will disambiguate their usage. Most documents will contain only a relatively small number of topics. In the collection, individual topics will occur with differing frequencies. That is, they have a probability distribution, so that a given document is more likely to contain some topics than others. Within a topic, certain terms will be used much more frequently than others. In other words, the terms within a topic will also have their own probability distribution.

    When LDA machine learning is employed, both sets of probabilities are computed during the training phase, using Bayesian methods and an expectation–maximization algorithm. LDA is a generalization of the older approach of probabilistic latent semantic analysis (pLSA); the pLSA model is equivalent to LDA under a uniform Dirichlet prior distribution. pLSA relies on only the first two assumptions above and does not take the remainder into account. While both methods are similar in principle and require the user to specify the number of topics to be discovered before the start of training (as with k-means clustering), LDA has the following advantages over pLSA: LDA yields better disambiguation of words and a more precise assignment of documents to topics. Computing probabilities allows a "generative" process by which a collection of new "synthetic documents" can be generated that would closely reflect the statistical characteristics of the original collection. Unlike LDA, pLSA is vulnerable to overfitting, especially when the size of the corpus increases. The LDA algorithm is more readily amenable to scaling up for large data sets using the MapReduce approach on a computing cluster.
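
    The "generative" reading of these assumptions can be made concrete with a small sketch: each topic gets a word distribution, each document gets a topic mixture, and every word is produced by first picking a topic and then a word from that topic. The code below is an illustration only, using the Python standard library; the function names, the symmetric priors `alpha` and `beta`, and their default values are assumptions, and it performs no inference (it does not learn topics from real documents).

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution by normalizing
    independent Gamma(alpha, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Generate synthetic documents following the LDA generative story.
    alpha and beta are assumed symmetric Dirichlet parameters."""
    rng = random.Random(seed)
    # One word distribution per topic.
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    docs = []
    for n_words in doc_lengths:
        # One topic mixture per document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(n_words):
            z = rng.choices(range(n_topics), weights=theta)[0]  # topic for this position
            words.append(rng.choices(vocab, weights=phi[z])[0])  # word from that topic
        docs.append(words)
    return docs, phi

if __name__ == "__main__":
    vocab = ["dog", "bark", "puppy", "cat", "meow", "kitten"]
    docs, phi = generate_corpus(vocab, n_topics=2, doc_lengths=[6, 6, 6])
    for doc in docs:
        print(" ".join(doc))
```

    Small values of `alpha` tend to give documents dominated by a few topics, which matches the assumption that most documents contain only a relatively small number of topics.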
    == Model ==

    With plate notation, which is often used to represent probabilistic graphical models (PGMs), the dependencies among the many variables can be captured concisely. The boxes are "plates" representing replicates, which are repeated entities. The outer plate represents documents, while the inner plate represents the repeated word positions in a given document; each position is associated with a choice of topic and word. The variable names are defined as follows:

    M denotes the number of documents
    N is the number of words in a given document (document i has $N_i$ words)
    α is the parameter of the Dirichlet prior on the per-document topic distributions
    β is the parameter of the Dirichlet prior on the per-topic word distribution
    $\theta_i$ is the topic distribution for document i
    $\varphi_k$ is the word distribution for topic k
    $z_{ij}$ is the topic for the j-th word in document i
    $w_{ij}$ is the specific word.

    The fact that W is grayed out means that words $w_{ij}$ are the only observable variables, and the other variables are latent variables. As proposed in the original paper, a sparse Dirichlet prior can be used to model the to

  • Stress majorization

    Stress majorization

    Stress majorization is an optimization strategy used in multidimensional scaling (MDS) where, for a set of $n$ $m$-dimensional data items, a configuration $X$ of $n$ points in $r$-dimensional space (with $r \ll m$) is sought that minimizes the so-called stress function $\sigma(X)$. Usually $r$ is 2 or 3, i.e. the $(n \times r)$ matrix $X$ lists points in 2- or 3-dimensional Euclidean space so that the result may be visualised (i.e. an MDS plot). The function $\sigma$ is a cost or loss function that measures the squared differences between ideal ($m$-dimensional) distances and actual distances in $r$-dimensional space. It is defined as:

    $$\sigma(X) = \sum_{i<j\le n} w_{ij} \left(d_{ij}(X) - \delta_{ij}\right)^2$$
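
    As a rough illustration, the stress of a given configuration can be evaluated directly from the sum over pairs $i < j$ of $w_{ij}(d_{ij}(X) - \delta_{ij})^2$. The sketch below only computes the stress value; it does not perform the majorization (minimization) itself. It assumes that $d_{ij}(X)$ is the Euclidean distance between rows $i$ and $j$ of $X$, that $\delta_{ij}$ is the ideal distance, and that the weights $w_{ij}$ default to 1 when none are given. The function name `stress` is hypothetical.

```python
import math

def stress(X, delta, w=None):
    """Weighted sum over pairs i < j of (d_ij(X) - delta_ij)^2.
    X: list of points; delta, w: n x n nested lists (w defaults to all ones)."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = math.dist(X[i], X[j])  # actual distance in the layout
            w_ij = 1.0 if w is None else w[i][j]
            total += w_ij * (d_ij - delta[i][j]) ** 2
    return total

if __name__ == "__main__":
    layout = [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
    ideal = [[0, 4, 4], [4, 0, 3], [4, 3, 0]]
    print(stress(layout, ideal))
```

    A layout whose pairwise distances match the ideal distances exactly has stress 0; any mismatch contributes its squared size, scaled by the pair's weight.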

  • Distribution learning theory

    Distribution learning theory

    The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It was proposed by Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and was inspired by the PAC framework introduced by Leslie Valiant. In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like machine learning, approximation algorithms, applied probability and statistics. This article explains the basic definitions, tools and results in this framework from the theory of computation point of view.

    == Definitions ==

    Let $X$ be the support of the distributions of interest. As in the original work of Kearns et al., if $X$ is finite it can be assumed without loss of generality that $X = \{0,1\}^n$, where $n$ is the number of bits that have to be used in order to represent any $y \in X$. We focus on probability distributions over $X$. There are two possible representations of a probability distribution $D$ over $X$:

    probability distribution function (or evaluator): an evaluator $E_D$ for $D$ takes as input any $y \in X$ and outputs a real number $E_D[y]$ which denotes the probability of $y$ according to $D$, i.e. $E_D[y] = \Pr[Y = y]$ if $Y \sim D$.

    generator: a generator $G_D$ for $D$ takes as input a string of truly random bits $y$ and outputs $G_D[y] \in X$ according to the distribution $D$. A generator can be interpreted as a routine that simulates sampling from the distribution $D$ given a sequence of fair coin tosses.

    A distribution $D$ is said to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time. Let $C_X$ be a class of distributions over $X$, that is, $C_X$ is a set such that every $D \in C_X$ is a probability distribution with support $X$. $C_X$ can also be written as $C$ for simplicity.

    In order to evaluate learnability, it is necessary to have a way to measure how well an approximated distribution $D'$ fits the sampled distribution $D$. There are several ways to measure the divergence between two distributions. Three common possibilities are the Kullback–Leibler divergence, the total variation distance of probability measures, and the Kolmogorov distance. Total variation and Kolmogorov distance are true metrics, while KL divergence is not (it lacks symmetry). These measures are ordered by convergence strength: closeness in KL divergence implies closeness in total variation (via Pinsker's inequality), which in turn implies closeness in Kolmogorov distance. Therefore, a learnability result proven under KL divergence automatically holds under the weaker measures, but not vice versa.
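
    The ordering of the three measures can be checked numerically on small examples. The sketch below is an illustration under stated assumptions: both distributions are given as probability lists over the same finite, ordered support; `q` is positive wherever `p` is; KL divergence uses the natural logarithm; total variation is taken as half the L1 distance; and the Kolmogorov distance is the largest gap between the two cumulative distribution functions. The function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Half the L1 distance between the two probability vectors."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kolmogorov_distance(p, q):
    """Largest absolute difference between the cumulative distributions."""
    cdf_p = cdf_q = worst = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        worst = max(worst, abs(cdf_p - cdf_q))
    return worst

if __name__ == "__main__":
    p = [0.4, 0.2, 0.4]
    q = [0.2, 0.6, 0.2]
    kl, tv, ks = kl_divergence(p, q), total_variation(p, q), kolmogorov_distance(p, q)
    # Pinsker's inequality: tv <= sqrt(kl / 2); and ks <= tv.
    print(ks, tv, math.sqrt(kl / 2))
```

    Swapping the arguments of `kl_divergence` generally changes its value, which is the lack of symmetry that keeps it from being a metric.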
Since certain measures may be more appropriate in specific applications, we will use $d(D, D')$ to denote a selected divergence between the distribution $D$ and the distribution $D'$.

    The basic input that we use in order to learn a distribution is a number of samples drawn from this distribution. From the computational point of view, the assumption is that such a sample is given in a constant amount of time. So it is like having access to an oracle $GEN(D)$ that returns a sample from the distribution $D$. Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution $D$ in a class of distributions $C$. This quantity is called the sample complexity of the learning algorithm.

    To make the problem of distribution learning clearer, consider the problem of supervised learning. In this framework of statistical learning theory, a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ is given and the goal is to find a target function $f: X \rightarrow Y$ that minimizes some loss function, e.g. the square loss function. More formally, $f = \arg\min_g \int V(y, g(x)) \, d\rho(x, y)$, where $V(\cdot, \cdot)$ is the loss function, e.g. $V(y, z) = (y - z)^2$, and $\rho(x, y)$ is the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution $\rho_x(y)$ is known, then the target function has the closed form $f(x) = \int_y y \, d\rho_x(y)$. So the set $S$ is a set of samples from the probability distribution $\rho(x, y)$. Now the goal of distributional learning theory is to find $\rho$ given $S$, which can then be used to find the target function $f$.

    Definition of learnability: A class of distributions $C$ is called efficiently learnable if for every $\epsilon > 0$ and $0 < \delta \le 1$, given access to $GEN(D)$ for an unknown distribution $D \in C$, there exists a polynomial time algorithm $A$, called a learning algorithm of $C$, that outputs a generator or an evaluator of a distribution $D'$ such that

    $$\Pr[d(D, D') \le \epsilon] \ge 1 - \delta$$

    If we know that $D' \in C$, then $A$ is called a proper learning algorithm; otherwise it is called an improper learning algorithm.

    In some settings the class of distributions $C$ is a class of well-known distributions which can be described by a set of parameters. For instance, $C$ could be the class of all the Gaussian distributions $N(\mu, \sigma^2)$. In this case the algorithm $A$ should be able to estimate the parameters $\mu, \sigma$, and $A$ is called a parameter learning algorithm.
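
    For the Gaussian example, a parameter learning algorithm can be as simple as the classical sample estimators applied to draws from the oracle. The sketch below is only an illustration of that idea, not an algorithm from Kearns et al.: `make_gen` is a hypothetical stand-in for $GEN(D)$, and `learn_gaussian` returns the sample mean and the (unbiased-variance) sample standard deviation as estimates of $\mu$ and $\sigma$.

```python
import math
import random

def make_gen(mu, sigma, seed=0):
    """Hypothetical GEN(D) oracle for D = N(mu, sigma^2): each call
    returns one independent sample."""
    rng = random.Random(seed)
    return lambda: rng.gauss(mu, sigma)

def learn_gaussian(gen, n_samples):
    """Estimate (mu, sigma) from n_samples oracle calls (n_samples >= 2)."""
    xs = [gen() for _ in range(n_samples)]
    mean = sum(xs) / n_samples
    var = sum((x - mean) ** 2 for x in xs) / (n_samples - 1)
    return mean, math.sqrt(var)

if __name__ == "__main__":
    gen = make_gen(mu=2.0, sigma=3.0, seed=1)
    print(learn_gaussian(gen, 20000))
```

    The number of oracle calls, `n_samples`, plays the role of the sample complexity: more samples give estimates that are closer to the true parameters with higher probability.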
Obviously, parameter learning for simple distributions is a very well-studied field called statistical estimation, and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distribution learning theory deals with learning classes of distributions that have more complicated descriptions. == First results == In their seminal work, Kearns et al. deal with the case where $A$ is described in terms of a finite polynomial-sized circuit, and they proved the following for some specific classes of distributions. $OR$ gate distributions: for this kind of distribution there is no polynomial-sized evaluator, unless $\#P\subseteq P/{\text{poly}}$. On the other hand, this class is efficiently learnable with a generator. Parity gate distributions: this class is efficiently learnable with both generator and evaluator. Mixtures of Hamming Balls: this class is efficiently learnable with both generator and evaluator. Probabilistic Finite Automata: this class is not efficiently learnable with an evaluator under the Noisy Parity Assumption, which is an impossibility assumption in the PAC learning fram

  • Genigraphics

    Genigraphics

    Genigraphics is a large-format printing service bureau specializing in providing poster session services to medical and scientific conferences throughout the US and Canada. The company began in 1973 as a division of General Electric. == History == Genigraphics began as a computer graphics system, developed by General Electric in the late 1960s, for NASA to use in space flight simulation. The technologies thus developed provided a foundation for the company's expansion into the commercial market. The Computed Images System & Services division (CISS, to become Genigraphics Corporation) of GE delivered the first presentation graphics system to Amoco Oil's corporate headquarters in 1973. It was named the 100 Series, and was based on DEC's PDP 11 series of mini computer systems. The first Genigraphics systems (100 Series and 100A Series) used an array of buttons, dials, knobs and joysticks, along with a built in keyboard, as the means of user interface. The PDP-11/40 computer was housed in a tall cabinet and used random access magnetic tape drives (DECtape) for storing completed presentations. The graphics generator (Forox recorder) was capable of outputting 2,000 line resolution, suitable for 35mm and 72mm film and large sheet film positive using larger cassettes for recording. 4000 and 8000 line resolution was later achieved with duplex scanning and 4x scanning by modifying to the Forox recorder's settings menu. Subsequent models (100B,C,D,D+ and D+/GVP) replaced the knobs and dials with an on screen, text based menu system, a graphics tablet and a pen. The pen/tablet combination gave way to a mouse like device in later models, and served to provide the interface with the graphics tools. User interaction with the computer for functions such as media initialization or modem to modem data transfer required a DECwriter serial terminal. 
In 1982, GE divested the Genigraphics division along with a host of other "non essential" business units (Genitext, Geniponics) and Genigraphics Corporation was born. Shortly after the divestiture, the headquarters of Genigraphics was moved from Liverpool, New York to Saddle Brook, New Jersey. Major success followed as the company grew exponentially over the next few years selling both systems and slide creation services. Genigraphics film recorders produced high-resolution digital images on 35mm film. The computer-generated scenes for The Last Starfighter were calculated on a Cray X-MP supercomputer and mastered with a Genigraphics film recorder. At its peak, Genigraphics Corporation employed roughly 300 people in 24 offices worldwide, with revenues upwards of $70 million annually. By the late 1980s Genigraphics saw demand for its proprietary systems dwindle and began selling the MASTERPIECE 8770 film recorder and GRAFTIME software as a peripheral for DEC Vaxes, IBM PC AT’s, and Mac NuBus machines. But the MASTERPIECE film recorder proved too expensive to sell in volume. In 1988, the company began a partnership with Microsoft to help develop the PowerPoint software. In exchange, every copy of PowerPoint included a “Send to Genigraphics” link to have files sent to a Genigraphics service bureau to be produced as 35mm slides. This partnership continued until 2001. In 1989, after three years of flat revenue, Genigraphics sold its hardware business in order to focus on its service bureau business and partnership with Microsoft via PowerPoint. In 1994, all assets of Genigraphics, including equipment, software development, in-house artwork, trademarks, and rights to the Microsoft partnership, were sold to InFocus Corporation of Wilsonville, Oregon who continued to operate under the Genigraphics brand name. The twenty-four service bureaus were consolidated to a 20,000 square foot facility next to the FedEx hub in Memphis, Tennessee. 
This allowed PowerPoint slide orders to be received until 10pm and delivered across the United States by the following morning. In 1995, InFocus registered www.genigraphics.com and was among the first to offer a form of ecommerce allowing 35mm slides, color prints and transparencies, printed booklets, and digital projectors to be purchased online. In 1998, then current management bought Genigraphics from InFocus and have operated it continuously ever since as Genigraphics LLC. That same year, InFocus projector rentals were added to the “Send to Genigraphics” link in PowerPoint and Genigraphics became the rental and repair center for all InFocus national accounts. It also marked Genigraphics entry into the new industry of large format printing; leveraging their knowledge of, and access to, PowerPoint programming code to develop a proprietary printer driver to output directly to an Epson 9500 wide format printer. At the time, Genigraphics was the exclusive 35mm slide vendor for all Kinko’s stores in the United States and poster printing was added to the arrangement. In 2003, Genigraphics closed their 35mm slide E6 photo lab – one of the last high-volume commercial E6 labs in the US – and expanded their large format printing capabilities. Since 2003, Genigraphics has become a major player in the poster session market, providing printing and on-site services to medical and scientific conferences throughout the US and Canada. As of February 2019, over 150,000 medical or scientific ‘ePosters’ are made available through their ResearchPosters.com archive service. === Partnership with Microsoft and development of PowerPoint === As presentations began to be created on personal computers in the late 80’s, Genigraphics sought presentation software partners in Silicon Valley who would be interested in sending files to Genigraphics via dial-up modem to be produced on 35mm slides. 
In 1987, Michael Beetner, Director of Marketing Planning for Genigraphics, met with Robert Gaskins, head of Microsoft's Graphics Business Unit, who was leading the development of the newly released PowerPoint software. A joint development agreement between Microsoft and Genigraphics was agreed upon and announced at Mac World 1988. According to Erica Robles-Anderson and Patrik Svensson, "It would be hard to overestimate Genigraphics’ influence on PowerPoint. PowerPoint 2.0 was designed for Genigraphics film recorders. It shipped with Genigraphics color palettes, schemes, and the distinctively Genigraphics color-gradient backgrounds. The application contained a ‘Send to Genigraphics’ menu item that wrote the presentation to floppy disk or transmitted the order directly via modem. Within three and a half months PowerPoint orders accounted for ten percent of revenue at Genigraphics service centers. PowerPoint 3.0 was even more intimately dependent upon Genigraphics. The software incorporated a collection of clip art images and symbols that had been produced by hundreds of artists at dozens of service centers across tens of thousands of presentations. Genigraphics artists designed PowerPoint 3.0 colors, templates, and sample presentations. The software even used Genigraphics (rather than Excel) chart style. Bar charts were rendered two-dimensionally with apparent thickness added to make them seemingly recede from the axes. The technique made it easier for viewers to compare bar heights and estimate values from axis ticks and labels. Pie charts were handled analogously. Microsoft paid Genigraphics to produce more than 500 clip art drawings and symbols used in Microsoft programs.” In exchange for Genigraphics development efforts, Microsoft included a “Send to Genigraphics” link in every copy of PowerPoint through the 10.0 version (2000/2001). The arrangement came to an end when Microsoft restructured as a result of anti-trust lawsuits.

  • Latent Dirichlet allocation

    Latent Dirichlet allocation

    In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that explains how a collection of text documents can be described by a set of unobserved "topics." For example, given a set of news articles, LDA might discover that one topic is characterized by words like "president", "government", and "election", while another is characterized by "team", "game", and "score". It is one of the most common topic models. The LDA model was first presented as a graphical model for population genetics by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. The model was subsequently applied to machine learning by David Blei, Andrew Ng, and Michael I. Jordan in 2003. Although its most frequent application is in modeling text corpora, it has also been used for other problems, such as in clinical psychology, social science, and computational musicology. The core assumption of LDA is that documents are represented as a random mixture of latent topics, and each topic is characterized by a probability distribution over words. The model is a generalization of probabilistic latent semantic analysis (pLSA), differing primarily in that LDA treats the topic mixture as a Dirichlet prior, leading to more reasonable mixtures and less susceptibility to overfitting. Learning the latent topics and their associated probabilities from a corpus is typically done using Bayesian inference, often with methods like Gibbs sampling or variational Bayes. == History == In the context of population genetics, LDA was proposed by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. LDA was applied in machine learning by David Blei, Andrew Ng and Michael I. Jordan in 2003. == Overview == === Population genetics === In population genetics, the model is used to detect the presence of structured genetic variation in a group of individuals. The model assumes that alleles carried by individuals under study have origin in various extant or past populations. 
The model and various inference algorithms allow scientists to estimate the allele frequencies in those source populations and the origin of alleles carried by individuals under study. The source populations can be interpreted ex-post in terms of various evolutionary scenarios. In association studies, detecting the presence of genetic structure is considered a necessary preliminary step to avoid confounding. === Clinical psychology, mental health, and social science === In clinical psychology research, LDA has been used to identify common themes of self-images experienced by young people in social situations. Other social scientists have used LDA to examine large sets of topical data from discussions on social media (e.g., tweets about prescription drugs). Additionally, supervised Latent Dirichlet Allocation with covariates (SLDAX) has been specifically developed to combine latent topics identified in texts with other manifest variables. This approach allows for the integration of text data as predictors in statistical regression analyses, improving the accuracy of mental health predictions. One of the main advantages of SLDAX over traditional two-stage approaches is its ability to avoid biased estimates and incorrect standard errors, allowing for a more accurate analysis of psychological texts. In the field of social sciences, LDA has proven to be useful for analyzing large datasets, such as social media discussions. For instance, researchers have used LDA to investigate tweets discussing socially relevant topics, like the use of prescription drugs and cultural differences in China. By analyzing these large text corpora, it is possible to uncover patterns and themes that might otherwise go unnoticed, offering valuable insights into public discourse and perception in real time. === Musicology === In the context of computational musicology, LDA has been used to discover tonal structures in different corpora. 
=== Machine learning === One application of LDA in machine learning – specifically, topic discovery, a subproblem in natural language processing – is to discover topics in a collection of documents, and then automatically classify any individual document within the collection in terms of how "relevant" it is to each of the discovered topics. A topic is considered to be a set of terms (i.e., individual words or phrases) that, taken together, suggest a shared theme. For example, in a document collection related to pet animals, the terms dog, spaniel, beagle, golden retriever, puppy, bark, and woof would suggest a DOG_related theme, while the terms cat, siamese, Maine coon, tabby, manx, meow, purr, and kitten would suggest a CAT_related theme. There may be many more topics in the collection – e.g., related to diet, grooming, healthcare, behavior, etc. that we do not discuss for simplicity's sake. (Very common, so called stop words in a language – e.g., "the", "an", "that", "are", "is", etc., – would not discriminate between topics and are usually filtered out by pre-processing before LDA is performed. Pre-processing also converts terms to their "root" lexical forms – e.g., "barks", "barking", and "barked" would be converted to "bark".) If the document collection is sufficiently large, LDA will discover such sets of terms (i.e., topics) based upon the co-occurrence of individual terms, though the task of assigning a meaningful label to an individual topic (i.e., that all the terms are DOG_related) is up to the user, and often requires specialized knowledge (e.g., for collection of technical documents). The LDA approach assumes that: The semantic content of a document is composed by combining one or more terms from one or more topics. Certain terms are ambiguous, belonging to more than one topic, with different probability. 
(For example, the term training can apply to both dogs and cats, but is more likely to refer to dogs, which are used as work animals or participate in obedience or skill competitions.) However, in a document, the accompanying presence of specific neighboring terms (which belong to only one topic) will disambiguate their usage. Most documents will contain only a relatively small number of topics. In the collection, individual topics will occur with differing frequencies. That is, they have a probability distribution, so that a given document is more likely to contain some topics than others. Within a topic, certain terms will be used much more frequently than others. In other words, the terms within a topic will also have their own probability distribution. When LDA machine learning is employed, both sets of probabilities are computed during the training phase, using Bayesian methods and an expectation–maximization algorithm. LDA is a generalization of the older approach of probabilistic latent semantic analysis (pLSA); the pLSA model is equivalent to LDA under a uniform Dirichlet prior distribution. pLSA relies on only the first two assumptions above and does not care about the remainder. While both methods are similar in principle and require the user to specify the number of topics to be discovered before the start of training (as with k-means clustering), LDA has the following advantages over pLSA: LDA yields better disambiguation of words and a more precise assignment of documents to topics. Computing probabilities allows a "generative" process by which a collection of new "synthetic documents" can be generated that would closely reflect the statistical characteristics of the original collection. Unlike LDA, pLSA is vulnerable to overfitting, especially when the size of the corpus increases. The LDA algorithm is more readily amenable to scaling up for large data sets using the MapReduce approach on a computing cluster.
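The "generative" process mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a training algorithm: the topic-word and document-topic distributions are drawn at random from symmetric Dirichlet priors rather than learned from a corpus, and the function names (`sample_dirichlet`, `generate_corpus`), the toy vocabulary, and the prior values are hypothetical choices made for the example.

```python
import random


def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet by normalising independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha, beta, seed=0):
    """Generate synthetic documents: topics per document, then words per topic."""
    rng = random.Random(seed)
    # One probability distribution over the vocabulary for each topic.
    topic_word = [sample_dirichlet(rng, [beta] * len(vocab)) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # One probability distribution over topics for each document.
        theta = sample_dirichlet(rng, [alpha] * n_topics)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]    # pick a topic
            w = rng.choices(vocab, weights=topic_word[z])[0]      # pick a word from it
            words.append(w)
        docs.append(words)
    return topic_word, docs


vocab = ["dog", "bark", "puppy", "cat", "purr", "kitten"]
topic_word, docs = generate_corpus(vocab, n_topics=2, n_docs=3, doc_len=8,
                                   alpha=0.5, beta=0.5)
for d in docs:
    print(" ".join(d))
```

Small prior values push each document toward a few dominant topics and each topic toward a few dominant terms, which matches the assumption that most documents contain only a small number of topics.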
== Model == With plate notation, which is often used to represent probabilistic graphical models (PGMs), the dependencies among the many variables can be captured concisely. The boxes are "plates" representing replicates, which are repeated entities. The outer plate represents documents, while the inner plate represents the repeated word positions in a given document; each position is associated with a choice of topic and word. The variable names are defined as follows: $M$ denotes the number of documents; $N$ is the number of words in a given document (document $i$ has $N_{i}$ words); $\alpha$ is the parameter of the Dirichlet prior on the per-document topic distributions; $\beta$ is the parameter of the Dirichlet prior on the per-topic word distribution; $\theta _{i}$ is the topic distribution for document $i$; $\varphi _{k}$ is the word distribution for topic $k$; $z_{ij}$ is the topic for the $j$-th word in document $i$; $w_{ij}$ is the specific word. The fact that W is grayed out means that words $w_{ij}$ are the only observable variables, and the other variables are latent variables. As proposed in the original paper, a sparse Dirichlet prior can be used to model the to

  • Variational message passing

    Variational message passing

    Variational message passing (VMP) is an approximate inference technique for continuous- or discrete-valued Bayesian networks, with conjugate-exponential parents, developed by John Winn. VMP was developed as a means of generalizing the approximate variational methods used by such techniques as latent Dirichlet allocation, and works by updating an approximate distribution at each node through messages in the node's Markov blanket. == Likelihood lower bound == Given some set of hidden variables $H$ and observed variables $V$, the goal of approximate inference is to maximize a lower bound on the probability that a graphical model is in the configuration $V$. Over some probability distribution $Q$ (to be defined later), $\ln P(V)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{P(H|V)}}=\sum _{H}Q(H)\left[\ln {\frac {P(H,V)}{Q(H)}}-\ln {\frac {P(H|V)}{Q(H)}}\right]$. So, if we define our lower bound to be $L(Q)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{Q(H)}}$, then the log likelihood is simply this bound plus the relative entropy between $Q$ and the posterior $P(H|V)$. Because the relative entropy is non-negative, the function $L$ defined above is indeed a lower bound of the log likelihood of our observation $V$. The distribution $Q$ will have a simpler character than that of $P$ because marginalizing over $P$ is intractable for all but the simplest of graphical models. In particular, VMP uses a factorized distribution $Q(H)=\prod _{i}Q_{i}(H_{i})$, where $H_{i}$ is a disjoint part of the graphical model.
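The decomposition of $\ln P(V)$ into the bound $L(Q)$ plus a relative entropy can be checked numerically on a toy model. The sketch below assumes a single binary hidden variable and one fixed observation; the joint probabilities in `joint` are made-up numbers chosen only for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical joint P(H=h, V=v) for a binary hidden variable H and one fixed observation v.
joint = {0: 0.12, 1: 0.28}
evidence = sum(joint.values())                        # P(V=v)
posterior = {h: p / evidence for h, p in joint.items()}  # P(H | V=v)


def lower_bound(q):
    """L(Q) = sum_H Q(H) ln( P(H,V) / Q(H) )."""
    return sum(q[h] * math.log(joint[h] / q[h]) for h in q if q[h] > 0)


def relative_entropy(q, p):
    """KL(Q || P) = sum_H Q(H) ln( Q(H) / P(H) ); always non-negative."""
    return sum(q[h] * math.log(q[h] / p[h]) for h in q if q[h] > 0)


q = {0: 0.5, 1: 0.5}  # an arbitrary approximating distribution
print("ln P(V)      :", math.log(evidence))
print("L(Q) + KL    :", lower_bound(q) + relative_entropy(q, posterior))
print("L(Q) alone   :", lower_bound(q))
```

Setting `q` equal to `posterior` makes the relative entropy zero, so the bound becomes tight; any other choice leaves a gap equal to the relative entropy.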
== Determining the update rule == The likelihood estimate needs to be as large as possible; because it is a lower bound, getting closer to $\log P$ improves the approximation of the log likelihood. By substituting in the factorized version of $Q$, $L(Q)$, parameterized over the hidden nodes $H_{i}$ as above, is simply the negative relative entropy between $Q_{j}$ and $Q_{j}^{*}$ plus other terms independent of $Q_{j}$ if $Q_{j}^{*}$ is defined as $Q_{j}^{*}(H_{j})={\frac {1}{Z}}e^{\mathbb {E} _{-j}\{\ln P(H,V)\}}$, where $\mathbb {E} _{-j}\{\ln P(H,V)\}$ is the expectation over all distributions $Q_{i}$ except $Q_{j}$. Thus, if we set $Q_{j}$ to be $Q_{j}^{*}$, the bound $L$ is maximized. == Messages in variational message passing == Parents send their children the expectation of their sufficient statistic while children send their parents their natural parameter, which also requires messages to be sent from the co-parents of the node. == Relationship to exponential families == Because all nodes in VMP come from exponential families and all parents of nodes are conjugate to their children nodes, the expectation of the sufficient statistic can be computed from the normalization factor. == VMP algorithm == The algorithm begins by computing the expected value of the sufficient statistics for that vector. Then, until the likelihood converges to a stable value (this is usually accomplished by setting a small threshold value and running the algorithm until it increases by less than that threshold value), do the following at each node: Get all messages from parents.
Get all messages from children (this might require the children to get messages from the co-parents). Compute the expected value of the node's sufficient statistics. == Constraints == Because every child must be conjugate to its parent, this has limited the types of distributions that can be used in the model. For example, the parents of a Gaussian distribution must be a Gaussian distribution (corresponding to the mean) and a gamma distribution (corresponding to the precision, or one over $\sigma ^{2}$ in more common parameterizations). Discrete variables can have Dirichlet parents, and Poisson and exponential nodes must have gamma parents. More recently, VMP has been extended to handle models that violate this conditional conjugacy constraint. == Literature == Winn, John; Bishop, Christopher M. (2005). "Variational Message Passing" (PDF). Journal of Machine Learning Research. 6: 661–694. ISSN 1533-7928. Beal, M.J. (2003). Variational Algorithms for Approximate Bayesian Inference (PDF) (PhD). Gatsby Computational Neuroscience Unit, University College London. Archived from the original (PDF) on 2005-04-28. Retrieved 2007-02-15.

  • Grammatical evolution

    Grammatical evolution

    Grammatical evolution (GE) is a genetic programming (GP) technique (or approach) from evolutionary computation pioneered by Conor Ryan, JJ Collins and Michael O'Neill in 1998 at the BDS Group in the University of Limerick. As in any other GP approach, the objective is to find an executable program, program fragment, or function, which will achieve a good fitness value for a given objective function. In most published work on GP, a LISP-style tree-structured expression is directly manipulated, whereas GE applies genetic operators to an integer string, subsequently mapped to a program (or similar) through the use of a grammar, which is typically expressed in Backus–Naur form. One of the benefits of GE is that this mapping simplifies the application of search to different programming languages and other structures. == Problem addressed == In type-free, conventional Koza-style GP, the function set must meet the requirement of closure: all functions must be capable of accepting as their arguments the output of all other functions in the function set. Usually, this is implemented by dealing with a single data-type such as double-precision floating point. While modern Genetic Programming frameworks support typing, such type-systems have limitations that Grammatical Evolution does not suffer from. == GE's solution == GE offers a solution to the single-type limitation by evolving solutions according to a user-specified grammar (usually a grammar in Backus-Naur form). Therefore, the search space can be restricted, and domain knowledge of the problem can be incorporated. The inspiration for this approach comes from a desire to separate the "genotype" from the "phenotype": in GP, the objects the search algorithm operates on and what the fitness evaluation function interprets are one and the same. In contrast, GE's "genotypes" are ordered lists of integers which code for selecting rules from the provided context-free grammar. 
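The genotype-to-phenotype mapping can be illustrated with a small sketch. It assumes the commonly described scheme in which each integer (codon) selects a production by taking it modulo the number of rules for the leftmost non-terminal, with the genotype reused ("wrapped") a bounded number of times. The toy grammar, the function name `map_genotype`, the `max_wraps` limit, and the simplification of consuming a codon at every non-terminal are assumptions made for this example.

```python
# Toy context-free grammar: each non-terminal maps to a list of alternative productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["1"]],
}


def map_genotype(genotype, grammar, start="<expr>", max_wraps=2):
    """Map a list of integers to a program string, or None if mapping fails."""
    symbols = [start]   # pending symbols, leftmost first
    output = []         # terminals emitted so far
    used = 0            # codons consumed, counting reuse after wrapping
    limit = len(genotype) * (max_wraps + 1)
    while symbols:
        sym = symbols.pop(0)
        if sym not in grammar:
            output.append(sym)          # terminal: emit as-is
            continue
        if used >= limit:
            return None                 # ran out of codons: invalid individual
        rules = grammar[sym]
        codon = genotype[used % len(genotype)]
        used += 1
        # The codon picks one production for the leftmost non-terminal.
        symbols = list(rules[codon % len(rules)]) + symbols
    return " ".join(output)


print(map_genotype([0, 1, 0, 0, 1, 1], GRAMMAR))  # x + 1
print(map_genotype([0], GRAMMAR))                 # None
```

Because the search operates only on the integer list, any optimizer that can produce integer vectors could in principle be paired with such a mapping function.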
The phenotype, however, is the same as in Koza-style GP: a tree-like structure that is evaluated recursively. This model is more in line with how genetics work in nature, where there is a separation between an organism's genotype and the final expression of phenotype in proteins, etc. Separating genotype and phenotype allows a modular approach. In particular, the search portion of the GE paradigm needn't be carried out by any one particular algorithm or method. Observe that the objects GE performs search on are the same as those used in genetic algorithms. This means, in principle, that any existing genetic algorithm package, such as the popular GAlib, can be used to carry out the search, and a developer implementing a GE system need only worry about carrying out the mapping from list of integers to program tree. It is also in principle possible to perform the search using some other method, such as particle swarm optimization (see the remark below); the modular nature of GE creates many opportunities for hybrids as the problem of interest to be solved dictates. Brabazon and O'Neill have successfully applied GE to predicting corporate bankruptcy, forecasting stock indices, bond credit ratings, and other financial applications. GE has also been used with a classic predator-prey model to explore the impact of parameters such as predator efficiency, niche number, and random mutations on ecological stability. It is possible to structure a GE grammar that for a given function/terminal set is equivalent to genetic programming. == Criticism == Despite its successes, GE has been the subject of some criticism. One issue is that as a result of its mapping operation, GE's genetic operators do not achieve high locality which is a highly regarded property of genetic operators in evolutionary algorithms. == Variants == Although GE was originally described in terms of using an Evolutionary Algorithm, specifically, a Genetic Algorithm, other variants exist. 
For example, GE researchers have experimented with using particle swarm optimization to carry out the searching instead of genetic algorithms with results comparable to that of normal GE; this is referred to as a "grammatical swarm"; using only the basic PSO model it has been found that PSO is probably equally capable of carrying out the search process in GE as simple genetic algorithms are. (Although PSO is normally a floating-point search paradigm, it can be discretized, e.g., by simply rounding each vector to the nearest integer, for use with GE.) Yet another possible variation that has been experimented with in the literature is attempting to encode semantic information in the grammar in order to further bias the search process. Other work showed that, with biased grammars that leverage domain knowledge, even random search can be used to drive GE. == Related work == GE was originally a combination of the linear representation as used by the Genetic Algorithm for Developing Software (GADS) and Backus Naur Form grammars, which were originally used in tree-based GP by Wong and Leung in 1995 and Whigham in 1996. Other related work noted in the original GE paper was that of Frederic Gruau, who used a conceptually similar "embryonic" approach, as well as that of Keller and Banzhaf, which similarly used linear genomes. == Implementations == There are several implementations of GE. These include the following.

  • Pixel aspect ratio

    Pixel aspect ratio

    A pixel aspect ratio (PAR) is a mathematical ratio that describes how the width of a pixel in a digital image compares to the height of that pixel. Most digital imaging systems display an image as a grid of tiny, square pixels. However, some imaging systems, especially those that must be compatible with standard-definition television motion pictures, display an image as a grid of rectangular pixels, in which the pixel width and height are different. Pixel aspect ratio describes this difference. Use of pixel aspect ratio mostly involves pictures pertaining to standard-definition television and some other exceptional cases. Most other imaging systems, including those that comply with SMPTE standards and practices, use square pixels. PAR is also known as sample aspect ratio and abbreviated SAR, though it can be confused with storage aspect ratio. == Introduction == The ratio of the width to the height of an image is known as the aspect ratio, or more precisely the display aspect ratio (DAR) – the aspect ratio of the image as displayed; for TV, DAR was traditionally 4:3 (a.k.a. fullscreen), with 16:9 (a.k.a. widescreen) now the standard for HDTV. In digital images, there is a distinction with the storage aspect ratio (SAR), which is the ratio of pixel dimensions. If an image is displayed with square pixels, then these ratios agree; if not, then non-square, "rectangular" pixels are used, and these ratios disagree. The aspect ratio of the pixels themselves is known as the pixel aspect ratio (PAR) – for square pixels this is 1:1 – and these are related by the identity: $\mathrm{DAR}=\mathrm{SAR}\times \mathrm{PAR}$. Rearranging (solving for PAR) yields: $\mathrm{PAR}=\mathrm{DAR}/\mathrm{SAR}$. For example: A 640 × 480 VGA image has a SAR of 640/480 = 4:3, and if displayed on a 4:3 display (DAR = 4:3) has square pixels, hence a PAR of 1:1. By contrast, a 720 × 576 D-1 PAL image has a SAR of 720/576 = 5:4, but if displayed on a 4:3 display (DAR = 4:3) the PAR is 4/3 : 5/4 = 16:15 ≈ 1.067.
This means that the pixels of the PAL picture must be "stretched" by this amount to fit in the 4:3 display. In analog images such as film there is no notion of pixel, nor notion of SAR or PAR, but in the digitization of analog images the resulting digital image has pixels, hence SAR (and accordingly PAR, if displayed at the same aspect ratio as the original). Non-square pixels arise often in early digital TV standards, related to digitalization of analog TV signals – whose vertical and "effective" horizontal resolutions differ and are thus best described by non-square pixels – and also in some digital video cameras and computer display modes, such as Color Graphics Adapter (CGA). Today they arise also in transcoding between resolutions with different SARs. Actual displays do not generally have non-square pixels, though digital sensors might; they are rather a mathematical abstraction used in resampling images to convert between resolutions. There are several complicating factors in understanding PAR, particularly as it pertains to digitization of analog video: First, analog video does not have pixels, but rather a raster scan, and thus has a well-defined vertical resolution (the lines of the raster), but not a well-defined horizontal resolution, since each line is an analog signal. However, by a standardized sampling rate, the effective horizontal resolution can be determined by the sampling theorem, as is done below. Second, due to overscan, some of the lines at the top and bottom of the raster are not visible, as are some of the possible image on the left and right – see Overscan: Analog to digital resolution issues. Also, the resolution may be rounded (DV NTSC uses 480 lines, rather than the 486 that are possible). Third, analog video signals are interlaced – each image (frame) is sent as two "fields", each with half the lines. Thus either the pixels are twice as tall as they would be without interlacing, or the image is deinterlaced. 
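The relationship between storage, display and pixel aspect ratios used in the VGA and PAL examples of this introduction can be reproduced with exact rational arithmetic. The helper name `pixel_aspect_ratio` is hypothetical; the sketch only assumes that PAR equals DAR divided by SAR, which is what those two examples compute.

```python
from fractions import Fraction


def pixel_aspect_ratio(width_px, height_px, dar):
    """Return PAR = DAR / SAR, where SAR is the ratio of the pixel dimensions."""
    sar = Fraction(width_px, height_px)
    return dar / sar


four_by_three = Fraction(4, 3)
vga = pixel_aspect_ratio(640, 480, four_by_three)   # square pixels
pal = pixel_aspect_ratio(720, 576, four_by_three)   # non-square pixels

print("VGA 640x480 on 4:3:", vga)
print("PAL 720x576 on 4:3:", pal, "=", float(pal))
```

Using `Fraction` instead of floating point keeps ratios such as 16:15 exact, which matters when ratios are later multiplied with line counts to get pixel widths.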
== Background == Video is presented as a sequential series of images called video frames. Historically, video frames were created and recorded in analog form. Digital display technology, digital broadcast technology, and digital video compression evolved separately, which resulted in video frame differences that must be addressed using pixel aspect ratio. Digital video frames are generally defined as a grid of pixels used to present each sequential image. The horizontal component is defined by pixels (or samples), and is known as a video line. The vertical component is defined by the number of lines, as in 480 lines. Standard-definition television standards and practices were developed as broadcast technologies and intended for terrestrial broadcasting, and were therefore not designed for digital video presentation. Such standards define an image as an array of well-defined horizontal "Lines", well-defined vertical "Line Duration" and a well-defined picture center. However, no standard-definition television standard properly defines image edges or explicitly demands a certain number of picture elements per line. Furthermore, analog video systems such as NTSC 480i and PAL 576i, instead of employing progressively displayed frames, employ fields or interlaced half-frames displayed in an interwoven manner to reduce flicker and double the image rate for smoother motion. === Analog-to-digital conversion === As computers became powerful enough to serve as video editing tools, video digital-to-analog converters and analog-to-digital converters were made to overcome this incompatibility. To convert analog video lines into a series of square pixels, the industry adopted a default sampling rate at which luma values were extracted into pixels. The luma sampling rate for 480i pictures was 12+3⁄11 MHz and for 576i pictures was 14+3⁄4 MHz. The term pixel aspect ratio was coined when ITU-R BT.601 (commonly known as Rec.
601) specified that standard-definition television pictures are made of lines of exactly 720 non-square pixels. ITU-R BT.601 did not define the exact pixel aspect ratio but did provide enough information to calculate it based on industry practices: the standard luma sampling rate is precisely 13+1⁄2 MHz. Based on this information: The pixel aspect ratio for 480i would be 10:11 as: $12\tfrac{3}{11} \div 13\tfrac{1}{2} = \tfrac{10}{11}$ The pixel aspect ratio for 576i would be 59:54 as: $14\tfrac{3}{4} \div 13\tfrac{1}{2} = \tfrac{59}{54}$ SMPTE RP 187 further attempted to standardize the pixel aspect ratio values for 480i and 576i. It designated 177:160 for 480i or 1035:1132 for 576i. However, due to significant difference with practices in effect by industry and the computational load that they imposed upon the involved hardware, SMPTE RP 187 was simply ignored. SMPTE RP 187 information annex A.4 further suggested the use of 10:11 for 480i. As of this writing, ITU-R BT.601-6, which is the latest edition of ITU-R BT.601, still implies that the pixel aspect ratios mentioned above are correct. === Digital video processing === As stated above, ITU-R BT.601 specified that standard-definition television pictures are made of lines of 720 non-square pixels, sampled with a precisely specified sampling rate. A simple mathematical calculation reveals that a 704 pixel width would be enough to contain a 480i or 576i standard 4:3 picture: A 4:3 480-line picture, digitized with the Rec. 601-recommended sampling rate, would be 704 non-square pixels wide. $\frac{x}{480} \times \frac{10}{11} = \frac{4}{3} \Rightarrow x = \frac{480 \times 11 \times 4}{10 \times 3} = 704$ A 4:3 576-line picture, digitized with the Rec. 601-recommended sampling rate, would be 702+54⁄59 non-square pixels wide.
$\frac{x}{576} \times \frac{59}{54} = \frac{4}{3} \Rightarrow x = \frac{576 \times 54 \times 4}{59 \times 3} = 702\tfrac{54}{59}$ Unfortunately, not all standard TV pictures are exactly 4:3: As mentioned earlier, in analog video, the center of a picture is well-defined but the edges of the picture are not standardized. As a result, some analog devices (mostly PAL devices but also some NTSC devices) generated motion pictures that were horizontally (slightly) wider. This also proportionately applies to anamorphic widescreen (16:9) pictures. Therefore, to maintain a safe margin of error, ITU-R BT.601 required sampling 16 more non-square pixels per line (8 more at each edge) to ensure saving all video data near the margins. This requirement, however, had implications for PAL motion pictures. PAL pixel aspect ratios for standard (4:3) and anamorphic wide screen (16:9), respectively 59:54 and 118:81, were awkward for digital image processing, especially for mixing PAL and NTSC video clips. Therefore, video editing products chose the almost equivalent value
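
The Rec. 601 arithmetic described in this excerpt (pixel aspect ratios derived from sampling rates, and the resulting 4:3 picture widths) can be checked with a short sketch. The constant and function names below are hypothetical, chosen for illustration; the rates themselves are the ones quoted in the text.

```python
from fractions import Fraction

# Luma sampling rates quoted in the text, in MHz.
REC601_RATE = Fraction(27, 2)         # 13 1/2 MHz, the Rec. 601 rate
SQUARE_RATE_480I = Fraction(135, 11)  # 12 3/11 MHz, square-pixel rate for 480i
SQUARE_RATE_576I = Fraction(59, 4)    # 14 3/4 MHz, square-pixel rate for 576i


def par_from_rates(square_rate, actual_rate=REC601_RATE):
    """PAR is the square-pixel sampling rate divided by the actual sampling rate."""
    return square_rate / actual_rate


def width_for_4_3(lines, par):
    """Solve (x / lines) * par = 4/3 for the pixel width x."""
    return Fraction(4, 3) * lines / par


par_480i = par_from_rates(SQUARE_RATE_480I)
par_576i = par_from_rates(SQUARE_RATE_576I)
width_480i = width_for_4_3(480, par_480i)
width_576i = width_for_4_3(576, par_576i)

print(par_480i, par_576i)      # 10/11 59/54
print(width_480i, width_576i)  # 704 41472/59
```

The value 41472/59 equals 702+54⁄59, and adding the 16 extra samples per line to 704 gives the 720-pixel line width specified by ITU-R BT.601.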

    Read more →
  • Waffles (machine learning)

    Waffles (machine learning)

    Waffles is a collection of command-line tools for performing machine learning operations developed at Brigham Young University. These tools are written in C++, and are available under the GNU Lesser General Public License. == Description == The Waffles machine learning toolkit contains command-line tools for performing various operations related to machine learning, data mining, and predictive modeling. The primary focus of Waffles is to provide tools that are simple to use in scripted experiments or processes. For example, the supervised learning algorithms included in Waffles are all designed to support multi-dimensional labels, classification and regression, automatically impute missing values, and automatically apply necessary filters to transform the data to a type that the algorithm can support, such that arbitrary learning algorithms can be used with arbitrary data sets. Many other machine learning toolkits provide similar functionality, but require the user to explicitly configure data filters and transformations to make the data compatible with a particular learning algorithm. The algorithms provided in Waffles also have the ability to automatically tune their own parameters (at the cost of additional computational overhead). Because Waffles is designed for script-ability, it deliberately avoids presenting its tools in a graphical environment. It does, however, include a graphical "wizard" tool that guides the user to generate a command that will perform a desired task. This wizard does not actually perform the operation, but requires the user to paste the command that it generates into a command terminal or a script. The idea motivating this design is to prevent the user from becoming "locked in" to a graphical interface. All of the Waffles tools are implemented as thin wrappers around functionality in a C++ class library. This makes it possible to convert scripted processes into native applications with minimal effort.
Waffles was first released as an open source project in 2005. Since that time, it has been developed at Brigham Young University, with a new version having been released approximately every 6–9 months. Waffles is not an acronym—the toolkit was named after the food for historical reasons. == Advantages == Some of the advantages of Waffles in contrast with other popular open source machine learning toolkits include: Waffles automatically takes care of many issues related to data format in order to simplify its tools. Because it is implemented in C++, many of its algorithms are particularly fast. Also, the lack of dependency on any virtual machine makes it easier to deploy in conjunction with other applications. The functionality included in Waffles is very broad, including algorithms for dimensionality reduction, collaborative filtering, visualization, clustering, supervised learning, optimization, linear algebra, data transformation, image and signal processing, policy learning, and sparse matrix operations. == Disadvantages == Although Waffles provides significant breadth, it lacks the depth of many toolkits that focus on a particular area of machine learning. The Weka (machine learning) toolkit, for example, provides many more classification algorithms than Waffles provides. Waffles only has a limited graphical interface.

    Read more →
  • Teacher forcing

    Teacher forcing

    Teacher forcing is an algorithm for training the weights of recurrent neural networks (RNNs). It involves feeding observed sequence values (i.e. ground-truth samples) back into the RNN after each step, thus forcing the RNN to stay close to the ground-truth sequence. The term "teacher forcing" can be motivated by comparing the RNN to a human student taking a multi-part exam where the answer to each part (for example a mathematical calculation) depends on the answer to the preceding part. In this analogy, rather than grading every answer at the end, with the risk that the student fails every single part even though they only made a mistake in the first one, a teacher records the score for each individual part and then tells the student the correct answer, to be used in the next part. The use of an external teacher signal is in contrast to real-time recurrent learning (RTRL). Teacher signals are known from oscillator networks. The promise of teacher forcing is that it reduces training time. The term "teacher forcing" was introduced in 1989 by Ronald J. Williams and David Zipser, who reported that the technique was already being "frequently used in dynamical supervised learning tasks" around that time. A NeurIPS 2016 paper introduced the related method of "professor forcing".
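
The difference between feeding ground-truth values and feeding the network's own outputs can be illustrated with a toy rollout. This is a minimal sketch, not a training procedure: the scalar "RNN" (`rnn_step`), its fixed weights, and the `rollout` helper are all hypothetical, and no gradient update is performed. It only shows which value is fed back at each step.

```python
import math


def rnn_step(h, x, w_h=0.5, w_x=1.0, w_o=1.0):
    """One step of a toy scalar RNN: returns (new hidden state, prediction)."""
    h_new = math.tanh(w_h * h + w_x * x)
    return h_new, w_o * h_new


def rollout(targets, teacher_forcing):
    """Predict targets[1:] step by step; return (inputs fed, predictions).

    With teacher_forcing=True the next input is the ground-truth value;
    otherwise it is the network's own previous prediction (free running).
    """
    h, x = 0.0, targets[0]
    inputs, preds = [], []
    for t in range(1, len(targets)):
        inputs.append(x)
        h, y = rnn_step(h, x)
        preds.append(y)
        # The only difference between the two modes is this feedback choice.
        x = targets[t] if teacher_forcing else y
    return inputs, preds


targets = [0.1, 0.4, 0.2, 0.7]
forced_inputs, forced_preds = rollout(targets, teacher_forcing=True)
free_inputs, free_preds = rollout(targets, teacher_forcing=False)

# Per-step squared error against the ground truth, as a training loss would use.
forced_loss = sum((p - t) ** 2 for p, t in zip(forced_preds, targets[1:])) / len(forced_preds)
print(forced_inputs)  # the ground-truth sequence, shifted by one step
print(forced_loss)
```

In the forced rollout an early prediction error does not propagate into later inputs, which mirrors the exam analogy above: each step is scored separately and then continues from the correct answer.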

    Read more →