AI App UI Design

AI App UI Design — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Clarizen

    Clarizen

    Clarizen, Inc. is a project management software and collaborative work management company. Clarizen uses a software as a service business model. Clarizen's features include attaching CAD drawings to a project, moving between the project view and design view and an E-mail reporting feature. In May 2014 Clarizen raised $35 million in venture capital investment led by Goldman Sachs. The round brought investment to $90 million. Previous investors, including Benchmark Capital, Carmel Ventures, DAG Ventures, Opus Capital and Vintage Investment Partners participated. In April 2020, Clarizen appointed Matt Zilli as its new CEO, replacing Boaz Chalamish who is appointed as Executive Chairman. In January 2021 Clarizen was acquired by Planview.

    Read more →
  • GNU Octave

    GNU Octave

    GNU Octave is a scientific programming language for scientific computing and numerical computation. Among other things, Octave can be used to solve linear and nonlinear problems numerically and to perform other numerical experiments using a language that is mostly compatible with MATLAB. It may also be used as a batch-oriented language. As part of the GNU Project, it is free software under the terms of the GNU General Public License. == History == The project was conceived around 1988. At first it was intended to be a companion to a chemical reactor design course. Full development was started by John W. Eaton in 1992. The first alpha release dates back to 4 January 1993 and on 17 February 1994 version 1.0 was released. Version 9.2.0 was released on 7 June 2024. The program is named after Octave Levenspiel, a former professor of the principal author. Levenspiel was known for his ability to perform quick back-of-the-envelope calculations. == Development history == == Developments == In addition to use on desktops for personal scientific computing, Octave is used in academia and industry. For example, Octave was used on a massive parallel computer at Pittsburgh Supercomputing Center to find vulnerabilities related to guessing social security numbers. Acceleration with OpenCL or CUDA is also possible with use of GPUs. == Technical details == Octave is written in C++ using the C++ standard library. Octave uses an interpreter to execute the Octave scripting language. Octave is extensible using dynamically loadable modules. Octave interpreter has an OpenGL-based graphics engine to create plots, graphs and charts and to save or print them. Alternatively, gnuplot can be used for the same purpose. Octave includes a graphical user interface (GUI) in addition to the traditional command-line interface (CLI); see #User interfaces for details. == Octave, the language == The Octave language is an interpreted programming language. It is a structured programming language (similar to C) and supports many common C standard library functions, and also certain UNIX system calls and functions. However, it does not support passing arguments by reference although function arguments are copy-on-write to avoid unnecessary duplication. Octave programs consist of a list of function calls or a script. The syntax is matrix-based and provides various functions for matrix operations. It supports various data structures and allows object-oriented programming. Its syntax is very similar to MATLAB, and careful programming of a script will allow it to run on both Octave and MATLAB. Because Octave is made available under the GNU General Public License, it may be freely changed, copied and used. The program runs on Microsoft Windows and most Unix and Unix-like operating systems, including Linux, Android, and macOS. == Notable features == === Command and variable name completion === Typing a TAB character on the command line causes Octave to attempt to complete variable, function, and file names (similar to Bash's tab completion). Octave uses the text before the cursor as the initial portion of the name to complete. === Command history === When running interactively, Octave saves the commands typed in an internal buffer so that they can be recalled and edited. === Data structures === Octave includes a limited amount of support for organizing data in structures. In this example, we see a structure x with elements a, b, and c, (an integer, an array, and a string, respectively): === Short-circuit Boolean operators === Octave's && and || logical operators are evaluated in a short-circuit fashion (like the corresponding operators in the C language), in contrast to the element-by-element operators & and |. === Increment and decrement operators === Octave includes the C-like increment and decrement operators ++ and -- in both their prefix and postfix forms. Octave also does augmented assignment, e.g. x += 5. === Unwind-protect === Octave supports a limited form of exception handling modelled after the unwind_protect of Lisp. The general form of an unwind_protect block looks like this: As a general rule, GNU Octave recognizes as termination of a given block either the keyword end (which is compatible with the MATLAB language) or a more specific keyword endblock or, in some cases, end_block. As a consequence, an unwind_protect block can be terminated either with the keyword end_unwind_protect as in the example, or with the more portable keyword end. The cleanup part of the block is always executed. In case an exception is raised by the body part, cleanup is executed immediately before propagating the exception outside the block unwind_protect. GNU Octave also supports another form of exception handling (compatible with the MATLAB language): This latter form differs from an unwind_protect block in two ways. First, exception_handling is only executed when an exception is raised by body. Second, after the execution of exception_handling the exception is not propagated outside the block (unless a rethrow( lasterror ) statement is explicitly inserted within the exception_handling code). === Variable-length argument lists === Octave has a mechanism for handling functions that take an unspecified number of arguments without explicit upper limit. To specify a list of zero or more arguments, use the special argument varargin as the last (or only) argument in the list. varargin is a cell array containing all the input arguments. === Variable-length return lists === A function can be set up to return any number of values by using the special return value varargout. For example: === C++ integration === It is also possible to execute Octave code directly in a C++ program. For example, here is a code snippet for calling rand([10,1]): C and C++ code can be integrated into GNU Octave by creating oct files, or using the MATLAB compatible MEX files. == MATLAB compatibility == Octave has been built with MATLAB compatibility in mind, and shares many features with MATLAB: % Script: myscript.m a = 5; b = a 2 % Function: myfunc.m function result = myfunc(x) result = x^2 + 3; end Matrices as fundamental data type. Built-in support for complex numbers. Powerful built-in math functions and extensive function libraries. Extensibility in the form of user-defined functions. Octave treats incompatibility with MATLAB as a bug; therefore, it could be considered a software clone, which does not infringe software copyright as per Lotus v. Borland court case. MATLAB scripts from the MathWorks' FileExchange repository in principle are compatible with Octave. However, while they are often provided and uploaded by users under an Octave compatible and proper open source BSD license, the FileExchange Terms of use prohibit any usage beside MathWorks' proprietary MATLAB. === Syntax compatibility === There are a few purposeful, albeit minor, syntax additions Archived 2012-04-26 at the Wayback Machine: Comment lines can be prefixed with the # character as well as the % character; Various C-based operators ++, --, +=, =, /= are supported; Elements can be referenced without creating a new variable by cascaded indexing, e.g. [1:10](3); Strings can be defined with the double-quote " character as well as the single-quote ' character; When the variable type is single (a single-precision floating-point number), Octave calculates the "mean" in the single-domain (MATLAB in double-domain) which is faster but gives less accurate results; Blocks can also be terminated with more specific Control structure keywords, i.e., endif, endfor, endwhile, etc.; Functions can be defined within scripts and at the Octave prompt; Presence of a do-until loop (similar to do-while in C). === Function compatibility === Many, but not all, of the numerous MATLAB functions are available in GNU Octave, some of them accessible through packages in Octave Forge. The functions available as part of either core Octave or Forge packages are listed online Archived 2024-03-14 at the Wayback Machine. A list of unavailable functions is included in the Octave function __unimplemented.m__. Unimplemented functions are also listed under many Octave Forge packages in the Octave Wiki. When an unimplemented function is called the following error message is shown: == User interfaces == Octave comes with an official graphical user interface (GUI) and an integrated development environment (IDE) based on Qt. It has been available since Octave 3.8, and has become the default interface (over the command-line interface) with the release of Octave 4.0. It was well-received by an EDN contributor, who wrote "[Octave] now has a very workable GUI" in reviewing the then-new GUI in 2014. Several 3rd-party graphical front-ends have also been developed, like ToolboX for coding education. == GUI applications == With Octave code, the user can create GUI applications. See GUI Development (GNU Octave (version 7.1.0)). Below are some examples: Button, edit control, checkboxTextboxListbox wit

    Read more →
  • Multilinear subspace learning

    Multilinear subspace learning

    Multilinear subspace learning is an approach for disentangling the causal factor of data formation and performing dimensionality reduction. The Dimensionality reduction can be performed on a data tensor that contains a collection of observations that have been vectorized, or observations that are treated as matrices and concatenated into a data tensor. Here are some examples of data tensors whose observations are vectorized or whose observations are matrices concatenated into data tensor images (2D/3D), video sequences (3D/4D), and hyperspectral cubes (3D/4D). The mapping from a high-dimensional vector space to a set of lower dimensional vector spaces is a multilinear projection. When observations are retained in the same organizational structure as matrices or higher order tensors, their representations are computed by performing linear projections into the column space, row space and fiber space. Multilinear subspace learning algorithms are higher-order generalizations of linear subspace learning methods such as principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA) and canonical correlation analysis (CCA). == Background == Multilinear methods may be causal in nature and perform causal inference, or they may be simple regression methods from which no causal conclusion are drawn. Linear subspace learning algorithms are traditional dimensionality reduction techniques that are well suited for datasets that are the result of varying a single causal factor. Unfortunately, they often become inadequate when dealing with datasets that are the result of multiple causal factors. . Multilinear subspace learning can be applied to observations whose measurements were vectorized and organized into a data tensor for causally aware dimensionality reduction. These methods may also be employed in reducing horizontal and vertical redundancies irrespective of the causal factors when the observations are treated as a "matrix" (ie. a collection of independent column/row observations) and concatenated into a tensor. == Algorithms == === Multilinear principal component analysis === Historically, multilinear principal component analysis has been referred to as "M-mode PCA", a terminology which was coined by Peter Kroonenberg. In 2005, Vasilescu and Terzopoulos introduced the Multilinear PCA terminology as a way to better differentiate between multilinear tensor decompositions that computed 2nd order statistics associated with each data tensor mode, and subsequent work on Multilinear Independent Component Analysis that computed higher order statistics for each tensor mode. MPCA is an extension of PCA. === Multilinear independent component analysis === Multilinear independent component analysis is an extension of ICA. === Multilinear linear discriminant analysis === Multilinear extension of LDA TTP-based: Discriminant Analysis with Tensor Representation (DATER) TTP-based: General tensor discriminant analysis (GTDA) TVP-based: Uncorrelated Multilinear Discriminant Analysis (UMLDA) === Multilinear canonical correlation analysis === Multilinear extension of CCA TTP-based: Tensor Canonical Correlation Analysis (TCCA) TVP-based: Multilinear Canonical Correlation Analysis (MCCA) TVP-based: Bayesian Multilinear Canonical Correlation Analysis (BMTF) A TTP is a direct projection of a high-dimensional tensor to a low-dimensional tensor of the same order, using N projection matrices for an Nth-order tensor. It can be performed in N steps with each step performing a tensor-matrix multiplication (product). The N steps are exchangeable. This projection is an extension of the higher-order singular value decomposition (HOSVD) to subspace learning. Hence, its origin is traced back to the Tucker decomposition in 1960s. A TVP is a direct projection of a high-dimensional tensor to a low-dimensional vector, which is also referred to as the rank-one projections. As TVP projects a tensor to a vector, it can be viewed as multiple projections from a tensor to a scalar. Thus, the TVP of a tensor to a P-dimensional vector consists of P projections from the tensor to a scalar. The projection from a tensor to a scalar is an elementary multilinear projection (EMP). In EMP, a tensor is projected to a point through N unit projection vectors. It is the projection of a tensor on a single line (resulting a scalar), with one projection vector in each mode. Thus, the TVP of a tensor object to a vector in a P-dimensional vector space consists of P EMPs. This projection is an extension of the canonical decomposition, also known as the parallel factors (PARAFAC) decomposition. === Typical approach in MSL === There are N sets of parameters to be solved, one in each mode. The solution to one set often depends on the other sets (except when N=1, the linear case). Therefore, the suboptimal iterative procedure in is followed. Initialization of the projections in each mode For each mode, fixing the projection in all the other mode, and solve for the projection in the current mode. Do the mode-wise optimization for a few iterations or until convergence. This is originated from the alternating least square method for multi-way data analysis. == Code == MATLAB Tensor Toolbox by Sandia National Laboratories. The MPCA algorithm written in Matlab (MPCA+LDA included). The UMPCA algorithm written in Matlab (data included). The UMLDA algorithm written in Matlab (data included). == Tensor data sets == 3D gait data (third-order tensors): 128x88x20(21.2M); 64x44x20(9.9M); 32x22x10(3.2M);

    Read more →
  • Linear discriminant analysis

    Linear discriminant analysis

    Linear discriminant analysis (LDA), normal discriminant analysis (NDA), canonical variates analysis (CVA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification. LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label). Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables have a normal distribution, which is a fundamental assumption of the LDA method. LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA, in contrast, does not take into account any difference in class, and factor analysis builds the feature combinations based on similarities rather than differences. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made. LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification - the act of distributing things into groups, classes or categories of the same type. == History == The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936. It is different from an ANOVA or MANOVA, which is used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership. == LDA for two classes == Consider a set of observations x → {\displaystyle {\vec {x}}} (also called features, attributes, variables or measurements) for each sample of an object or event with known class y {\displaystyle y} . This set of samples is called the training set in a supervised learning context. The classification problem is then to find a good predictor for the class y {\displaystyle y} of any sample of the same distribution (not necessarily from the training set) given only an observation x → {\displaystyle {\vec {x}}} . LDA approaches the problem by assuming that the conditional probability density functions p ( x → | y = 0 ) {\displaystyle p({\vec {x}}|y=0)} and p ( x → | y = 1 ) {\displaystyle p({\vec {x}}|y=1)} are both the normal distribution with mean and covariance parameters ( μ → 0 , Σ 0 ) {\displaystyle \left({\vec {\mu }}_{0},\Sigma _{0}\right)} and ( μ → 1 , Σ 1 ) {\displaystyle \left({\vec {\mu }}_{1},\Sigma _{1}\right)} , respectively. Under this assumption, the Bayes-optimal solution is to predict points as being from the second class if the log of the likelihood ratios is bigger than some threshold T, so that: 1 2 ( x → − μ → 0 ) T Σ 0 − 1 ( x → − μ → 0 ) + 1 2 ln ⁡ | Σ 0 | − 1 2 ( x → − μ → 1 ) T Σ 1 − 1 ( x → − μ → 1 ) − 1 2 ln ⁡ | Σ 1 | > T {\displaystyle {\frac {1}{2}}({\vec {x}}-{\vec {\mu }}_{0})^{\mathrm {T} }\Sigma _{0}^{-1}({\vec {x}}-{\vec {\mu }}_{0})+{\frac {1}{2}}\ln |\Sigma _{0}|-{\frac {1}{2}}({\vec {x}}-{\vec {\mu }}_{1})^{\mathrm {T} }\Sigma _{1}^{-1}({\vec {x}}-{\vec {\mu }}_{1})-{\frac {1}{2}}\ln |\Sigma _{1}|\ >\ T} Without any further assumptions, the resulting classifier is referred to as quadratic discriminant analysis (QDA). LDA instead makes the additional simplifying homoscedasticity assumption (i.e. that the class covariances are identical, so Σ 0 = Σ 1 = Σ {\displaystyle \Sigma _{0}=\Sigma _{1}=\Sigma } ) and that the covariances have full rank. In this case, several terms cancel: x → T Σ 0 − 1 x → = x → T Σ 1 − 1 x → {\displaystyle {\vec {x}}^{\mathrm {T} }\Sigma _{0}^{-1}{\vec {x}}={\vec {x}}^{\mathrm {T} }\Sigma _{1}^{-1}{\vec {x}}} x → T Σ i − 1 μ → i = μ → i T Σ i − 1 x → {\displaystyle {\vec {x}}^{\mathrm {T} }{\Sigma _{i}}^{-1}{\vec {\mu }}_{i}={{\vec {\mu }}_{i}}^{\mathrm {T} }{\Sigma _{i}}^{-1}{\vec {x}}} because both sides are scalar and transpose to each other ( Σ i {\displaystyle \Sigma _{i}} is Hermitian) and the above decision criterion becomes a threshold on the dot product w → T x → > c {\displaystyle {\vec {w}}^{\mathrm {T} }{\vec {x}}>c} for some threshold constant c, where w → = Σ − 1 ( μ → 1 − μ → 0 ) {\displaystyle {\vec {w}}=\Sigma ^{-1}({\vec {\mu }}_{1}-{\vec {\mu }}_{0})} c = 1 2 w → T ( μ → 1 + μ → 0 ) {\displaystyle c={\frac {1}{2}}\,{\vec {w}}^{\mathrm {T} }({\vec {\mu }}_{1}+{\vec {\mu }}_{0})} This means that the criterion of an input x → {\displaystyle {\vec {x}}} being in a class y {\displaystyle y} is purely a function of this linear combination of the known observations. It is often useful to see this conclusion in geometrical terms: the criterion of an input x → {\displaystyle {\vec {x}}} being in a class y {\displaystyle y} is purely a function of projection of multidimensional-space point x → {\displaystyle {\vec {x}}} onto vector w → {\displaystyle {\vec {w}}} (thus, we only consider its direction). In other words, the observation belongs to y {\displaystyle y} if corresponding x → {\displaystyle {\vec {x}}} is located on a certain side of a hyperplane perpendicular to w → {\displaystyle {\vec {w}}} . The location of the plane is defined by the threshold c {\displaystyle c} . == Assumptions == The assumptions of discriminant analysis are the same as those for MANOVA. The analysis is quite sensitive to outliers and the size of the smallest group must be larger than the number of predictor variables. Multivariate normality: Independent variables are normal for each level of the grouping variable. Homogeneity of variance/covariance (homoscedasticity): Variances among group variables are the same across levels of predictors. Can be tested with Box's M statistic. It has been suggested, however, that linear discriminant analysis be used when covariances are equal, and that quadratic discriminant analysis may be used when covariances are not equal. Independence: Participants are assumed to be randomly sampled, and a participant's score on one variable is assumed to be independent of scores on that variable for all other participants. It has been suggested that discriminant analysis is relatively robust to slight violations of these assumptions, and it has also been shown that discriminant analysis may still be reliable when using dichotomous variables (where multivariate normality is often violated). == Discriminant functions == Discriminant analysis works by creating one or more linear combinations of predictors, creating a new latent variable for each function. These functions are called discriminant functions. The number of functions possible is either N g − 1 {\displaystyle N_{g}-1} where N g {\displaystyle N_{g}} = number of groups, or p {\displaystyle p} (the number of predictors), whichever is smaller. The first function created maximizes the differences between groups on that function. The second function maximizes differences on that function, but also must not be correlated with the previous function. This continues with subsequent functions with the requirement that the new function not be correlated with any of the previous functions. Given group j {\displaystyle j} , with R j {\displaystyle \mathbb {R} _{j}} sets of sample space, there is a discriminant rule such that if x ∈ R j {\displaystyle x\in \mathbb {R} _{j}} , then x ∈ j {\displaystyle x\in j} . Discriminant analysis then, finds “good” regions of R j {\displaystyle \mathbb {R} _{j}} to minimize classification error, therefore leading to a high percent correct classified in the classification table. Each function is given a discriminant score to determine how well it predicts group placement. Structure Corr

    Read more →
  • Transportation Economic Development Impact System

    Transportation Economic Development Impact System

    Transportation Economic Development Impact System (TREDIS) is an economic analysis system sold by consulting firm Economic Development Research Group that is used in planning major transportation investments in the US and Canada. The role of economic impact analysis and TREDIS in the transportation planning process is explained in guidebooks of the US Department of Transportation and the American Association of State Highway and Transportation Officials. TREDIS has been most commonly used for assessing the expected economic impacts of statewide highway programs, regional multi-modal plans and public transport investment. Its history and theoretical foundation are explained in peer reviewed journal articles. == How It Works == TREDIS has a series of modules that calculate different forms of impacts and benefits. One module is an accounting framework that calculates user benefits, including impacts on cargo transportation and commuting costs, based on transportation forecasting results. A second module calculates wider economic development benefits, including impacts on business productivity, economic development and multiplier effects from the input-output analysis. It applies an economic model to estimate impacts on jobs, income, gross regional product and business output, by sector of the economy. A third module applies cost-benefit analysis from alternative perspectives.

    Read more →
  • Genetic operator

    Genetic operator

    A genetic operator is an operator used in evolutionary algorithms (EA) to guide the algorithm towards a solution to a given problem. There are three main types of operators (mutation, crossover and selection), which must work in conjunction with one another in order for the algorithm to be successful. Genetic operators are used to create and maintain genetic diversity (mutation operator), combine existing solutions (also known as chromosomes) into new solutions (crossover) and select between solutions (selection). The classic representatives of evolutionary algorithms include genetic algorithms, evolution strategies, genetic programming and evolutionary programming. In his book discussing the use of genetic programming for the optimization of complex problems, computer scientist John Koza has also identified an 'inversion' or 'permutation' operator; however, the effectiveness of this operator has never been conclusively demonstrated and this operator is rarely discussed in the field of genetic programming. For combinatorial problems, however, these and other operators tailored to permutations are frequently used by other EAs. Mutation (or mutation-like) operators are said to be unary operators, as they only operate on one chromosome at a time. In contrast, crossover operators are said to be binary operators, as they operate on two chromosomes at a time, combining two existing chromosomes into one new chromosome. == Operators == Genetic variation is a necessity for the process of evolution. Genetic operators used in evolutionary algorithms are analogous to those in the natural world: survival of the fittest, or selection; reproduction (crossover, also called recombination); and mutation. === Selection === Selection operators give preference to better candidate solutions (chromosomes), allowing them to pass on their 'genes' to the next generation (iteration) of the algorithm. The best solutions are determined using some form of objective function (also known as a 'fitness function' in evolutionary algorithms), before being passed to the crossover operator. Different methods for choosing the best solutions exist, for example, fitness proportionate selection and tournament selection. A further or the same selection operator is used to determine the individuals for being selected to form the next parental generation. The selection operator may also ensure that the best solution(s) from the current generation always become(s) a member of the next generation without being altered; this is known as elitism or elitist selection. === Crossover === Crossover is the process of taking more than one parent solutions (chromosomes) and producing a child solution from them. By recombining portions of good solutions, the evolutionary algorithm is more likely to create a better solution. As with selection, there are a number of different methods for combining the parent solutions, including the edge recombination operator (ERO) and the 'cut and splice crossover' and 'uniform crossover' methods. The crossover method is often chosen to closely match the chromosome's representation of the solution; this may become particularly important when variables are grouped together as building blocks, which might be disrupted by a non-respectful crossover operator. Similarly, crossover methods may be particularly suited to certain problems; the ERO is considered a good option for solving the travelling salesman problem. === Mutation === The mutation operator encourages genetic diversity amongst solutions and attempts to prevent the evolutionary algorithm converging to a local minimum by stopping the solutions becoming too close to one another. In mutating the current pool of solutions, a given solution may change between slightly and entirely from the previous solution. By mutating the solutions, an evolutionary algorithm can reach an improved solution solely through the mutation operator. Again, different methods of mutation may be used; these range from a simple bit mutation (flipping random bits in a binary string chromosome with some low probability) to more complex mutation methods in which genes in the solution are changed, for example by adding a random value from the Gaussian distribution to the current gene value. As with the crossover operator, the mutation method is usually chosen to match the representation of the solution within the chromosome. == Combining operators == While each operator acts to improve the solutions produced by the evolutionary algorithm working individually, the operators must work in conjunction with each other for the algorithm to be successful in finding a good solution. Using the selection operator on its own will tend to fill the solution population with copies of the best solution from the population. If the selection and crossover operators are used without the mutation operator, the algorithm will tend to converge to a local minimum, that is, a good but sub-optimal solution to the problem. Using the mutation operator on its own leads to a random walk through the search space. Only by using all three operators together can the evolutionary algorithm become a noise-tolerant global search algorithm, yielding good solutions to the problem at hand.

    Read more →
  • Nearest neighbor search

    Nearest neighbor search

    Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values. Formally, the nearest neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M {\displaystyle q\in M} , find the closest point in S to q. Donald Knuth in volume 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is a k-NN search, where we need to find the k closest points. Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more common, M is taken to be the d-dimensional vector space where dissimilarity is measured using the Euclidean distance, Manhattan distance or other distance metric. However, the dissimilarity function can be arbitrary. One example is asymmetric Bregman divergence, for which the triangle inequality does not hold. == Applications == The nearest neighbor search problem arises in numerous fields of application, including: Pattern recognition – in particular for optical character recognition Statistical classification – see k-nearest neighbor algorithm Computer vision – for point cloud registration Computational geometry – see Closest pair of points problem Cryptanalysis – for lattice problem Databases – e.g. content-based image retrieval Coding theory – see maximum likelihood decoding Semantic search Vector databases, where nearest-neighbor lookup over embeddings is used to retrieve semantically similar records Retrieval-augmented generation systems, where nearest-neighbor retrieval over embeddings is used to fetch candidate passages or documents before generation Data compression – see MPEG-2 standard Robotic sensing Recommendation systems, e.g. see Collaborative filtering Internet marketing – see contextual advertising and behavioral targeting DNA sequencing Spell checking – suggesting correct spelling Plagiarism detection Similarity scores for predicting career paths of professional athletes. Cluster analysis – assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance Chemical similarity Sampling-based motion planning == Methods == Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time. === Exact methods === ==== Linear search ==== The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces. The absolute distance is not required for distance comparison, only the relative distance. In geometric coordinate systems the distance calculation can be sped up considerably by omitting the square root calculation from the distance calculation between two coordinates. The distance comparison will still yield identical results. ==== Space partitioning ==== Since the 1970s, the branch and bound methodology has been applied to the problem. In the case of Euclidean space, this approach encompasses spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) in the case of randomly distributed points, worst case complexity is O(kN^(1-1/k)) Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions such as the R tree. R-trees can yield nearest neighbors not only for Euclidean distance, but can also be used with other distances. In the case of general metric space, the branch-and-bound approach is known as the metric tree approach. Particular examples include vp-tree and BK-tree methods. Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result. === Approximation methods === An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most c {\displaystyle c} times the distance from the query to its nearest points. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. ==== Greedy search in proximity neighborhood graphs ==== Proximity graph methods (such as navigable small world graphs and HNSW) are considered the current state-of-the-art for the approximate nearest neighbors search. The methods are based on greedy traversing in proximity neighborhood graphs G ( V , E ) {\displaystyle G(V,E)} in which every point x i ∈ S {\displaystyle x_{i}\in S} is uniquely associated with vertex v i ∈ V {\displaystyle v_{i}\in V} . The search for the nearest neighbors to a query q in the set S takes the form of searching for the vertex in the graph G ( V , E ) {\displaystyle G(V,E)} . The basic algorithm – greedy search – works as follows: search starts from an enter-point vertex v i ∈ V {\displaystyle v_{i}\in V} by computing the distances from the query q to each vertex of its neighborhood { v j : ( v i , v j ) ∈ E } {\displaystyle \{v_{j}:(v_{i},v_{j})\in E\}} , and then finds a vertex with the minimal distance value. If the distance value between the query and the selected vertex is smaller than the one between the query and the current element, then the algorithm moves to the selected vertex, and it becomes new enter-point. The algorithm stops when it reaches a local minimum: a vertex whose neighborhood does not contain a vertex that is closer to the query than the vertex itself. The idea of proximity neighborhood graphs was exploited in multiple publications, including the seminal paper by Arya and Mount, in the VoroNet syst

    Read more →
  • Promoter based genetic algorithm

    Promoter based genetic algorithm

    The promoter based genetic algorithm (PBGA) is a genetic algorithm for neuroevolution developed by F. Bellas and R.J. Duro in the Integrated Group for Engineering Research (GII) at the University of Coruña, in Spain. It evolves variable size feedforward artificial neural networks (ANN) that are encoded into sequences of genes for constructing a basic ANN unit. Each of these blocks is preceded by a gene promoter acting as an on/off switch that determines if that particular unit will be expressed or not. == PBGA basics == The basic unit in the PBGA is a neuron with all of its inbound connections as represented in the following figure: The genotype of a basic unit is a set of real valued weights followed by the parameters of the neuron and proceeded by an integer valued field that determines the promoter gene value and, consequently, the expression of the unit. By concatenating units of this type we can construct the whole network. With this encoding it is imposed that the information that is not expressed is still carried by the genotype in evolution but it is shielded from direct selective pressure, maintaining this way the diversity in the population, which has been a design premise for this algorithm. Therefore, a clear difference is established between the search space and the solution space, permitting information learned and encoded into the genotypic representation to be preserved by disabling promoter genes. == Results == The PBGA was originally presented within the field of autonomous robotics, in particular in the real time learning of environment models of the robot. It has been used inside the Multilevel Darwinist Brain (MDB) cognitive mechanism developed in the GII for real robots on-line learning. In another paper it is shown how the application of the PBGA together with an external memory that stores the successful obtained world models, is an optimal strategy for adaptation in dynamic environments. Recently, the PBGA has provided results that outperform other neuroevolutionary algorithms in non-stationary problems, where the fitness function varies in time.

    Read more →
  • JAX (software)

    JAX (software)

    JAX is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning. It is developed by Google with contributions from Nvidia and other community contributors. It is described as bringing together a modified version of the automatic differentiation system autograd and OpenXLA's XLA (Accelerated Linear Algebra). It is designed to follow the structure and workflow of NumPy as closely as possible and works with various existing frameworks such as TensorFlow and PyTorch. The primary features of JAX are: Providing a unified NumPy-like interface to computations that run on CPU, GPU, or TPU, in local or distributed settings. Built-in Just-In-Time (JIT) compilation via OpenXLA, an open-source machine learning compiler ecosystem. Efficient evaluation of gradients via its automatic differentiation transformations. Automatic vectorization to efficiently map functions over arrays representing batches of inputs. == Libraries using Jax == Flax Equinox Optax

    Read more →
  • Targeted maximum likelihood estimation

    Targeted maximum likelihood estimation

    Targeted Maximum Likelihood Estimation (TMLE) (also more accurately referred to as Targeted Minimum Loss-Based Estimation) is a general statistical estimation framework for causal inference and semiparametric models. TMLE combines ideas from maximum likelihood estimation, semiparametric efficiency theory, and machine learning. It was introduced by Mark J. van der Laan and colleagues in the mid-2000s as a method that yields asymptotically efficient plug-in estimators while allowing the use of flexible, data-adaptive algorithms such as ensemble machine learning for nuisance parameter estimation. TMLE is used in epidemiology, biostatistics, and the social sciences to estimate causal effects in observational and experimental studies. Applications of TMLE include Longitudinal TMLE (LTMLE) for time-varying treatments and confounders. Variations in how the targeting step in TMLE is carried out have resulted in various versions of TMLE such as Collaborative TMLE (CTMLE) and Adaptive TMLE for improved finite-sample performance and automated variable selection. == History == The TMLE framework was first described by van der Laan and Rubin (2006) as a general approach for the construction of efficient plug-in estimators of smooth features of the data density. It was demonstrated in the context of causal inference and missing data problems. It was developed to address limitations of traditional doubly robust methods, such as Augmented Inverse Probability Weighting (AIPW), by respecting the plug-in principle in the sense that it respects that the target parameter is a function of the data density that is an element of the statistical model. TMLE estimates the data density or relevant parts of it with machine learning and targets these machine learning fits before it is plugged in the target parameter mapping. In this manner, a TMLE always respects global knowledge and satisfies known bounds such as that the target parameter is a probability . Since its introduction, TMLE has been developed in a series of theoretical and applied papers, culminating in book-length treatments of the method and its applications to survival analysis, adaptive designs, and longitudinal data. == Methodology == At its core, TMLE is a two-step estimation procedure: Initial estimation: Machine learning methods (such as the Super Learner ensemble) are used to obtain flexible estimates of nuisance parameters, such as outcome regressions and propensity scores. Targeting step: The initial estimate is updated by solving a score equation (the efficient influence function) so that the final estimator is consistent, asymptotically normal, and efficient under mild regularity conditions. The targeted machine learning fit is then mapped into the corresponding estimator of the target parameter by simply plugging it in the target parameter mapping. This approach balances the bias–variance trade-off by combining data-adaptive estimation with semiparametric efficiency theory. TMLE is doubly robust, meaning it remains consistent if either the outcome model or the treatment model is consistently estimated. === Formula === Here we explain the TMLE of the average treatment effect of a binary treatment on an outcome adjusting for baseline covariates. Consider i.i.d. observations O i = ( W i , A i , Y i ) {\displaystyle O_{i}=(W_{i},A_{i},Y_{i})} from a distribution P 0 {\displaystyle P_{0}} , where W {\displaystyle W} are baseline covariates, A {\displaystyle A} is a binary treatment, and Y {\displaystyle Y} is an outcome. Let Q ¯ ( a , w ) = E [ Y ∣ A = a , W = w ] {\displaystyle {\bar {Q}}(a,w)=\mathbb {E} [Y\mid A=a,W=w]} represent the outcome model and g ( a ∣ w ) = P ( A = a ∣ W = w ) {\displaystyle g(a\mid w)=P(A=a\mid W=w)} represent the propensity score. The average treatment effect (ATE) is given by ψ 0 = E { Q ¯ ( 1 , W ) − Q ¯ ( 0 , W ) } . {\displaystyle \psi _{0}=\mathbb {E} \{{\bar {Q}}(1,W)-{\bar {Q}}(0,W)\}.} A basic TMLE for the ATE proceeds as follows: Step 1: Estimate initial models. Obtain estimates Q ¯ ^ ( a , w ) {\displaystyle {\hat {\bar {Q}}}(a,w)} and g ^ ( a ∣ w ) {\displaystyle {\hat {g}}(a\mid w)} , often using flexible methods such as Super Learner. Step 2: Compute the clever covariate. Define: H ( A , W ) = A g ^ ( 1 ∣ W ) − 1 − A g ^ ( 0 ∣ W ) . {\displaystyle H(A,W)={\frac {A}{{\hat {g}}(1\mid W)}}-{\frac {1-A}{{\hat {g}}(0\mid W)}}.} Step 3: Estimate the fluctuation parameter. Fit a logistic regression of Y {\displaystyle Y} on H ( A , W ) {\displaystyle H(A,W)} with logit ⁡ ( Q ¯ ^ ( A , W ) ) {\displaystyle \operatorname {logit} ({\hat {\bar {Q}}}(A,W))} as offset. This yields ε ^ {\displaystyle {\hat {\varepsilon }}} , the MLE that solves the score equation: 1 n ∑ i = 1 n H ( A i , W i ) { Y i − Q ¯ ^ ε ( A i , W i ) } = 0. {\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}H(A_{i},W_{i}){\big \{}Y_{i}-{\hat {\bar {Q}}}^{\varepsilon }(A_{i},W_{i}){\big \}}=0.} Step 4: Update the initial estimate. Apply the "blip" to obtain the targeted estimate: Q ¯ ^ ∗ ( A , W ) = expit ⁡ ( logit ⁡ ( Q ¯ ^ ( A , W ) ) + ε ^ H ( A , W ) ) . {\displaystyle {\hat {\bar {Q}}}^{}(A,W)=\operatorname {expit} {\Big (}\operatorname {logit} {\big (}{\hat {\bar {Q}}}(A,W){\big )}+{\hat {\varepsilon }}\,H(A,W){\Big )}.} Step 5: Compute the TMLE. The ATE estimate is: ψ ^ TMLE = 1 n ∑ i = 1 n [ Q ¯ ^ ∗ ( 1 , W i ) − Q ¯ ^ ∗ ( 0 , W i ) ] . {\displaystyle {\hat {\psi }}_{\text{TMLE}}={\frac {1}{n}}\sum _{i=1}^{n}{\big [}{\hat {\bar {Q}}}^{}(1,W_{i})-{\hat {\bar {Q}}}^{}(0,W_{i}){\big ]}.} Inference. The efficient influence function (EIF) for the ATE is: D ∗ ( O ) = H ( A , W ) { Y − Q ¯ ∗ ( A , W ) } + Q ¯ ∗ ( 1 , W ) − Q ¯ ∗ ( 0 , W ) − ψ . {\displaystyle D^{}(O)=H(A,W)\{Y-{\bar {Q}}^{}(A,W)\}+{\bar {Q}}^{}(1,W)-{\bar {Q}}^{}(0,W)-\psi .} The variance is estimated by σ ^ 2 = n − 1 ∑ i = 1 n ( D ∗ ( O i ) ) 2 {\displaystyle {\hat {\sigma }}^{2}=n^{-1}\sum _{i=1}^{n}{\big (}D^{}(O_{i}){\big )}^{2}} , yielding Wald-type confidence intervals ψ ^ TMLE ± z 1 − α / 2 σ ^ / n {\displaystyle {\hat {\psi }}_{\text{TMLE}}\pm z_{1-\alpha /2}\,{\hat {\sigma }}/{\sqrt {n}}} . Remark. For continuous outcomes, a linear fluctuation Q ¯ ^ ∗ = Q ¯ ^ + ε ^ H {\displaystyle {\hat {\bar {Q}}}^{}={\hat {\bar {Q}}}+{\hat {\varepsilon }}\,H} may be used instead. For bounded continuous outcomes, the logistic fluctuation (after rescaling Y {\displaystyle Y} to [ 0 , 1 ] {\displaystyle [0,1]} ) is often preferred for improved finite-sample performance. == Applications == TMLE has been applied in: Epidemiology: Estimating causal effects of exposures and interventions in observational cohort studies. Clinical trials and real-world evidence: The Targeted Learning roadmap provides a structured framework for generating and validating real-world evidence (RWE), bridging randomized trials and observational data using TMLE and related estimation techniques. This approach enables transparency, sensitivity analysis, and stronger causal inference for regulatory and clinical trial contexts. High-dimensional settings: Integration with ensemble methods for causal effect estimation. TMLE has been successfully applied in pharmacoepidemiology where a large number of covariates are automatically selected to adjust for confounding. In a study of post–myocardial infarction statin use and 1-year mortality, TMLE demonstrated robust performance relative to inverse probability weighting in scenarios with hundreds of potential confounders. == Derivatives and extensions == Longitudinal TMLE (LTMLE): A methodological extension of TMLE for longitudinal data with time-varying treatments, confounders, and censoring. It allows the estimation of dynamic treatment regimes and intervention-specific causal effects over time. This framework was originally introduced by van der Laan & Gruber (2012). Collaborative TMLE (CTMLE): Enhances finite-sample performance and variable selection by collaboratively fitting the treatment mechanism in conjunction with the target parameter. == Software == Several R packages implement TMLE and related methods: tmle: Functions for binary, categorical, and continuous outcomes. ltmle: Implementation for longitudinal data with time-varying treatments and outcomes. ctmle: Algorithms for collaborative TMLE and adaptive variable selection. SuperLearner: A theoretically grounded, cross-validated ensemble learning method that combines predictions from multiple algorithms to minimize predictive risk. Widely used in TMLE for estimating nuisance parameters. The original implementation is available as the R package SuperLearner. Recent machine learning platforms like H2O AutoML implement similar ensemble strategies, combining diverse learners in parallel and leveraging stacking and blending techniques, effectively functioning as a large-scale Super Learner.

    Read more →
  • Apache Giraph

    Apache Giraph

    Apache Giraph is an Apache project to perform graph processing on big data. Giraph utilizes Apache Hadoop's MapReduce implementation to process graphs. Facebook used Giraph with some performance improvements to analyze one trillion edges using 200 machines in 4 minutes. Giraph is based on a paper published by Google about its own graph processing system called Pregel. It can be compared to other Big Graph processing libraries such as Cassovary. As of September 2023, it is no longer actively developed.

    Read more →
  • Dendrogram

    Dendrogram

    A dendrogram is a diagram representing a tree graph. This diagrammatic representation is frequently used in different contexts: in hierarchical clustering, it illustrates the arrangement of the clusters produced by the corresponding analyses. in computational biology, it shows the clustering of genes or samples, sometimes in the margins of heatmaps. in phylogenetics, it displays the evolutionary relationships among various biological taxa. In this case, the dendrogram is also called a phylogenetic tree. The name dendrogram derives from the two ancient greek words δένδρον (déndron), meaning "tree", and γράμμα (grámma), meaning "drawing, mathematical figure". == Clustering example == For a clustering example, suppose that five taxa ( a {\displaystyle a} to e {\displaystyle e} ) have been clustered by UPGMA based on a matrix of genetic distances. The hierarchical clustering dendrogram would show a column of five nodes representing the initial data (here individual taxa), and the remaining nodes represent the clusters to which the data belong, with the arrows representing the distance (dissimilarity). The distance between merged clusters is monotone, increasing with the level of the merger: the height of each node in the plot is proportional to the value of the intergroup dissimilarity between its two daughters (the nodes on the right representing individual observations all plotted at zero height).

    Read more →
  • Caffe (software)

    Caffe (software)

    Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning framework, originally developed at University of California, Berkeley. It is open source, under a BSD license. It is written in C++, with a Python interface. == History == Yangqing Jia created the Caffe project during his PhD at UC Berkeley, while working the lab of Trevor Darrell. The first version, called "DeCAF", made its first appearance in Spring 2013 when it was used for the ILSVRC challenge (later called ImageNet). The library was named Caffe and released to the public in December 2013. It reached end-of-support in 2018. It is hosted on GitHub. == Features == Caffe supports many different types of deep learning architectures geared towards image classification and image segmentation. It supports CNN, RCNN, LSTM and fully-connected neural network designs. Caffe supports GPU- and CPU-based acceleration computational kernel libraries such as Nvidia cuDNN and Intel MKL. == Applications == Caffe is being used in academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia. Yahoo! has also integrated Caffe with Apache Spark to create CaffeOnSpark, a distributed deep learning framework. == Caffe2 == In April 2017, Facebook announced Caffe2, which included new features such as recurrent neural network (RNN). At the end of March 2018, Caffe2 was merged into PyTorch.

    Read more →
  • Q-learning

    Q-learning

    Q-learning is a reinforcement learning algorithm that trains an agent to assign values to its possible actions based on its current state, without requiring a model of the environment (model-free). It can handle problems with stochastic transitions and rewards without requiring adaptations. For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, Q-learning might assign a higher value to moving right than left if right gets to the exit faster, improving this choice by trying both directions over time. For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given finite Markov decision process, given infinite exploration time and a partly random policy. "Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state. == Reinforcement learning == Reinforcement learning involves an agent, a set of states S {\displaystyle {\mathcal {S}}} , and a set A {\displaystyle {\mathcal {A}}} of actions per state. By performing an action a ∈ A {\displaystyle a\in {\mathcal {A}}} , the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state. As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as they open, minimizing the initial wait time for yourself. If the train is crowded, however, then you will have a slow entry after the initial action of entering the door as people are fighting you to depart the train as you attempt to board. The total boarding time, or cost, is then: 0 seconds wait time + 15 seconds fight time On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now: 5 second wait time + 0 second fight time Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy. == Algorithm == After Δ t {\displaystyle \Delta t} steps into the future the agent will decide some next step. The weight for this step is calculated as γ Δ t {\displaystyle \gamma ^{\Delta t}} , where γ {\displaystyle \gamma } (the discount factor) is a number between 0 and 1 ( 0 ≤ γ ≤ 1 {\displaystyle 0\leq \gamma \leq 1} ). Assuming γ < 1 {\displaystyle \gamma <1} , it has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). γ {\displaystyle \gamma } may also be interpreted as the probability to succeed (or survive) at every step Δ t {\displaystyle \Delta t} . The algorithm, therefore, has a function that calculates the quality of a state–action combination: Q : S × A → R {\displaystyle Q:{\mathcal {S}}\times {\mathcal {A}}\to \mathbb {R} } . Before learning begins, ⁠ Q {\displaystyle Q} ⁠ is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t {\displaystyle t} the agent selects an action A t {\displaystyle A_{t}} , observes a reward R t + 1 {\displaystyle R_{t+1}} , enters a new state S t + 1 {\displaystyle S_{t+1}} (that may depend on both the previous state S t {\displaystyle S_{t}} and the selected action), and Q {\displaystyle Q} is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information: Q n e w ( S t , A t ) ← ( 1 − α ⏟ learning rate ) ⋅ Q ( S t , A t ) ⏟ current value + α ⏟ learning rate ⋅ ( R t + 1 ⏟ reward + γ ⏟ discount factor ⋅ max a Q ( S t + 1 , a ) ⏟ estimate of optimal future value ⏟ new value (temporal difference target) ) {\displaystyle Q^{new}(S_{t},A_{t})\leftarrow (1-\underbrace {\alpha } _{\text{learning rate}})\cdot \underbrace {Q(S_{t},A_{t})} _{\text{current value}}+\underbrace {\alpha } _{\text{learning rate}}\cdot {\bigg (}\underbrace {\underbrace {R_{t+1}} _{\text{reward}}+\underbrace {\gamma } _{\text{discount factor}}\cdot \underbrace {\max _{a}Q(S_{t+1},a)} _{\text{estimate of optimal future value}}} _{\text{new value (temporal difference target)}}{\bigg )}} where R t + 1 {\displaystyle R_{t+1}} is the reward received when moving from the state S t {\displaystyle S_{t}} to the state S t + 1 {\displaystyle S_{t+1}} , and α {\displaystyle \alpha } is the learning rate ( 0 < α ≤ 1 ) {\displaystyle (0<\alpha \leq 1)} . Note that Q n e w ( S t , A t ) {\displaystyle Q^{new}(S_{t},A_{t})} is the sum of three terms: ( 1 − α ) Q ( S t , A t ) {\displaystyle (1-\alpha )Q(S_{t},A_{t})} : the current value (weighted by one minus the learning rate) α R t + 1 {\displaystyle \alpha \,R_{t+1}} : the reward R t + 1 {\displaystyle R_{t+1}} to obtain if action A t {\displaystyle A_{t}} is taken when in state S t {\displaystyle S_{t}} (weighted by learning rate) α γ max a Q ( S t + 1 , a ) {\displaystyle \alpha \gamma \max _{a}Q(S_{t+1},a)} : the maximum reward that can be obtained from state S t + 1 {\displaystyle S_{t+1}} (weighted by learning rate and discount factor) An episode of the algorithm ends when state S t + 1 {\displaystyle S_{t+1}} is a final or terminal state. However, Q-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops or paths. For all final states s f {\displaystyle s_{f}} , Q ( s f , a ) {\displaystyle Q(s_{f},a)} is never updated, but is set to the reward value r {\displaystyle r} observed for state s f {\displaystyle s_{f}} . In most cases, Q ( s f , a ) {\displaystyle Q(s_{f},a)} can be taken to equal zero. == Influence of variables == === Learning rate === The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of α t = 1 {\displaystyle \alpha _{t}=1} is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as α t = 0.1 {\displaystyle \alpha _{t}=0.1} for all t {\displaystyle t} . === Discount factor === The discount factor ⁠ γ {\displaystyle \gamma } ⁠ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. r t {\displaystyle r_{t}} (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For ⁠ γ = 1 {\displaystyle \gamma =1} ⁠, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning. === Initial conditions (Q0) === Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternative, thus increasing their choice probability. The first reward r {\displaystyle r} can be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value

    Read more →
  • Probably approximately correct learning

    Probably approximately correct learning

    In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples). An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds). == Definitions and terminology == In order to give the definition for something that is PAC-learnable, we first have to introduce some terminology. For the following definitions, two examples will be used. The first is the problem of character recognition given an array of n {\displaystyle n} bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and the points outside of the range as negative. Let X {\displaystyle X} be a set called the instance space or the encoding of all the samples. In the character recognition problem, the instance space is X = { 0 , 1 } n {\displaystyle X=\{0,1\}^{n}} . In the interval problem the instance space, X {\displaystyle X} , is the set of all bounded intervals in R {\displaystyle \mathbb {R} } , where R {\displaystyle \mathbb {R} } denotes the set of all real numbers. A concept is a subset c ⊂ X {\displaystyle c\subset X} . One concept is the set of all patterns of bits in X = { 0 , 1 } n {\displaystyle X=\{0,1\}^{n}} that encode a picture of the letter "P". An example concept from the second example is the set of open intervals, { ( a , b ) ∣ 0 ≤ a ≤ π / 2 , π ≤ b ≤ 13 } {\displaystyle \{(a,b)\mid 0\leq a\leq \pi /2,\pi \leq b\leq {\sqrt {13}}\}} , each of which contains only the positive points. A concept class C {\displaystyle C} is a collection of concepts over X {\displaystyle X} . This could be the set of all subsets of the array of bits that are skeletonized 4-connected (width of the font is 1). Let EX ⁡ ( c , D ) {\displaystyle \operatorname {EX} (c,D)} be a procedure that draws an example, x {\displaystyle x} , using a probability distribution D {\displaystyle D} and gives the correct label c ( x ) {\displaystyle c(x)} , that is 1 if x ∈ c {\displaystyle x\in c} and 0 otherwise. Now, given 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} , assume there is an algorithm A {\displaystyle A} and a polynomial p {\displaystyle p} in 1 / ϵ , 1 / δ {\displaystyle 1/\epsilon ,1/\delta } (and other relevant parameters of the class C {\displaystyle C} ) such that, given a sample of size p {\displaystyle p} drawn according to EX ⁡ ( c , D ) {\displaystyle \operatorname {EX} (c,D)} , then, with probability of at least 1 − δ {\displaystyle 1-\delta } , A {\displaystyle A} outputs a hypothesis h ∈ C {\displaystyle h\in C} that has an average error less than or equal to ϵ {\displaystyle \epsilon } on X {\displaystyle X} with the same distribution D {\displaystyle D} . Further if the above statement for algorithm A {\displaystyle A} is true for every concept c ∈ C {\displaystyle c\in C} and for every distribution D {\displaystyle D} over X {\displaystyle X} , and for all 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} then C {\displaystyle C} is (efficiently) PAC learnable (or distribution-free PAC learnable). We can also say that A {\displaystyle A} is a PAC learning algorithm for C {\displaystyle C} . == Equivalence == Under some regularity conditions these conditions are equivalent: The concept class C is PAC learnable. The VC dimension of C is finite. C is a uniformly Glivenko-Cantelli class. C is compressible in the sense of Littlestone and Warmuth

    Read more →