In theoretical computer science, the separating words problem is the problem of finding the smallest deterministic finite automaton that behaves differently on two given strings, meaning that it accepts one of the two strings and rejects the other string. It is an open problem how large such an automaton must be, in the worst case, as a function of the length of the input strings. == Example == The two strings 0010 and 1000 may be distinguished from each other by a three-state automaton in which the transitions from the start state go to two different states, both of which are terminal in the sense that subsequent transitions from these two states always return to the same state. The state of this automaton records the first symbol of the input string. If one of the two terminal states is accepting and the other is rejecting, then the automaton will accept only one of the strings 0010 and 1000. However, these two strings cannot be distinguished by any automaton with fewer than three states. == Simplifying assumptions == For proving bounds on this problem, it may be assumed without loss of generality that the inputs are strings over a two-letter alphabet. For, if two strings over a larger alphabet differ then there exists a string homomorphism that maps them to binary strings of the same length that also differ. Any automaton that distinguishes the binary strings can be translated into an automaton that distinguishes the original strings, without any increase in the number of states. It may also be assumed that the two strings have equal length. For strings of unequal length, there always exists a prime number p whose value is logarithmic in the smaller of the two input lengths, such that the two lengths are different modulo p. An automaton that counts the length of its input modulo p can be used to distinguish the two strings from each other in this case. Therefore, strings of unequal lengths can always be distinguished from each other by automata with few states. == History and bounds == The problem of bounding the size of an automaton that distinguishes two given strings was first formulated by Goralčík & Koubek (1986), who showed that the automaton size is always sublinear. Later, Robson (1989) proved the upper bound O(n2/5(log n)3/5) on the automaton size that may be required. This was improved by Chase (2020) to O(n1/3(log n)7). There exist pairs of inputs that are both binary strings of length n for which any automaton that distinguishes the inputs must have size Ω(log n). Closing the gap between this lower bound and Chase's upper bound remains an open problem. Jeffrey Shallit has offered a prize of 100 British pounds for any improvement to Robson's upper bound. == Special cases == Several special cases of the separating words problem are known to be solvable using few states: If two binary words have differing numbers of zeros or ones, then they can be distinguished from each other by counting their Hamming weights modulo a prime of logarithmic size, using a logarithmic number of states. More generally, if a pattern of length k appears a different number of times in the two words, they can be distinguished from each other using O(k log n) states. If two binary words differ from each other within their first or last k positions, they can be distinguished from each other using k + O(1) states. This implies that almost all pairs of binary words can be distinguished from each other with a logarithmic number of states, because only a polynomially small fraction of pairs have no difference in their initial O(log n) positions. If two binary words have Hamming distance d, then there exists a prime p with p = O(d log n) and a position i at which the two strings differ, such that i is not equal modulo p to the position of any other difference. By computing the parity of the input symbols at positions congruent to i modulo p, it is possible to distinguish the words using an automaton with O(d log n) states.
List of large language models
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text. == List == For the training cost column, 1 petaFLOP-day equals 1 petaFLOP/sec × 1 day, or 8.64×1019 FLOP (floating point operations). Only the cost of the largest model is shown. The number of parameters is measured in billions, and the training cost is measured in petaFLOP-days. === 2018 === === 2019 === === 2020 === === 2021 === === 2022 === === 2023 === === 2024 === === 2025 === === 2026 ===
Aleph (ILP)
Aleph (A Learning Engine for Proposing Hypotheses) is an inductive logic programming system introduced by Ashwin Srinivasan in 2001. As of 2022 it is still one of the most widely used inductive logic programming systems. It is based on the earlier system Progol. == Learning task == The input to Aleph is background knowledge, specified as a logic program, a language bias in the form of mode declarations, as well as positive and negative examples specified as ground facts. As output it returns a logic program which, together with the background knowledge, entails all of the positive examples and none of the negative examples. == Basic algorithm == Starting with an empty hypothesis, Aleph proceeds as follows: It chooses a positive example to generalise; if none are left, it aborts and outputs the current hypothesis. Then it constructs the bottom clause, that is, the most specific clause that is allowed by the mode declarations and covers the example. It then searches for a generalisation of the bottom clause that scores better on the chosen metric. It then adds the new clause to the hypothesis program and removes all examples that are covered by the new clause. == Search algorithm == Aleph searches for clauses in a top-down manner, using the bottom clause constructed in the preceding step to bound the search from below. It searches the refinement graph in a breadth-first manner, with tunable parameters to bound the maximal clause size and proof depth. It scores each clause using one of 13 different evaluation metrics, as chosen in advance by the user.
Policy gradient method
Policy gradient methods are a class of reinforcement learning algorithms and a sub-class of policy optimization methods. Unlike value-based methods which learn a value function to derive a policy, policy optimization methods directly learn a policy function π {\displaystyle \pi } that selects actions without consulting a value function. For policy gradient to apply, the policy function π θ {\displaystyle \pi _{\theta }} is parameterized by a differentiable parameter θ {\displaystyle \theta } . == Overview == In policy-based RL, the actor is a parameterized policy function π θ {\displaystyle \pi _{\theta }} , where θ {\displaystyle \theta } are the parameters of the actor. The actor takes as argument the state of the environment s {\displaystyle s} and produces a probability distribution π θ ( ⋅ ∣ s ) {\displaystyle \pi _{\theta }(\cdot \mid s)} . If the action space is discrete, then ∑ a π θ ( a ∣ s ) = 1 {\displaystyle \sum _{a}\pi _{\theta }(a\mid s)=1} . If the action space is continuous, then ∫ a π θ ( a ∣ s ) d a = 1 {\displaystyle \int _{a}\pi _{\theta }(a\mid s)\mathrm {d} a=1} . The goal of policy optimization is to find some θ {\displaystyle \theta } that maximizes the expected episodic reward J ( θ ) {\displaystyle J(\theta )} : J ( θ ) = E π θ [ ∑ t = 0 T γ t R t | S 0 = s 0 ] {\displaystyle J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\gamma ^{t}R_{t}{\Big |}S_{0}=s_{0}\right]} where γ {\displaystyle \gamma } is the discount factor, R t {\displaystyle R_{t}} is the reward at step t {\displaystyle t} , s 0 {\displaystyle s_{0}} is the starting state, and T {\displaystyle T} is the time-horizon (which can be infinite). The policy gradient is defined as ∇ θ J ( θ ) {\displaystyle \nabla _{\theta }J(\theta )} . Different policy gradient methods stochastically estimate the policy gradient in different ways. The goal of any policy gradient method is to iteratively maximize J ( θ ) {\displaystyle J(\theta )} by gradient ascent. Since the key part of any policy gradient method is the stochastic estimation of the policy gradient, they are also studied under the title of "Monte Carlo gradient estimation". == REINFORCE == === Policy gradient === The REINFORCE algorithm, introduced by Ronald J. Williams in 1992, was the first policy gradient method. It is based on the identity for the policy gradient ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln π θ ( A t ∣ S t ) ∑ t = 0 T ( γ t R t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})\;\sum _{t=0}^{T}(\gamma ^{t}R_{t}){\Big |}S_{0}=s_{0}\right]} which can be improved via the "causality trick" ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln π θ ( A t ∣ S t ) ∑ τ = t T ( γ τ R τ ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau }){\Big |}S_{0}=s_{0}\right]} Thus, we have an unbiased estimator of the policy gradient: ∇ θ J ( θ ) ≈ 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ ln π θ ( A t , n ∣ S t , n ) ∑ τ = t T ( γ τ − t R τ , n ) ] {\displaystyle \nabla _{\theta }J(\theta )\approx {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t,n}\mid S_{t,n})\sum _{\tau =t}^{T}(\gamma ^{\tau -t}R_{\tau ,n})\right]} where the index n {\displaystyle n} ranges over N {\displaystyle N} rollout trajectories using the policy π θ {\displaystyle \pi _{\theta }} . The score function ∇ θ ln π θ ( A t ∣ S t ) {\displaystyle \nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})} can be interpreted as the direction in the parameter space that increases the probability of taking action A t {\displaystyle A_{t}} in state S t {\displaystyle S_{t}} . The policy gradient, then, is a weighted average of all possible directions to increase the probability of taking any action in any state, but weighted by reward signals, so that if taking a certain action in a certain state is associated with high reward, then that direction would be highly reinforced, and vice versa. === Algorithm === The REINFORCE algorithm is a loop: Rollout N {\displaystyle N} trajectories in the environment, using π θ t {\displaystyle \pi _{\theta _{t}}} as the policy function. Compute the policy gradient estimation: g i ← 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ t ln π θ ( A t , n ∣ S t , n ) ∑ τ = t T ( γ τ R τ , n ) ] {\displaystyle g_{i}\leftarrow {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta _{t}}\ln \pi _{\theta }(A_{t,n}\mid S_{t,n})\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau ,n})\right]} Update the policy by gradient ascent: θ i + 1 ← θ i + α i g i {\displaystyle \theta _{i+1}\leftarrow \theta _{i}+\alpha _{i}g_{i}} Here, α i {\displaystyle \alpha _{i}} is the learning rate at update step i {\displaystyle i} . == Variance reduction == REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy π θ {\displaystyle \pi _{\theta }} . This can lead to high variance in the updates, as the returns R ( τ ) {\displaystyle R(\tau )} can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of variance reduction. === REINFORCE with baseline === A common way for reducing variance is the REINFORCE with baseline algorithm, based on the following identity: ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln π θ ( A t | S t ) ( ∑ τ = t T ( γ τ R τ ) − b ( S t ) ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })-b(S_{t})\right){\Big |}S_{0}=s_{0}\right]} for any function b : States → R {\displaystyle b:{\text{States}}\to \mathbb {R} } . This can be proven by applying the previous lemma. The algorithm uses the modified gradient estimator g i ← 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ t ln π θ ( A t , n | S t , n ) ( ∑ τ = t T ( γ τ R τ , n ) − b i ( S t , n ) ) ] {\displaystyle g_{i}\leftarrow {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta _{t}}\ln \pi _{\theta }(A_{t,n}|S_{t,n})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau ,n})-b_{i}(S_{t,n})\right)\right]} and the original REINFORCE algorithm is the special case where b i ≡ 0 {\displaystyle b_{i}\equiv 0} . === Actor-critic methods === If b i {\textstyle b_{i}} is chosen well, such that b i ( S t ) ≈ ∑ τ = t T ( γ τ R τ ) = γ t V π θ i ( S t ) {\textstyle b_{i}(S_{t})\approx \sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })=\gamma ^{t}V^{\pi _{\theta _{i}}}(S_{t})} , this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the value function V π θ i ( S t ) {\displaystyle V^{\pi _{\theta _{i}}}(S_{t})} as possible, approaching the ideal of: ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln π θ ( A t | S t ) ( ∑ τ = t T ( γ τ R τ ) − γ t V π θ ( S t ) ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })-\gamma ^{t}V^{\pi _{\theta }}(S_{t})\right){\Big |}S_{0}=s_{0}\right]} Note that, as the policy π θ t {\displaystyle \pi _{\theta _{t}}} updates, the value function V π θ i ( S t ) {\displaystyle V^{\pi _{\theta _{i}}}(S_{t})} updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the actor-critic methods, where the policy function is the actor and the value function is the critic. The Q-function Q π {\displaystyle Q^{\pi }} can also be used as the critic, since ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T γ t ∇ θ ln π θ ( A t | S t ) ⋅ Q π θ ( S t , A t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\sum _{0\leq t\leq T}\gamma ^{t}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\cdot Q^{\pi _{\theta }}(S_{t},A_{t}){\Big |}S_{0}=s_{0}\right]} by a similar argument using the tower law. Subtracting the value function as a baseline, we find that the advantage function A π ( S , A ) = Q π ( S , A ) − V π ( S ) {\displaystyle A^{\pi }(S,A)=Q^{\pi }(S,A)-V^{\pi }(S)} can be used as the critic as well: ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T γ t ∇ θ ln π θ ( A t | S t ) ⋅ A π θ ( S t , A t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\sum _{0\leq t\leq T}\gamma ^{t}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\cdot A^{\pi _{\theta }}(S_{t},A_{t}){\Big |}S_{0}=s_{0}\right]} In summary, there are many unbiased estimators for ∇ θ J θ {\textstyle \nabla _{\theta }J_{\theta }} , all in the form of: ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T ∇ θ ln π θ ( A t | S t ) ⋅ Ψ t | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\su
Multiple correspondence analysis
In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables. == As an extension of correspondence analysis == MCA is performed by applying the CA algorithm to either an indicator matrix (also called complete disjunctive table – CDT) or a Burt table formed from these variables. An indicator matrix is an individuals × variables matrix, where the rows represent individuals and the columns are dummy variables representing categories of the variables. Analyzing the indicator matrix allows the direct representation of individuals as points in geometric space. The Burt table is the symmetric matrix of all two-way cross-tabulations between the categorical variables, and has an analogy to the covariance matrix of continuous variables. Analyzing the Burt table is a more natural generalization of simple correspondence analysis, and individuals or the means of groups of individuals can be added as supplementary points to the graphical display. In the indicator matrix approach, associations between variables are uncovered by calculating the chi-square distance between different categories of the variables and between the individuals (or respondents). These associations are then represented graphically as "maps", which eases the interpretation of the structures in the data. Oppositions between rows and columns are then maximized, in order to uncover the underlying dimensions best able to describe the central oppositions in the data. As in factor analysis or principal component analysis, the first axis is the most important dimension, the second axis the second most important, and so on, in terms of the amount of variance accounted for. The number of axes to be retained for analysis is determined by calculating modified eigenvalues. == Details == Since MCA is adapted to draw statistical conclusions from categorical variables (such as multiple choice questions), the first thing one needs to do is to transform quantitative data (such as age, size, weight, day time, etc) into categories (using for instance statistical quantiles). When the dataset is completely represented as categorical variables, one is able to build the corresponding so-called complete disjunctive table. We denote this table X {\displaystyle X} . If I {\displaystyle I} persons answered a survey with J {\displaystyle J} multiple choices questions with 4 answers each, X {\displaystyle X} will have I {\displaystyle I} rows and 4 J {\displaystyle 4J} columns. More theoretically, assume X {\displaystyle X} is the completely disjunctive table of I {\displaystyle I} observations of K {\displaystyle K} categorical variables. Assume also that the k {\displaystyle k} -th variable have J k {\displaystyle J_{k}} different levels (categories) and set J = ∑ k = 1 K J k {\displaystyle J=\sum _{k=1}^{K}J_{k}} . The table X {\displaystyle X} is then a I × J {\displaystyle I\times J} matrix with all coefficient being 0 {\displaystyle 0} or 1 {\displaystyle 1} . Set the sum of all entries of X {\displaystyle X} to be N {\displaystyle N} and introduce Z = X / N {\displaystyle Z=X/N} . In an MCA, there are also two special vectors: first r {\displaystyle r} , that contains the sums along the rows of Z {\displaystyle Z} , and c {\displaystyle c} , that contains the sums along the columns of Z {\displaystyle Z} . Note D r = diag ( r ) {\displaystyle D_{r}={\text{diag}}(r)} and D c = diag ( c ) {\displaystyle D_{c}={\text{diag}}(c)} , the diagonal matrices containing r {\displaystyle r} and c {\displaystyle c} respectively as diagonal. With these notations, computing an MCA consists essentially in the singular value decomposition of the matrix: M = D r − 1 / 2 ( Z − r c T ) D c − 1 / 2 {\displaystyle M=D_{r}^{-1/2}(Z-rc^{T})D_{c}^{-1/2}} The decomposition of M {\displaystyle M} gives you P {\displaystyle P} , Δ {\displaystyle \Delta } and Q {\displaystyle Q} such that M = P Δ Q T {\displaystyle M=P\Delta Q^{T}} with P, Q two unitary matrices and Δ {\displaystyle \Delta } is the generalized diagonal matrix of the singular values (with the same shape as Z {\displaystyle Z} ). The positive coefficients of Δ 2 {\displaystyle \Delta ^{2}} are the eigenvalues of Z {\displaystyle Z} . The interest of MCA comes from the way observations (rows) and variables (columns) in Z {\displaystyle Z} can be decomposed. This decomposition is called a factor decomposition. The coordinates of the observations in the factor space are given by F = D r − 1 / 2 P Δ {\displaystyle F=D_{r}^{-1/2}P\Delta } The i {\displaystyle i} -th rows of F {\displaystyle F} represent the i {\displaystyle i} -th observation in the factor space. And similarly, the coordinates of the variables (in the same factor space as observations!) are given by G = D c − 1 / 2 Q Δ {\displaystyle G=D_{c}^{-1/2}Q\Delta } == Recent works and extensions == In recent years, several students of Jean-Paul Benzécri have refined MCA and incorporated it into a more general framework of data analysis known as geometric data analysis. This involves the development of direct connections between simple correspondence analysis, principal component analysis and MCA with a form of cluster analysis known as Euclidean classification. Two extensions have great practical use. It is possible to include, as active elements in the MCA, several quantitative variables. This extension is called factor analysis of mixed data (see below). Very often, in questionnaires, the questions are structured in several issues. In the statistical analysis it is necessary to take into account this structure. This is the aim of multiple factor analysis which balances the different issues (i.e. the different groups of variables) within a global analysis and provides, beyond the classical results of factorial analysis (mainly graphics of individuals and of categories), several results (indicators and graphics) specific of the group structure. == Application fields == In the social sciences, MCA is arguably best known for its application by Pierre Bourdieu, notably in his books La Distinction, Homo Academicus and The State Nobility. Bourdieu argued that there was an internal link between his vision of the social as spatial and relational --– captured by the notion of field, and the geometric properties of MCA. Sociologists following Bourdieu's work most often opt for the analysis of the indicator matrix, rather than the Burt table, largely because of the central importance accorded to the analysis of the 'cloud of individuals'. == Multiple correspondence analysis and principal component analysis == MCA can also be viewed as a PCA applied to the complete disjunctive table. To do this, the CDT must be transformed as follows. Let y i k {\displaystyle y_{ik}} denote the general term of the CDT. y i k {\displaystyle y_{ik}} is equal to 1 if individual i {\displaystyle i} possesses the category k {\displaystyle k} and 0 if not. Let denote p k {\displaystyle p_{k}} , the proportion of individuals possessing the category k {\displaystyle k} . The transformed CDT (TCDT) has as general term: x i k = y i k / p k − 1 {\displaystyle x_{ik}=y_{ik}/p_{k}-1} The unstandardized PCA applied to TCDT, the column k {\displaystyle k} having the weight p k {\displaystyle p_{k}} , leads to the results of MCA. This equivalence is fully explained in a book by Jérôme Pagès. It plays an important theoretical role because it opens the way to the simultaneous treatment of quantitative and qualitative variables. Two methods simultaneously analyze these two types of variables: factor analysis of mixed data and, when the active variables are partitioned in several groups: multiple factor analysis. This equivalence does not mean that MCA is a particular case of PCA as it is not a particular case of CA. It only means that these methods are closely linked to one another, as they belong to the same family: the factorial methods. == Software == There are numerous software of data analysis that include MCA, such as STATA and SPSS. The R package FactoMineR also features MCA. This software is related to a book describing the basic methods for performing MCA . There is also a Python package for [1] which works with numpy array matrices; the package has not been implemented yet for Spark dataframes.
PCPaint
PCPaint was one of the first IBM PC-based mouse-driven GUI paint programs, released in 1984. It followed after Microsoft Doodle, released in 1983 with the Microsoft Mouse version 1 drivers for DOS, and around the same time as Digital Research’s Draw program. It was developed and created by John Bridges and Doug Wolfgram. It was later developed into Pictor Paint. The hardware manufacturer Mouse Systems bundled PCPaint with millions of computer mice that they sold, making PCPaint one of the best-selling DOS-based paint programs of the mid 1980s. == History == In 1983, Doug Wolfgram bought a Microsoft Mouse and decided to write a drawing program for it. They named it “Mouse Draw”. The interface was primitive but the program functioned well. Wolfgram traveled to SoftCon in New Orleans where he demonstrated the program to Mouse Systems. Mouse Systems was developing an optical mouse and they wanted to bundle a painting program so they agreed to publish Mouse Draw. The original program was written entirely in assembly language with primitive graphics routines developed by Wolfgram. John Bridges worked for an educational software company, Classroom Consortia Media, Inc., developing and writing Apple and IBM graphics libraries for CCM's software. Bridges and Wolfgram were friends who had been connected through a bulletin board system developed and run by Wolfgram. The two collaborated cross country via the BBS, Wolfram in California and Bridges in New York. Mouse Systems wanted the paint program to capture the look and feel of MacPaint. John Bridges and Doug Wolfgram started reworking Mouse Draw into what became PCPaint. The program was completely re-written using Bridge's graphics library and the top-level elements were written in C rather than assembly language. Bridges developed the core graphics code for the first version of PCPaint while Wolfgram worked on the user interface and top-level code. Mouse Systems signed an exclusive agreement with Wolfgram's company, Microtex Industries, Inc., to bundle PCPaint with every mouse they sold. They began publishing PCPaint with their mice in 1984. Microsoft responded in 1985 by bundling a competing product, PC Paintbrush, with version 4 of its DOS drivers for the Microsoft Mouse, replacing its in-house Microsoft Doodle program which it published with version 1 of the DOS drivers in mid-1983. Microsoft’s mouse began to outsell Mouse Systems mouse. In November 1985 Microsoft bundled a cut-down version of PC Paintbrush with Windows 1.0 (called Microsoft Paint), later bundling an updated version of PC Paintbrush with Windows 3.0 (as Paintbrush), impacting PCPaint’s marketshare. In early 1987, Mouse Systems decided that PCPaint wasn't helping to sell mice any longer so they discontinued the bundle deal and returned rights to the code to MicroTex Industries, but retained rights to the name, PCPaint. Wolfgram then combined the paint program with a new animation system he was developing (called GRASP) and Paul Mace Software bought publishing rights to the animation system and PCPaint, which was to be renamed Pictor. Bridges again got involved and took over programming responsibilities for GRASP as well as PCPaint while Wolfgram focused on more of the business details. In creating the first version of PCPaint, Doug had a dual-floppy machine with a Computer Innovations compiler on one disk and source code on the other. John had the "luxury" of a 10MB hard disk in his XT. Data was exchanged daily via 1200, then 2400 baud modems. === Authorship and Ownership === John Bridges and Wolfgram continued to work on PCPaint and GRASP on behalf of Paul Mace Software until 1990. Also in that year, Doug Wolfgram sold his remaining rights to PCPaint (and its animation system, GRASP) to John Bridges. In 1994, GRASP development stopped and so did development of Pictor Paint. John Bridges terminated his GRASP publishing contract with Paul Mace Software, and went off to create GLPro (the next generation of GRASP) with GMEDIA. Along with GLPro, came GLPaint, the successor to PCPaint and Pictor Paint. == Versions == In June 1984, Mouse Systems shipped PCPaint 1.0, the first GUI based Paint program for the IBM PC family of computers. John Bridges and Doug Wolfgram, were the co-authors of PCPaint 1.0. PCPaint 1.0 saved its graphics in a modified BSaved image format with the extension of ".PIC". The release of PCPaint Version 1.5 followed in late 1984, with the additions of graphics image compression for the .PIC format and support for "larger-than-screen" images. PCjr support was also added in this version after overcoming severe memory shortage problems getting PCPaint to run on the 128k PCjr. October 1985 saw the release of PCPaint 2.0. EGA support and publishing features were added to this version. The .PIC format was further refined, offering support for the rapidly expanding graphics capabilities of the PC and efficient image compression. PCPaint 3.1 was released in 1989. Unlike previous versions, it was not bundled with mice but was sold as a stand-alone software product. PCPaint 3.1 offered improved text and image handling, provided 36 types of flood and fill, worked with VGA adapters in hi-res 16-color and 256-color modes, allowed the user to save and retrieve files in a variety of intercompatible formats (.PIC, .GIF, .PCX, .IMG), and printed selected portions of images on color or black-and-white dot matrix, ink jet, and laser printers such as PostScript and HP Laser Jet. PCPaint 3.1 is still in use today by some users of DOS emulation programs like DOSBox and available for free download. Pictor Paint was an improved version, written by John Bridges, and bundled with GRASP GRaphical System for Presentation also written by John Bridges. It was also called "The Painter's Easel". GLPaint, released in 1995, was the last in this series of paint programs written by John Bridges. By 1998 version 7.0 provided support for TrueColor images and the Pictor PIC format was expanded to handle these. == Pictor PIC Image Format == PCPaint 1.0 saved its graphics in a modified BSAVE image format (which was popular at the time) with the file type (extension) of ".PIC". By PCPaint 1.5 this format was extended further to accommodate image compression. With the release of version 2.0 the PICtor PIC image format was developed almost to its present state, with no similarity to the BSAVE format used by earlier versions. Pictor Paint saved its files in a compressed format with the file extension PIC, which was the same format used by PCPaint.
BrownBoost
BrownBoost is a boosting algorithm that may be robust to noisy datasets. BrownBoost is an adaptive version of the boost by majority algorithm. As is the case for all boosting algorithms, BrownBoost is used in conjunction with other machine learning methods. BrownBoost was introduced by Yoav Freund in 2001. == Motivation == AdaBoost performs well on a variety of datasets; however, it can be shown that AdaBoost does not perform well on noisy data sets. This is a result of AdaBoost's focus on examples that are repeatedly misclassified. In contrast, BrownBoost effectively "gives up" on examples that are repeatedly misclassified. The core assumption of BrownBoost is that noisy examples will be repeatedly mislabeled by the weak hypotheses and non-noisy examples will be correctly labeled frequently enough to not be "given up on." Thus only noisy examples will be "given up on," whereas non-noisy examples will contribute to the final classifier. In turn, if the final classifier is learned from the non-noisy examples, the generalization error of the final classifier may be much better than if learned from noisy and non-noisy examples. The user of the algorithm can set the amount of error to be tolerated in the training set. Thus, if the training set is noisy (say 10% of all examples are assumed to be mislabeled), the booster can be told to accept a 10% error rate. Since the noisy examples may be ignored, only the true examples will contribute to the learning process. == Algorithm description == BrownBoost uses a non-convex potential loss function, thus it does not fit into the AdaBoost framework. The non-convex optimization provides a method to avoid overfitting noisy data sets. However, in contrast to boosting algorithms that analytically minimize a convex loss function (e.g. AdaBoost and LogitBoost), BrownBoost solves a system of two equations and two unknowns using standard numerical methods. The only parameter of BrownBoost ( c {\displaystyle c} in the algorithm) is the "time" the algorithm runs. The theory of BrownBoost states that each hypothesis takes a variable amount of time ( t {\displaystyle t} in the algorithm) which is directly related to the weight given to the hypothesis α {\displaystyle \alpha } . The time parameter in BrownBoost is analogous to the number of iterations T {\displaystyle T} in AdaBoost. A larger value of c {\displaystyle c} means that BrownBoost will treat the data as if it were less noisy and therefore will give up on fewer examples. Conversely, a smaller value of c {\displaystyle c} means that BrownBoost will treat the data as more noisy and give up on more examples. During each iteration of the algorithm, a hypothesis is selected with some advantage over random guessing. The weight of this hypothesis α {\displaystyle \alpha } and the "amount of time passed" t {\displaystyle t} during the iteration are simultaneously solved in a system of two non-linear equations ( 1. uncorrelated hypothesis w.r.t example weights and 2. hold the potential constant) with two unknowns (weight of hypothesis α {\displaystyle \alpha } and time passed t {\displaystyle t} ). This can be solved by bisection (as implemented in the JBoost software package) or Newton's method (as described in the original paper by Freund). Once these equations are solved, the margins of each example ( r i ( x j ) {\displaystyle r_{i}(x_{j})} in the algorithm) and the amount of time remaining s {\displaystyle s} are updated appropriately. This process is repeated until there is no time remaining. The initial potential is defined to be 1 m ∑ j = 1 m 1 − erf ( c ) = 1 − erf ( c ) {\displaystyle {\frac {1}{m}}\sum _{j=1}^{m}1-{\mbox{erf}}({\sqrt {c}})=1-{\mbox{erf}}({\sqrt {c}})} . Since a constraint of each iteration is that the potential be held constant, the final potential is 1 m ∑ j = 1 m 1 − erf ( r i ( x j ) / c ) = 1 − erf ( c ) {\displaystyle {\frac {1}{m}}\sum _{j=1}^{m}1-{\mbox{erf}}(r_{i}(x_{j})/{\sqrt {c}})=1-{\mbox{erf}}({\sqrt {c}})} . Thus the final error is likely to be near 1 − erf ( c ) {\displaystyle 1-{\mbox{erf}}({\sqrt {c}})} . However, the final potential function is not the 0–1 loss error function. For the final error to be exactly 1 − erf ( c ) {\displaystyle 1-{\mbox{erf}}({\sqrt {c}})} , the variance of the loss function must decrease linearly w.r.t. time to form the 0–1 loss function at the end of boosting iterations. This is not yet discussed in the literature and is not in the definition of the algorithm below. The final classifier is a linear combination of weak hypotheses and is evaluated in the same manner as most other boosting algorithms. == BrownBoost learning algorithm definition == Input: m {\displaystyle m} training examples ( x 1 , y 1 ) , … , ( x m , y m ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{m},y_{m})} where x j ∈ X , y j ∈ Y = { − 1 , + 1 } {\displaystyle x_{j}\in X,\,y_{j}\in Y=\{-1,+1\}} The parameter c {\displaystyle c} Initialise: s = c {\displaystyle s=c} . (The value of s {\displaystyle s} is the amount of time remaining in the game) r i ( x j ) = 0 {\displaystyle r_{i}(x_{j})=0} ∀ j {\displaystyle \forall j} . The value of r i ( x j ) {\displaystyle r_{i}(x_{j})} is the margin at iteration i {\displaystyle i} for example x j {\displaystyle x_{j}} . While s > 0 {\displaystyle s>0} : Set the weights of each example: W i ( x j ) = e − ( r i ( x j ) + s ) 2 c {\displaystyle W_{i}(x_{j})=e^{-{\frac {(r_{i}(x_{j})+s)^{2}}{c}}}} , where r i ( x j ) {\displaystyle r_{i}(x_{j})} is the margin of example x j {\displaystyle x_{j}} Find a classifier h i : X → { − 1 , + 1 } {\displaystyle h_{i}:X\to \{-1,+1\}} such that ∑ j W i ( x j ) h i ( x j ) y j > 0 {\displaystyle \sum _{j}W_{i}(x_{j})h_{i}(x_{j})y_{j}>0} Find values α , t {\displaystyle \alpha ,t} that satisfy the equation: ∑ j h i ( x j ) y j e − ( r i ( x j ) + α h i ( x j ) y j + s − t ) 2 c = 0 {\displaystyle \sum _{j}h_{i}(x_{j})y_{j}e^{-{\frac {(r_{i}(x_{j})+\alpha h_{i}(x_{j})y_{j}+s-t)^{2}}{c}}}=0} . (Note this is similar to the condition E W i + 1 [ h i ( x j ) y j ] = 0 {\displaystyle E_{W_{i+1}}[h_{i}(x_{j})y_{j}]=0} set forth by Schapire and Singer. In this setting, we are numerically finding the W i + 1 = exp ( ⋯ ⋯ ) {\displaystyle W_{i+1}=\exp \left({\frac {\cdots }{\cdots }}\right)} such that E W i + 1 [ h i ( x j ) y j ] = 0 {\displaystyle E_{W_{i+1}}[h_{i}(x_{j})y_{j}]=0} .) This update is subject to the constraint ∑ ( Φ ( r i ( x j ) + α h ( x j ) y j + s − t ) − Φ ( r i ( x j ) + s ) ) = 0 {\displaystyle \sum \left(\Phi \left(r_{i}(x_{j})+\alpha h(x_{j})y_{j}+s-t\right)-\Phi \left(r_{i}(x_{j})+s\right)\right)=0} , where Φ ( z ) = 1 − erf ( z / c ) {\displaystyle \Phi (z)=1-{\mbox{erf}}(z/{\sqrt {c}})} is the potential loss for a point with margin r i ( x j ) {\displaystyle r_{i}(x_{j})} Update the margins for each example: r i + 1 ( x j ) = r i ( x j ) + α h ( x j ) y j {\displaystyle r_{i+1}(x_{j})=r_{i}(x_{j})+\alpha h(x_{j})y_{j}} Update the time remaining: s = s − t {\displaystyle s=s-t} Output: H ( x ) = sign ( ∑ i α i h i ( x ) ) {\displaystyle H(x)={\textrm {sign}}\left(\sum _{i}\alpha _{i}h_{i}(x)\right)} == Empirical results == In preliminary experimental results with noisy datasets, BrownBoost outperformed AdaBoost's generalization error; however, LogitBoost performed as well as BrownBoost. An implementation of BrownBoost can be found in the open source software JBoost.