AI Email Helper

AI Email Helper — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Statistical learning theory

    Statistical learning theory

    Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics. == Introduction == The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood. Supervised learning involves learning from a training set of data. Every point in the training is an input–output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to predict the output from future input. Depending on the type of output, supervised learning problems are either problems of regression or problems of classification. If the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as an output. The regression would find the functional relationship between voltage and current to be R {\displaystyle R} , such that V = I R {\displaystyle V=IR} Classification problems are those for which the output will be an element from a discrete set of labels. Classification is very common for machine learning applications. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name. The input would be represented by a large multidimensional vector whose elements represent pixels in the picture. After learning a function based on the training set data, that function is validated on a test set of data, data that did not appear in the training set. == Formal description == Take X {\displaystyle X} to be the vector space of all possible inputs, and Y {\displaystyle Y} to be the vector space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown probability distribution over the product space Z = X × Y {\displaystyle Z=X\times Y} , i.e. there exists some unknown p ( z ) = p ( x , y ) {\displaystyle p(z)=p(\mathbf {x} ,y)} . The training set is made up of n {\displaystyle n} samples from this probability distribution, and is notated S = { ( x 1 , y 1 ) , … , ( x n , y n ) } = { z 1 , … , z n } {\displaystyle S=\{(\mathbf {x} _{1},y_{1}),\dots ,(\mathbf {x} _{n},y_{n})\}=\{\mathbf {z} _{1},\dots ,\mathbf {z} _{n}\}} Every x i {\displaystyle \mathbf {x} _{i}} is an input vector from the training data, and y i {\displaystyle y_{i}} is the output that corresponds to it. In this formalism, the inference problem consists of finding a function f : X → Y {\displaystyle f:X\to Y} such that f ( x ) ∼ y {\displaystyle f(\mathbf {x} )\sim y} . Let H {\displaystyle {\mathcal {H}}} be a space of functions f : X → Y {\displaystyle f:X\to Y} called the hypothesis space. The hypothesis space is the space of functions the algorithm will search through. Let V ( f ( x ) , y ) {\displaystyle V(f(\mathbf {x} ),y)} be the loss function, a metric for the difference between the predicted value f ( x ) {\displaystyle f(\mathbf {x} )} and the actual value y {\displaystyle y} . The expected risk is defined to be I [ f ] = ∫ X × Y V ( f ( x ) , y ) p ( x , y ) d x d y {\displaystyle I[f]=\int _{X\times Y}V(f(\mathbf {x} ),y)\,p(\mathbf {x} ,y)\,d\mathbf {x} \,dy} The target function, the best possible function f {\displaystyle f} that can be chosen, is given by the f {\displaystyle f} that satisfies f = argmin h ∈ H ⁡ I [ h ] {\displaystyle f=\mathop {\operatorname {argmin} } _{h\in {\mathcal {H}}}I[h]} Because the probability distribution p ( x , y ) {\displaystyle p(\mathbf {x} ,y)} is unknown, a proxy measure for the expected risk must be used. This measure is based on the training set, a sample from this unknown probability distribution. It is called the empirical risk I S [ f ] = 1 n ∑ i = 1 n V ( f ( x i ) , y i ) {\displaystyle I_{S}[f]={\frac {1}{n}}\sum _{i=1}^{n}V(f(\mathbf {x} _{i}),y_{i})} A learning algorithm that chooses the function f S {\displaystyle f_{S}} that minimizes the empirical risk is called empirical risk minimization. == Loss functions == The choice of loss function is a determining factor on the function f S {\displaystyle f_{S}} that will be chosen by the learning algorithm. The loss function also affects the convergence rate for an algorithm. It is important for the loss function to be convex. Different loss functions are used depending on whether the problem is one of regression or one of classification. === Regression === The most common loss function for regression is the square loss function (also known as the L2-norm). This familiar loss function is used in Ordinary Least Squares regression. The form is: V ( f ( x ) , y ) = ( y − f ( x ) ) 2 {\displaystyle V(f(\mathbf {x} ),y)=(y-f(\mathbf {x} ))^{2}} The absolute value loss (also known as the L1-norm) is also sometimes used: V ( f ( x ) , y ) = | y − f ( x ) | {\displaystyle V(f(\mathbf {x} ),y)=|y-f(\mathbf {x} )|} === Classification === In some sense the 0-1 indicator function is the most natural loss function for classification. It takes the value 0 if the predicted output is the same as the actual output, and it takes the value 1 if the predicted output is different from the actual output. For binary classification with Y = { − 1 , 1 } {\displaystyle Y=\{-1,1\}} , this is: V ( f ( x ) , y ) = θ ( − y f ( x ) ) {\displaystyle V(f(\mathbf {x} ),y)=\theta (-yf(\mathbf {x} ))} where θ {\displaystyle \theta } is the Heaviside step function. == Regularization == In machine learning problems, a major problem that arises is that of overfitting. Because learning is a prediction problem, the goal is not to find a function that most closely fits the (previously observed) data, but to find one that will most accurately predict output from future input. Empirical risk minimization runs this risk of overfitting: finding a function that matches the data exactly but does not predict future output well. Overfitting is symptomatic of unstable solutions; a small perturbation in the training set data would cause a large variation in the learned function. It can be shown that if the stability for the solution can be guaranteed, generalization and consistency are guaranteed as well. Regularization can solve the overfitting problem and give the problem stability. Regularization can be accomplished by restricting the hypothesis space H {\displaystyle {\mathcal {H}}} . A common example would be restricting H {\displaystyle {\mathcal {H}}} to linear functions: this can be seen as a reduction to the standard problem of linear regression. H {\displaystyle {\mathcal {H}}} could also be restricted to polynomial of degree p {\displaystyle p} , exponentials, or bounded functions on L1. Restriction of the hypothesis space avoids overfitting because the form of the potential functions are limited, and so does not allow for the choice of a function that gives empirical risk arbitrarily close to zero. One example of regularization is Tikhonov regularization. This consists of minimizing 1 n ∑ i = 1 n V ( f ( x i ) , y i ) + γ ‖ f ‖ H 2 {\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}V(f(\mathbf {x} _{i}),y_{i})+\gamma \left\|f\right\|_{\mathcal {H}}^{2}} where γ {\displaystyle \gamma } is a fixed and positive parameter, the regularization parameter. Tikhonov regularization ensures existence, uniqueness, and stability of the solution. == Bounding empirical risk == Consider a binary classifier f : X → { 0 , 1 } {\displaystyle f:{\mathcal {X}}\to \{0,1\}} . We can apply Hoeffding's inequality to bound the probability that the empirical risk deviates from the true risk to be a Sub-Gaussian distribution. P ( | R ^ ( f ) − R ( f ) | ≥ ϵ ) ≤ 2 e − 2 n ϵ 2 {\displaystyle \mathbb {P} (|{\hat {R}}(f)-R(f)|\geq \epsilon )\leq 2e^{-2n\epsilon ^{2}}} But generally, when we do empirical risk minimization, we are not given a classifier; we must choose it. Therefore, a more useful result is to bound the probability of the supremum of the difference over the whole class. P ( sup f ∈ F | R ^ ( f ) − R ( f ) | ≥ ϵ ) ≤ 2 S ( F , n ) e − n ϵ 2 / 8 ≈ n d e − n ϵ 2 / 8 {\displaystyle \mathbb {P} {\bigg (}\sup _{f\in {\mathcal {F}}}|{\hat {R}}(f)-R(f)|\geq \epsilon {\bigg )}\leq 2S({\mathcal {F}},n)e^{-n\epsilon ^{2}/8}\approx n^{d}e^{-n\epsilon ^{2}/8}} where S ( F , n ) {\displaystyle S({\mathcal {F}},n)} is the shattering number and n {\displaystyle n} is the number of samples in your dataset. The exponential term comes from Hoeffding but there is an extra cost of taking the supremum over the whole cla

    Read more →
  • Tanagra (machine learning)

    Tanagra (machine learning)

    Tanagra is a free suite of machine learning software for research and academic purposes developed by Ricco Rakotomalala at the Lumière University Lyon 2, France. Tanagra supports several standard data mining tasks such as: Visualization, Descriptive statistics, Instance selection, feature selection, feature construction, regression, factor analysis, clustering, classification and association rule learning. Tanagra is an academic project. It is widely used in French-speaking universities. Tanagra is frequently used in real studies and in software comparison papers. == History == The development of Tanagra was started in June 2003. The first version was distributed in December 2003. Tanagra is the successor of Sipina, another free data mining tool which is intended only for supervised learning tasks (classification), especially the interactive and visual construction of decision trees. Sipina is still available online and is maintained. Tanagra is an "open source project" as every researcher can access the source code and add their own algorithms, as long as they agree and conform to the software distribution license. The main purpose of the Tanagra project is to give researchers and students a user-friendly data mining software, conforming to the present norms of the software development in this domain (especially in the design of its GUI and the way to use it), and allowing the analyzation of either real or synthetic data. From 2006, Ricco Rakotomalala made an important documentation effort. A large number of tutorials are published on a dedicated website. They describe the statistical and machine learning methods and their implementation with Tanagra on real case studies. The use of other free data mining tools on the same problems is also widely described. The comparison of the tools enables readers to understand the possible differences in the presentation of results. == Description == Tanagra works similarly to current data mining tools. The user can design visually a data mining process in a diagram. Each node is a statistical or machine learning technique, the connection between two nodes represents the data transfer. But unlike the majority of tools which are based on the workflow paradigm, Tanagra is very simplified. The treatments are represented in a tree diagram. The results are displayed in an HTML format. This makes it is easy to export the outputs in order to visualize the results in a browser. It is also possible to copy the result tables to a spreadsheet. Tanagra makes a good compromise between statistical approaches (e.g. parametric and nonparametric statistical tests), multivariate analysis methods (e.g. factor analysis, correspondence analysis, cluster analysis, regression) and machine learning techniques (e.g. neural network, support vector machine, decision trees, random forest).

    Read more →
  • Jpred

    Jpred

    Jpred v.4 is the latest version of the JPred Protein Secondary Structure Prediction Server which provides predictions by the JNet algorithm, one of the most accurate methods for secondary structure prediction, that has existed since 1998 in different versions. In addition to protein secondary structure, JPred also makes predictions of solvent accessibility and coiled-coil regions. The JPred service runs up to 134 000 jobs per month and has carried out over 2 million predictions in total for users in 179 countries. == JPred 2 == The static HTML pages of JPred 2 are still available for reference. == JPred 3 == The JPred v3 followed on from previous versions of JPred developed and maintained by James Cuff and Jonathan Barber (see JPred References). This release added new functionality and fixed many bugs. The highlights are: New, friendlier user interface Retrained and optimised version of Jnet (v2) - mean secondary structure prediction accuracy of >81% Batch submission of jobs Better error checking of input sequences/alignments Predictions now (optionally) returned via e-mail Users may provide their own query names for each submission JPred now makes a prediction even when there are no PSI-BLAST hits to the query PS/PDF output now incorporates all the predictions == JPred 4 == The current version of JPred (v4) has the following improvements and updates incorporated: Retrained on the latest UniRef90 and SCOPe/ASTRAL version of Jnet (v2.3.1) - mean secondary structure prediction accuracy of >82%. Upgraded the Web Server to the latest technologies (Bootstrap framework, JavaScript) and updating the web pages – improving the design and usability through implementing responsive technologies. Added RESTful API and mass-submission and results retrieval scripts - resulting in peak throughput above 20,000 predictions per day. Added prediction jobs monitoring tools. Upgraded the results reporting – both, on the web-site, and through the optional email summary reports: improved batch submission, added results summary preview through Jalview results visualization summary in SVG and adding full multiple sequence alignments into the reports. Improved help-pages, incorporating tool-tips, and adding one-page step-by-step tutorials. Sequence residues are categorised or assigned to one of the secondary structure elements, such as alpha-helix, beta-sheet and coiled-coil. Jnet uses two neural networks for its prediction. The first network is fed with a window of 17 residues over each amino acid in the alignment plus a conservation number. It uses a hidden layer of nine nodes and has three output nodes, one for each secondary structure element. The second network is fed with a window of 19 residues (the result of first network) plus the conservation number. It has a hidden layer with nine nodes and has three output nodes.

    Read more →
  • NETtalk (artificial neural network)

    NETtalk (artificial neural network)

    NETtalk is an artificial neural network that learns to pronounce written English text by supervised learning. It takes English text as input, and produces a matching phonetic transcriptions as output. It is the result of research carried out in the mid-1980s by Terrence Sejnowski and Charles Rosenberg. The intent behind NETtalk was to construct simplified models that might shed light on the complexity of learning human level cognitive tasks, and their implementation as a connectionist model that could also learn to perform a comparable task. The authors trained it by backpropagation. The network was trained on a large amount of English words and their corresponding pronunciations, and is able to generate pronunciations for unseen words with a high level of accuracy. The output of the network was a stream of phonemes, which fed into DECtalk to produce audible speech. It achieved popular success, appearing on the Today show. From the point of view of modeling human cognition, NETtalk does not specifically model the image processing stages and letter recognition of the visual cortex. Rather, it assumes that the letters have been pre-classified and recognized. It is NETtalk's task to learn proper associations between the correct pronunciation with a given sequence of letters based on the context in which the letters appear. A similar architecture was subsequently used for the opposite task, that of converting continuous speech signal to a phoneme sequence. == Training == The training dataset was a 20,008-word subset of the Brown Corpus, with manually annotated phoneme and stress for each letter. The development process was described in a 1993 interview. It took three months -- 250 person-hours -- to create the training dataset, but only a few days to train the network. After it was run successfully on this, the authors tried it on a phonological transcription of an interview with a young Latino boy from a barrio in Los Angeles. This resulted in a network that reproduced his Spanish accent. The original NETtalk was implemented on a Ridge 32, which took 0.275 seconds per learning step (one forward and one backward pass). Training NETtalk became a benchmark to test for the efficiency of backpropagation programs. For example, an implementation on Connection Machine-1 (with 16384 processors) ran at 52x speedup. An implementation on a 10-cell Warp ran at 340x speedup. The following table compiles the benchmark scores as of 1988. Speed is measured in "millions of connections per second" (MCPS). For example, the original NETtalk on Ridge 32 took 0.275 seconds per forward-backward pass, giving 18629 / 10 6 0.275 = 0.068 {\displaystyle {\frac {18629/10^{6}}{0.275}}=0.068} MCPS. Relative times are normalized to the MicroVax. == Architecture == The network had three layers and 18,629 adjustable weights, large by the standards of 1986. There were worries that it would overfit the dataset, but it was trained successfully. The input of the network has 203 units, divided into 7 groups of 29 units each. Each group is a one-hot encoding of one character. There are 29 possible characters: 26 letters, comma, period, and word boundary (whitespace). To produce the pronunciation of a single character, the network takes the character itself, as well as 3 characters before and 3 characters after it. The hidden layer has 80 units. The output has 26 units. 21 units encode for articulatory features (point of articulation, voicing, vowel height, etc.) of phonemes, and 5 units encode for stress and syllable boundaries. Sejnowski studied the learned representation in the network, and found that phonemes that sound similar are clustered together in representation space. The output of the network degrades, but remains understandable, when some hidden neurons are removed.

    Read more →
  • AI data center

    AI data center

    An AI data center is a specialized data center facility designed for the computationally intensive tasks of training and running inference for artificial intelligence (AI) and machine learning models. Unlike general-purpose data centers, they are optimized for the parallel processing demands of AI workloads, typically using hardware such as AI accelerators (e.g., GPUs, TPUs) and high-speed interconnects. The global push to construct these specialized facilities accelerated dramatically during the AI boom of the 2020s. Memory manufacturers prioritized production of High Bandwidth Memory (HBM) essential for AI servers, which led to a global memory supply shortage amid a broader competition for advanced chips, power, and infrastructure. Major tech companies are estimated to spend $650 billion on AI data centers in 2026. == Architecture == Data centers for building and running large machine learning models contain specialized computer chips, GPUs, that use 2 to 4 times as much energy as their regular CPU counterparts (250-500 watts). AI data centers use 60 or more kilowatts per server rack, whereas more standard data centers typically use 5 to 10 kilowatts per rack. == Operators == As of August 2025, The Information tracked 18 planned or existing AI data centers in the United States, operated by Amazon Web Services, CoreWeave, Crusoe, Meta, Microsoft/OpenAI, Oracle, Tesla, and xAI. Other AI data center operators include Digital Realty and Alibaba. Data centers are also being built in China, India, Europe, Saudi Arabia, and Canada. The New Yorker described CoreWeave as the most prominent AI data center operator in the United States. Two types of data center providers for machine learning have been noted: hyperscalers and neoclouds. The Verge listed large technology companies such as Google, Meta, Microsoft, Oracle and Amazon as hyperscalers. The New York Times described neoclouds as "a new generation of data center providers". CoreWeave, Nebius, Nscale, and Lambda have been described as examples of neoclouds. In January 2025, OpenAI, in partnership with Oracle and Softbank, announced the Stargate project, which as of September 2025 is composed of six built or proposed AI data centers in the United States. In response to the Stargate project, Amazon launched in October 2025 an AI data center on 1,200 acres of farmland in Indiana. This data center, known as Project Rainier, is one of the largest AI data centers in the world, with Amazon spending $11 billion on the project. Rainier is specifically intended for training and running machine learning models from Anthropic. As of that time, this facility contains seven data centers (out of an estimated 30 planned) and will use 2.2 gigawatts of electricity (equivalent to 1 million households) and millions of gallons of water per year. Computer chips from Annapurna Labs and Anthropic, Trainium 2, were designed for use in such facilities. Amazon pumped millions of gallons of water out of the ground to construct the data center, and as of June 2025, Indiana state officials are investigating whether this dewatering process led to dry wells for local residents. In November 2025, Anthropic announced a plan in partnership with Fluidstack to develop artificial intelligence infrastructure in the United States, including data centers in New York and Texas, worth $50 billion. Other AI data center projects include the Colossus supercomputer from xAI, a Louisiana-based project from Meta, Hyperion, expected to use 5 GW of power, and a second Ohio-based Meta project, Prometheus, with a capacity of 1 GW. A 3,200-acre AI data center, capable of 4.4-4.5 GW of power and located on the decommissioned Homer City Generating Station, is under construction as of 2025, and will use seven 30-acre gas generating stations supplied by EQT. As of December 2025, CRH is working on over 100 data centers in the United States. In 2025, ExxonMobil and NextEra announced plans to build a data center powered by natural gas and using carbon capture technology, with 1.2 GW of power capacity. They previously purchased 2,500 acres of land in the Southeastern United States and plan to market the data center to an artificial intelligence company. The increased interest in AI data centers has led to several executives from companies in that space becoming billionaires, including CoreWeave, QTS, Nebius, Astera Labs, Groq, Fermi (which is connected to former United States Secretary of Energy Rick Perry), Snowflake and Cipher Mining. Several companies involved in cryptocurrency mining, such as Bitdeer, CoreWeave, Cipher Mining, TeraWulf, IREN, Core Scientific, and CleanSpark have also been involved with AI data centers. == Finances == Between January and August 2024, Microsoft, Meta, Google and Amazon collectively spent $125 billion on AI data centers. Citigroup forecasted that $2.8 trillion would be spent on AI data centers by 2030, while McKinsey and Company estimated that almost $7 trillion would be spent globally by that time. According to S&P Global, $61 billion has been spent on the data center market as a whole in 2025, while debt issuance for data centers was $182 billion during the same year. Large technology companies have offloaded the financial risks of building AI data centers by setting up special purpose vehicles or by contracting with neoclouds. For example, Meta's Hyperion was mostly funded by Blue Owl Capital, which did so using a bond offering from PIMCO. Those bonds were sold to a number of clients, including BlackRock. Meta did not borrow money itself and instead established a special purpose vehicle from which it would rent the data center. This deal was structured by Morgan Stanley for $30 billion, the largest known private capital transaction as of 2025. Neoclouds such as CoreWeave have gone into debt to buy computer chips from Nvidia for their data centers, and the chips themselves have been used for loan collateral. As of December 2025, CoreWeave took out three GPU-backed loans, collectively worth $12.4 billion, from private credit firms (Blackstone, Coatue, BlackRock, PIMCO) and from banks (Goldman Sachs, JPMorgan Chase, Wells Fargo). Thus, these companies provide an indirect connection between private credit and established banks. Data centers have also established asset-backed securities, and debt for data centers has its own derivative financial products. The real estate industry, including asset managers, public companies and private investors, has also invested in data centers. == Energy sourcing == == Environmental footprint == Average AI data centers have an electricity footprint equivalent to 100,000 households, and use billions of gallons of water for cooling their hardware. In 2025, the International Energy Agency estimated that the larger AI data centers currently under construction could consume as much electricity as 2 million households. A 2024 report from the United States Department of Energy stated that data centers overall used 17 billion gallons of water per year in the United States, primarily due to "rapid proliferation of AI servers", and that this usage was forecasted to grow to nearly 80 billion gallons by 2028. Researchers estimated that AI data centers in the United States would emit 24-44 million metric tons of carbon dioxide and use 731–1,125 million cubic meters of water per year between 2024 and 2030. Peaking power plants, which have been proposed as a power source for AI data centers, emit sulfur dioxide and have historically been located disproportionately near communities of color in the United States. Reciprocating internal combustion engines, proposed as another power source for a data center, emit PM 2.5, nitrogen oxides, and volatile organic compounds. == AI data centers in the United States == In the United States, both the Biden administration and second Trump administration supported the construction of AI data centers. In January 2025, then-president Joe Biden signed an executive order for federal government agencies to support AI data centers on federal sites built by private companies, study their effect on energy prices, and encourage their use of renewable energy. In April 2025, the United States Department of Energy suggested 16 possible sites, including Los Alamos National Laboratory, Sandia National Laboratories and Oak Ridge National Laboratory. In its July 2025 AI Action Plan, the second Trump administration supported increased production of AI data centers. Several US states have incentivized local data center construction. For example, in 2024, lawmakers in Michigan approved tax breaks for data center equipment and construction material. Some data center companies have also invested or promised to invest in the infrastructure of local communities. In December 2025, Democratic senators Elizabeth Warren, Chris Van Hollen, and Richard Blumenthal wrote to seven technology companies (Google, Microsoft, Amazon, Meta, CoreWeave, Digital Realty, and Equinix) that they w

    Read more →
  • Evolutionary attractor

    Evolutionary attractor

    An evolutionary attractor is a point in an evolutionary space where a selection process will always drive trait values towards that point from the region around it. Because of the importance of evolution through natural selection, often such an evolutionary space will be defined by genetic or phenotypic traits, or possibly both. In this case the selection process will be a form of natural selection. The existence of an evolutionary attractor in a biological evolutionary space does not always imply that it can be reached from all points in that evolutionary space, nor does it identify what will happen when the evolutionary attractor is reached. While an evolutionary attractor may represent a point in evolutionary space that is resistant to further selection, such as an evolutionarily stable strategy, other possibilities are available. Because identification of an evolutionary attractor on its own does not describe everything about the evolutionary space in which it lies, this has led to interest in the evolutionary dynamics surrounding evolutionary attractors and in evolutionary spaces in general. (Theoretical biologists and mathematicians working in the area may prefer the terms adaptive dynamics or evolutionary invasion analysis to evolutionary dynamics.) These fields use differential equations which allows a more complete understanding of the dynamics in evolutionary spaces including the existence or otherwise of evolutionary attractors. Advances in the study of molecular evolution have also led to the identification of evolutionary attractors at a molecular level. Because biological evolutionary processes have been studied using evolutionary game theory, a technique inspired by game theory originally derived to address economic problems, not only can evolutionary attractors be found in biology but economists studying evolutionary economic models have also identified evolutionary attractors. Evolution in biology has also inspired evolutionary computation in computer science. Many algorithms in this field use a form of selection inspired by natural selection to generate results through evolutionary algorithms. This is therefore another area in which evolutionary attractors have been identified. == Evolutionary attractors in biology == It is not probably not surprising that biology is the field where most examples of evolutionary attractors have been identified, given the importance of evolution through natural selection. === Evolutionary attractors in adaptive landscapes === An evolutionary attractor is a point in genetic and/or phenotypic trait space, that evolution will always drive trait values towards via a selection process. The concept of an evolutionary attractor arose in population genetics following the origin of the adaptive landscape originally proposed by Sewall Wright in 1932. The height of a point in an adaptive landscape is a measure of evolutionary fitness. If a point in an adaptive landscape is a peak, then selection will always drive traits towards it and it will be an evolutionary attractor. While population genetics deals with discrete genetic traits, quantitative genetics extended such concepts to deal with continuous genetic traits, where the concept of evolutionary attractor is also valid. === Evolutionary attractors in evolutionary game models === Evolutionary game theory introduced into evolutionary biology concepts originally used in economics, with the advantage that evolution could be studied in relation to strategic choices made in animal conflicts. This is of particular interest because of the concept of the evolutionarily stable strategy or ESS, a strategy that once established is resistant to invasion by other strategies. ESSs will not always be evolutionary attractors, but if they are they will persist over evolutionary time. === Dynamics around evolutionary attractors in biology === Evolutionary attractors in biology do not exist in isolation. By definition they must exist in an evolutionary trait space where selection drives all traits towards them from a region immediately around them. That is, they must be convergence stable. Eshel (1983) modified the definition of an ESS by considering individually advantageous reduction from a majority deviation: he created the term continuous stability. A continuously stable ESS can be shown to be convergence stable, therefore it will act as an evolutionary attractor. But the nature of evolutionary trait spaces in biology means that it is not possible to guarantee that the region of convergence to the evolutionary attractor covers the whole of the trait space, nor that there is only one evolutionary attractor in a particular trait space. These issues have led to the emergence of the related fields of evolutionary dynamics, adaptive dynamics and evolutionary invasion analysis, all of which use differential equations to understand the dynamics in evolutionary trait spaces. Hence, if one or more evolutionary attractor exists in an evolutionary trait space, they provide techniques to understand the dynamics in that trait space around the evolutionary attractor. === Evolutionary attractors in an ecological context === Evolution in biology does not take place in single species in isolation. Ecological interaction of species leads to coevolution. Important examples of this are host-parasite or host-pathogen interaction, which can make both the dynamics around evolutionary attractors more complex, and the occurrence and number of evolutionary attractors more diverse. Evolutionary attractors have been identified in the analysis of evolutionary epidemiology of plant pathogens. In the above study working on plant populations the authors were able to identify evolutionary attractors using methods from adaptive dynamics. A model applied to the analysis of a maize (Zea mays L.) virus identified convergence stable equilibria through simulation modelling. A related model identified evolutionary attractors in the interaction of plants with fungal pathogens. === Evolutionary attractors in molecular genetics === As mentioned above much of the consideration of evolutionary attractors in biology has been through investigation of selection at a genetic or phenotypic level or both, in a single species or in coevolving species. Advances in the study of molecular genetics now allow the study of evolutionary attractors to be taken to a molecular genetic level. Wilson et. al (2019) studied the evolution of gene regulatory networks and identified the emergence of evolutionary attractors. == Evolutionary attractors in economics == Evolutionary game theory as applied in biology was inspired by game theory originally devised for applications in economics. Game theory remains an active field of research outside of biology, and thus it is not surprising that researchers in evolutionary economics use evolutionary game theory. Evolutionary attractors have been demonstrated by economists studying the evolutionary dynamics of market entry with market dynamics based on the replicator dynamics of biological evolutionary games. == Evolutionary attractors in computing == Evolutionary computation is a branch of computer science inspired by biological evolution. Many algorithms in evolutionary computation use a form of selection. Thus evolutionary attractors have been identified in computer science as well as in biology and economics. Evolutionary algorithms have generated evolutionary attractors, probably because of the similarity between adaptive hill-climbing in evolutionary heuristics and the adaptive landscape originated to explain evolution through natural selection.

    Read more →
  • Targeted maximum likelihood estimation

    Targeted maximum likelihood estimation

    Targeted Maximum Likelihood Estimation (TMLE) (also more accurately referred to as Targeted Minimum Loss-Based Estimation) is a general statistical estimation framework for causal inference and semiparametric models. TMLE combines ideas from maximum likelihood estimation, semiparametric efficiency theory, and machine learning. It was introduced by Mark J. van der Laan and colleagues in the mid-2000s as a method that yields asymptotically efficient plug-in estimators while allowing the use of flexible, data-adaptive algorithms such as ensemble machine learning for nuisance parameter estimation. TMLE is used in epidemiology, biostatistics, and the social sciences to estimate causal effects in observational and experimental studies. Applications of TMLE include Longitudinal TMLE (LTMLE) for time-varying treatments and confounders. Variations in how the targeting step in TMLE is carried out have resulted in various versions of TMLE such as Collaborative TMLE (CTMLE) and Adaptive TMLE for improved finite-sample performance and automated variable selection. == History == The TMLE framework was first described by van der Laan and Rubin (2006) as a general approach for the construction of efficient plug-in estimators of smooth features of the data density. It was demonstrated in the context of causal inference and missing data problems. It was developed to address limitations of traditional doubly robust methods, such as Augmented Inverse Probability Weighting (AIPW), by respecting the plug-in principle in the sense that it respects that the target parameter is a function of the data density that is an element of the statistical model. TMLE estimates the data density or relevant parts of it with machine learning and targets these machine learning fits before it is plugged in the target parameter mapping. In this manner, a TMLE always respects global knowledge and satisfies known bounds such as that the target parameter is a probability . Since its introduction, TMLE has been developed in a series of theoretical and applied papers, culminating in book-length treatments of the method and its applications to survival analysis, adaptive designs, and longitudinal data. == Methodology == At its core, TMLE is a two-step estimation procedure: Initial estimation: Machine learning methods (such as the Super Learner ensemble) are used to obtain flexible estimates of nuisance parameters, such as outcome regressions and propensity scores. Targeting step: The initial estimate is updated by solving a score equation (the efficient influence function) so that the final estimator is consistent, asymptotically normal, and efficient under mild regularity conditions. The targeted machine learning fit is then mapped into the corresponding estimator of the target parameter by simply plugging it in the target parameter mapping. This approach balances the bias–variance trade-off by combining data-adaptive estimation with semiparametric efficiency theory. TMLE is doubly robust, meaning it remains consistent if either the outcome model or the treatment model is consistently estimated. === Formula === Here we explain the TMLE of the average treatment effect of a binary treatment on an outcome adjusting for baseline covariates. Consider i.i.d. observations O i = ( W i , A i , Y i ) {\displaystyle O_{i}=(W_{i},A_{i},Y_{i})} from a distribution P 0 {\displaystyle P_{0}} , where W {\displaystyle W} are baseline covariates, A {\displaystyle A} is a binary treatment, and Y {\displaystyle Y} is an outcome. Let Q ¯ ( a , w ) = E [ Y ∣ A = a , W = w ] {\displaystyle {\bar {Q}}(a,w)=\mathbb {E} [Y\mid A=a,W=w]} represent the outcome model and g ( a ∣ w ) = P ( A = a ∣ W = w ) {\displaystyle g(a\mid w)=P(A=a\mid W=w)} represent the propensity score. The average treatment effect (ATE) is given by ψ 0 = E { Q ¯ ( 1 , W ) − Q ¯ ( 0 , W ) } . {\displaystyle \psi _{0}=\mathbb {E} \{{\bar {Q}}(1,W)-{\bar {Q}}(0,W)\}.} A basic TMLE for the ATE proceeds as follows: Step 1: Estimate initial models. Obtain estimates Q ¯ ^ ( a , w ) {\displaystyle {\hat {\bar {Q}}}(a,w)} and g ^ ( a ∣ w ) {\displaystyle {\hat {g}}(a\mid w)} , often using flexible methods such as Super Learner. Step 2: Compute the clever covariate. Define: H ( A , W ) = A g ^ ( 1 ∣ W ) − 1 − A g ^ ( 0 ∣ W ) . {\displaystyle H(A,W)={\frac {A}{{\hat {g}}(1\mid W)}}-{\frac {1-A}{{\hat {g}}(0\mid W)}}.} Step 3: Estimate the fluctuation parameter. Fit a logistic regression of Y {\displaystyle Y} on H ( A , W ) {\displaystyle H(A,W)} with logit ⁡ ( Q ¯ ^ ( A , W ) ) {\displaystyle \operatorname {logit} ({\hat {\bar {Q}}}(A,W))} as offset. This yields ε ^ {\displaystyle {\hat {\varepsilon }}} , the MLE that solves the score equation: 1 n ∑ i = 1 n H ( A i , W i ) { Y i − Q ¯ ^ ε ( A i , W i ) } = 0. {\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}H(A_{i},W_{i}){\big \{}Y_{i}-{\hat {\bar {Q}}}^{\varepsilon }(A_{i},W_{i}){\big \}}=0.} Step 4: Update the initial estimate. Apply the "blip" to obtain the targeted estimate: Q ¯ ^ ∗ ( A , W ) = expit ⁡ ( logit ⁡ ( Q ¯ ^ ( A , W ) ) + ε ^ H ( A , W ) ) . {\displaystyle {\hat {\bar {Q}}}^{}(A,W)=\operatorname {expit} {\Big (}\operatorname {logit} {\big (}{\hat {\bar {Q}}}(A,W){\big )}+{\hat {\varepsilon }}\,H(A,W){\Big )}.} Step 5: Compute the TMLE. The ATE estimate is: ψ ^ TMLE = 1 n ∑ i = 1 n [ Q ¯ ^ ∗ ( 1 , W i ) − Q ¯ ^ ∗ ( 0 , W i ) ] . {\displaystyle {\hat {\psi }}_{\text{TMLE}}={\frac {1}{n}}\sum _{i=1}^{n}{\big [}{\hat {\bar {Q}}}^{}(1,W_{i})-{\hat {\bar {Q}}}^{}(0,W_{i}){\big ]}.} Inference. The efficient influence function (EIF) for the ATE is: D ∗ ( O ) = H ( A , W ) { Y − Q ¯ ∗ ( A , W ) } + Q ¯ ∗ ( 1 , W ) − Q ¯ ∗ ( 0 , W ) − ψ . {\displaystyle D^{}(O)=H(A,W)\{Y-{\bar {Q}}^{}(A,W)\}+{\bar {Q}}^{}(1,W)-{\bar {Q}}^{}(0,W)-\psi .} The variance is estimated by σ ^ 2 = n − 1 ∑ i = 1 n ( D ∗ ( O i ) ) 2 {\displaystyle {\hat {\sigma }}^{2}=n^{-1}\sum _{i=1}^{n}{\big (}D^{}(O_{i}){\big )}^{2}} , yielding Wald-type confidence intervals ψ ^ TMLE ± z 1 − α / 2 σ ^ / n {\displaystyle {\hat {\psi }}_{\text{TMLE}}\pm z_{1-\alpha /2}\,{\hat {\sigma }}/{\sqrt {n}}} . Remark. For continuous outcomes, a linear fluctuation Q ¯ ^ ∗ = Q ¯ ^ + ε ^ H {\displaystyle {\hat {\bar {Q}}}^{}={\hat {\bar {Q}}}+{\hat {\varepsilon }}\,H} may be used instead. For bounded continuous outcomes, the logistic fluctuation (after rescaling Y {\displaystyle Y} to [ 0 , 1 ] {\displaystyle [0,1]} ) is often preferred for improved finite-sample performance. == Applications == TMLE has been applied in: Epidemiology: Estimating causal effects of exposures and interventions in observational cohort studies. Clinical trials and real-world evidence: The Targeted Learning roadmap provides a structured framework for generating and validating real-world evidence (RWE), bridging randomized trials and observational data using TMLE and related estimation techniques. This approach enables transparency, sensitivity analysis, and stronger causal inference for regulatory and clinical trial contexts. High-dimensional settings: Integration with ensemble methods for causal effect estimation. TMLE has been successfully applied in pharmacoepidemiology where a large number of covariates are automatically selected to adjust for confounding. In a study of post–myocardial infarction statin use and 1-year mortality, TMLE demonstrated robust performance relative to inverse probability weighting in scenarios with hundreds of potential confounders. == Derivatives and extensions == Longitudinal TMLE (LTMLE): A methodological extension of TMLE for longitudinal data with time-varying treatments, confounders, and censoring. It allows the estimation of dynamic treatment regimes and intervention-specific causal effects over time. This framework was originally introduced by van der Laan & Gruber (2012). Collaborative TMLE (CTMLE): Enhances finite-sample performance and variable selection by collaboratively fitting the treatment mechanism in conjunction with the target parameter. == Software == Several R packages implement TMLE and related methods: tmle: Functions for binary, categorical, and continuous outcomes. ltmle: Implementation for longitudinal data with time-varying treatments and outcomes. ctmle: Algorithms for collaborative TMLE and adaptive variable selection. SuperLearner: A theoretically grounded, cross-validated ensemble learning method that combines predictions from multiple algorithms to minimize predictive risk. Widely used in TMLE for estimating nuisance parameters. The original implementation is available as the R package SuperLearner. Recent machine learning platforms like H2O AutoML implement similar ensemble strategies, combining diverse learners in parallel and leveraging stacking and blending techniques, effectively functioning as a large-scale Super Learner.

    Read more →
  • Q-learning

    Q-learning

    Q-learning is a reinforcement learning algorithm that trains an agent to assign values to its possible actions based on its current state, without requiring a model of the environment (model-free). It can handle problems with stochastic transitions and rewards without requiring adaptations. For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, Q-learning might assign a higher value to moving right than left if right gets to the exit faster, improving this choice by trying both directions over time. For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given finite Markov decision process, given infinite exploration time and a partly random policy. "Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state. == Reinforcement learning == Reinforcement learning involves an agent, a set of states S {\displaystyle {\mathcal {S}}} , and a set A {\displaystyle {\mathcal {A}}} of actions per state. By performing an action a ∈ A {\displaystyle a\in {\mathcal {A}}} , the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state. As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as they open, minimizing the initial wait time for yourself. If the train is crowded, however, then you will have a slow entry after the initial action of entering the door as people are fighting you to depart the train as you attempt to board. The total boarding time, or cost, is then: 0 seconds wait time + 15 seconds fight time On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now: 5 second wait time + 0 second fight time Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy. == Algorithm == After Δ t {\displaystyle \Delta t} steps into the future the agent will decide some next step. The weight for this step is calculated as γ Δ t {\displaystyle \gamma ^{\Delta t}} , where γ {\displaystyle \gamma } (the discount factor) is a number between 0 and 1 ( 0 ≤ γ ≤ 1 {\displaystyle 0\leq \gamma \leq 1} ). Assuming γ < 1 {\displaystyle \gamma <1} , it has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). γ {\displaystyle \gamma } may also be interpreted as the probability to succeed (or survive) at every step Δ t {\displaystyle \Delta t} . The algorithm, therefore, has a function that calculates the quality of a state–action combination: Q : S × A → R {\displaystyle Q:{\mathcal {S}}\times {\mathcal {A}}\to \mathbb {R} } . Before learning begins, ⁠ Q {\displaystyle Q} ⁠ is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t {\displaystyle t} the agent selects an action A t {\displaystyle A_{t}} , observes a reward R t + 1 {\displaystyle R_{t+1}} , enters a new state S t + 1 {\displaystyle S_{t+1}} (that may depend on both the previous state S t {\displaystyle S_{t}} and the selected action), and Q {\displaystyle Q} is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information: Q n e w ( S t , A t ) ← ( 1 − α ⏟ learning rate ) ⋅ Q ( S t , A t ) ⏟ current value + α ⏟ learning rate ⋅ ( R t + 1 ⏟ reward + γ ⏟ discount factor ⋅ max a Q ( S t + 1 , a ) ⏟ estimate of optimal future value ⏟ new value (temporal difference target) ) {\displaystyle Q^{new}(S_{t},A_{t})\leftarrow (1-\underbrace {\alpha } _{\text{learning rate}})\cdot \underbrace {Q(S_{t},A_{t})} _{\text{current value}}+\underbrace {\alpha } _{\text{learning rate}}\cdot {\bigg (}\underbrace {\underbrace {R_{t+1}} _{\text{reward}}+\underbrace {\gamma } _{\text{discount factor}}\cdot \underbrace {\max _{a}Q(S_{t+1},a)} _{\text{estimate of optimal future value}}} _{\text{new value (temporal difference target)}}{\bigg )}} where R t + 1 {\displaystyle R_{t+1}} is the reward received when moving from the state S t {\displaystyle S_{t}} to the state S t + 1 {\displaystyle S_{t+1}} , and α {\displaystyle \alpha } is the learning rate ( 0 < α ≤ 1 ) {\displaystyle (0<\alpha \leq 1)} . Note that Q n e w ( S t , A t ) {\displaystyle Q^{new}(S_{t},A_{t})} is the sum of three terms: ( 1 − α ) Q ( S t , A t ) {\displaystyle (1-\alpha )Q(S_{t},A_{t})} : the current value (weighted by one minus the learning rate) α R t + 1 {\displaystyle \alpha \,R_{t+1}} : the reward R t + 1 {\displaystyle R_{t+1}} to obtain if action A t {\displaystyle A_{t}} is taken when in state S t {\displaystyle S_{t}} (weighted by learning rate) α γ max a Q ( S t + 1 , a ) {\displaystyle \alpha \gamma \max _{a}Q(S_{t+1},a)} : the maximum reward that can be obtained from state S t + 1 {\displaystyle S_{t+1}} (weighted by learning rate and discount factor) An episode of the algorithm ends when state S t + 1 {\displaystyle S_{t+1}} is a final or terminal state. However, Q-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops or paths. For all final states s f {\displaystyle s_{f}} , Q ( s f , a ) {\displaystyle Q(s_{f},a)} is never updated, but is set to the reward value r {\displaystyle r} observed for state s f {\displaystyle s_{f}} . In most cases, Q ( s f , a ) {\displaystyle Q(s_{f},a)} can be taken to equal zero. == Influence of variables == === Learning rate === The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of α t = 1 {\displaystyle \alpha _{t}=1} is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as α t = 0.1 {\displaystyle \alpha _{t}=0.1} for all t {\displaystyle t} . === Discount factor === The discount factor ⁠ γ {\displaystyle \gamma } ⁠ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. r t {\displaystyle r_{t}} (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For ⁠ γ = 1 {\displaystyle \gamma =1} ⁠, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning. === Initial conditions (Q0) === Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternative, thus increasing their choice probability. The first reward r {\displaystyle r} can be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value

    Read more →
  • Robot Monk Xian'er

    Robot Monk Xian'er

    Robot Monk Xian'er (Chinese: 贤二机器僧) is a humanoid robot based on the cartoon character Xian'er. It was developed by a team of monks, volunteers and AI experts from Beijing Longquan Monastery in Beijing, China. He can follow human instructions to make body movements, read scriptures and play Buddhist music. He can chat and respond to people's emotional and spiritual questions with Buddhist wisdom. As a chatbot, Robot Monk Xian'er is available on certain public platforms including WeChat and Facebook. Over the years, master Xuecheng, the abbot of Beijing Longquan Monastery, replied to thousands of questions on Sina Weibo. These questions and their answers become the data source of the chatbot.

    Read more →
  • Cartesian genetic programming

    Cartesian genetic programming

    Cartesian genetic programming is a form of genetic programming that uses a graph representation to encode computer programs. It grew from a method of evolving digital circuits developed by Julian F. Miller and Peter Thomson in 1997. The term ‘Cartesian genetic programming’ first appeared in 1999 and was proposed as a general form of genetic programming in 2000. It is called ‘Cartesian’ because it represents a program using a two-dimensional grid of nodes. Miller's keynote explains how CGP works. He edited a book entitled Cartesian Genetic Programming, published in 2011 by Springer. The open source project dCGP implements a differentiable version of CGP developed at the European Space Agency by Dario Izzo, Francesco Biscani and Alessio Mereta able to approach symbolic regression tasks, to find solution to differential equations, find prime integrals of dynamical systems, represent variable topology artificial neural networks and more.

    Read more →
  • Genetic representation

    Genetic representation

    In computer programming, genetic representation is a way of presenting solutions/individuals in evolutionary computation methods. The term encompasses both the concrete data structures and data types used to realize the genetic material of the candidate solutions in the form of a genome, and the relationships between search space and problem space. In the simplest case, the search space corresponds to the problem space (direct representation). The choice of problem representation is tied to the choice of genetic operators, both of which have a decisive effect on the efficiency of the optimization. Genetic representation can encode appearance, behavior, physical qualities of individuals. Difference in genetic representations is one of the major criteria drawing a line between known classes of evolutionary computation. Terminology is often analogous with natural genetics. The block of computer memory that represents one candidate solution is called an individual. The data in that block is called a chromosome. Each chromosome consists of genes. The possible values of a particular gene are called alleles. A programmer may represent all the individuals of a population using binary encoding, permutational encoding, encoding by tree, or any one of several other representations. == Representations in some popular evolutionary algorithms == Genetic algorithms (GAs) are typically linear representations; these are often, but not always, binary. Holland's original description of GA used arrays of bits. Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size. This facilitates simple crossover operation. Depending on the application, variable-length representations have also been successfully used and tested in evolutionary algorithms (EA) in general and genetic algorithms in particular, although the implementation of crossover is more complex in this case. Evolution strategy uses linear real-valued representations, e.g., an array of real values. It uses mostly gaussian mutation and blending/averaging crossover. Genetic programming (GP) pioneered tree-like representations and developed genetic operators suitable for such representations. Tree-like representations are used in GP to represent and evolve functional programs with desired properties. Human-based genetic algorithm (HBGA) offers a way to avoid solving hard representation problems by outsourcing all genetic operators to outside agents, in this case, humans. The algorithm has no need for knowledge of a particular fixed genetic representation as long as there are enough external agents capable of handling those representations, allowing for free-form and evolving genetic representations. === Common genetic representations === binary array integer or real-valued array binary tree natural language parse tree directed graph == Distinction between search space and problem space == Analogous to biology, EAs distinguish between problem space (corresponds to phenotype) and search space (corresponds to genotype). The problem space contains concrete solutions to the problem being addressed, while the search space contains the encoded solutions. The mapping from search space to problem space is called genotype-phenotype mapping. The genetic operators are applied to elements of the search space, and for evaluation, elements of the search space are mapped to elements of the problem space via genotype-phenotype mapping. == Relationships between search space and problem space == The importance of an appropriate choice of search space for the success of an EA application was recognized early on. The following requirements can be placed on a suitable search space and thus on a suitable genotype-phenotype mapping: === Completeness === All possible admissible solutions must be contained in the search space. === Redundancy === When more possible genotypes exist than phenotypes, the genetic representation of the EA is called redundant. In nature, this is termed a degenerate genetic code. In the case of a redundant representation, neutral mutations are possible. These are mutations that change the genotype but do not affect the phenotype. Thus, depending on the use of the genetic operators, there may be phenotypically unchanged offspring, which can lead to unnecessary fitness determinations, among other things. Since the evaluation in real-world applications usually accounts for the lion's share of the computation time, it can slow down the optimization process. In addition, this can cause the population to have higher genotypic diversity than phenotypic diversity, which can also hinder evolutionary progress. In biology, the Neutral Theory of Molecular Evolution states that this effect plays a dominant role in natural evolution. This has motivated researchers in the EA community to examine whether neutral mutations can improve EA functioning by giving populations that have converged to a local optimum a way to escape that local optimum through genetic drift. This is discussed controversially and there are no conclusive results on neutrality in EAs. On the other hand, there are other proven measures to handle premature convergence. === Locality === The locality of a genetic representation corresponds to the degree to which distances in the search space are preserved in the problem space after genotype-phenotype mapping. That is, a representation has a high locality exactly when neighbors in the search space are also neighbors in the problem space. In order for successful schemata not to be destroyed by genotype-phenotype mapping after a minor mutation, the locality of a representation must be high. === Scaling === In genotype-phenotype mapping, the elements of the genotype can be scaled (weighted) differently. The simplest case is uniform scaling: all elements of the genotype are equally weighted in the phenotype. A common scaling is exponential. If integers are binary coded, the individual digits of the resulting binary number have exponentially different weights in representing the phenotype. Example: The number 90 is written in binary (i.e., in base two) as 1011010. If now one of the front digits is changed in the binary notation, this has a significantly greater effect on the coded number than any changes at the rear digits (the selection pressure has an exponentially greater effect on the front digits). For this reason, exponential scaling has the effect of randomly fixing the "posterior" locations in the genotype before the population gets close enough to the optimum to adjust for these subtleties. == Hybridization and repair in genotype-phenotype mapping == When mapping the genotype to the phenotype being evaluated, domain-specific knowledge can be used to improve the phenotype and/or ensure that constraints are met. This is a commonly used method to improve EA performance in terms of runtime and solution quality. It is illustrated below by two of the three examples. == Examples == === Example of a direct representation === An obvious and commonly used encoding for the traveling salesman problem and related tasks is to number the cities to be visited consecutively and store them as integers in the chromosome. The genetic operators must be suitably adapted so that they only change the order of the cities (genes) and do not cause deletions or duplications. Thus, the gene order corresponds to the city order and there is a simple one-to-one mapping. === Example of a complex genotype-phenotype mapping === In a scheduling task with heterogeneous and partially alternative resources to be assigned to a set of subtasks, the genome must contain all necessary information for the individual scheduling operations or it must be possible to derive them from it. In addition to the order of the subtasks to be executed, this includes information about the resource selection. A phenotype then consists of a list of subtasks with their start times and assigned resources. In order to be able to create this, as many allocation matrices must be created as resources can be allocated to one subtask at most. In the simplest case this is one resource, e.g., one machine, which can perform the subtask. An allocation matrix is a two-dimensional matrix, with one dimension being the available time units and the other being the resources to be allocated. Empty matrix cells indicate availability, while an entry indicates the number of the assigned subtask. The creation of allocation matrices ensures firstly that there are no inadmissible multiple allocations. Secondly, the start times of the subtasks can be read from it as well as the assigned resources. A common constraint when scheduling resources to subtasks is that a resource can only be allocated once per time unit and that the reservation must be for a contiguous period of time. To achieve this in a timely manner, which is a c

    Read more →
  • Handwriting recognition

    Handwriting recognition

    Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning (optical character recognition) or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most possible words. == Offline recognition == Offline handwriting recognition involves the automatic conversion of text in an image into letter codes that are usable within computer and text-processing applications. The data obtained by this form is regarded as a static representation of handwriting. Offline handwriting recognition is comparatively difficult, as different people have different handwriting styles. And, as of today, OCR engines are primarily focused on machine printed text and ICR for hand "printed" (written in capital letters) text. === Traditional techniques === ==== Character extraction ==== Offline character recognition often involves scanning a form or document. This means the individual characters contained in the scanned image will need to be extracted. Tools exist that are capable of performing this step. However, there are several common imperfections in this step. The most common is when characters that are connected are returned as a single sub-image containing both characters. This causes a major problem in the recognition stage. Yet many algorithms are available that reduce the risk of connected characters. ==== Character recognition ==== After individual characters have been extracted, a recognition engine is used to identify the corresponding computer character. Several different recognition techniques are currently available. ===== Feature extraction ===== Feature extraction works in a similar fashion to neural network recognizers. However, programmers must manually determine the properties they feel are important. This approach gives the recognizer more control over the properties used in identification. Yet any system using this approach requires substantially more development time than a neural network because the properties are not learned automatically. === Modern techniques === Where traditional techniques focus on segmenting individual characters for recognition, modern techniques focus on recognizing all the characters in a segmented line of text. Particularly they focus on machine learning techniques that are able to learn visual features, avoiding the limiting feature engineering previously used. State-of-the-art methods use convolutional networks to extract visual features over several overlapping windows of a text line image which a recurrent neural network uses to produce character probabilities. == Online recognition == Online handwriting recognition involves the automatic conversion of text as it is written on a special digitizer or PDA, where a sensor picks up the pen-tip movements as well as pen-up/pen-down switching. This kind of data is known as digital ink and can be regarded as a digital representation of handwriting. The obtained signal is converted into letter codes that are usable within computer and text-processing applications. The elements of an online handwriting recognition interface typically include: a pen or stylus for the user to write with a touch sensitive surface, which may be integrated with, or adjacent to, an output display. a software application which interprets the movements of the stylus across the writing surface, translating the resulting strokes into digital text. The process of online handwriting recognition can be broken down into a few general steps: preprocessing, feature extraction and classification The purpose of preprocessing is to discard irrelevant information in the input data, that can negatively affect the recognition. This concerns speed and accuracy. Preprocessing usually consists of binarization, normalization, sampling, smoothing and denoising. The second step is feature extraction. Out of the two- or higher-dimensional vector field received from the preprocessing algorithms, higher-dimensional data is extracted. The purpose of this step is to highlight important information for the recognition model. This data may include information like pen pressure, velocity or the changes of writing direction. The last big step is classification. In this step, various models are used to map the extracted features to different classes and thus identifying the characters or words the features represent. === Hardware === Commercial products incorporating handwriting recognition as a replacement for keyboard input were introduced in the early 1980s. Examples include handwriting terminals such as the Pencept Penpad and the Inforite point-of-sale terminal. With the advent of the large consumer market for personal computers, several commercial products were introduced to replace the keyboard and mouse on a personal computer with a single pointing/handwriting system, such as those from Pencept, CIC and others. The first commercially available tablet-type portable computer was the Write-Top from Linus Technologies, released in July 1988. Its operating system was based on MS-DOS. In the early 1990s, hardware makers including NCR, IBM and EO released tablet computers running the PenPoint operating system developed by GO Corp. PenPoint used handwriting recognition and gestures throughout and provided the facilities to third-party software. IBM's tablet computer was the first to use the ThinkPad name and used IBM's handwriting recognition. This recognition system was later ported to Microsoft Windows for Pen Computing, and IBM's Pen for OS/2. None of these were commercially successful. Advancements in electronics allowed the computing power necessary for handwriting recognition to fit into a smaller form factor than tablet computers, and handwriting recognition is often used as an input method for hand-held PDAs. The first PDA to provide written input was the Apple Newton, which exposed the public to the advantage of a streamlined user interface. However, the device was not a commercial success, owing to the unreliability of the software, which tried to learn a user's writing patterns. By the time of the release of the Newton OS 2.0, wherein the handwriting recognition was greatly improved, including unique features still not found in current recognition systems such as modeless error correction, the largely negative first impression had been made. After discontinuation of Apple Newton, the feature was incorporated in Mac OS X 10.2 and later as Inkwell. Palm later launched a successful series of PDAs based on the Graffiti recognition system. Graffiti improved usability by defining a set of "unistrokes", or one-stroke forms, for each character. This narrowed the possibility for erroneous input, although memorization of the stroke patterns did increase the learning curve for the user. The Graffiti handwriting recognition was found to infringe on a patent held by Xerox, and Palm replaced Graffiti with a licensed version of the CIC handwriting recognition which, while also supporting unistroke forms, pre-dated the Xerox patent. The court finding of infringement was reversed on appeal, and then reversed again on a later appeal. The parties involved subsequently negotiated a settlement concerning this and other patents. A Tablet PC is a notebook computer with a digitizer tablet and a stylus, which allows a user to handwrite text on the unit's screen. The operating system recognizes the handwriting and converts it into text. Windows Vista and Windows 7 include personalization features that learn a user's writing patterns or vocabulary for English, Japanese, Chinese Traditional, Chinese Simplified and Korean. The features include a "personalization wizard" that prompts for samples of a user's handwriting and uses them to retrain the system for higher accuracy recognition. This system is distinct from the less advanced handwriting recognition system employed in its Windows Mobile OS for PDAs. Although handwriting recognition is an input form that the public has become accustomed to, it has not achieved widespread use in either desktop computers or laptops. It is still generally accepted that keyboard input is both faster and more reliable. As of 2006, many PDAs offer handwriting input, sometimes even accepting natural cursive handwriting, but accuracy is still a problem, and some people still find even a simple on-screen keyboard more efficient. === Software === Early software could understand print handwriting where the characters were separated; however, cursive handwriting

    Read more →
  • Conference app

    Conference app

    A conference app, also known as an event app or meeting app, is a mobile app developed to help attendees and meeting planners manage their conference experience. It typically includes conference proceedings and venue information, allowing users to create personalized schedules and engage with other users. A conference app can be a native app or web-based. In recent years, conference apps have gained in popularity as a sustainable solution for event management by reducing paper produced by printed materials. Advanced features often include real-time notifications for updates or changes, integration with virtual meeting platforms for hybrid or fully online events, and analytics tools for organizers to measure attendance and engagement. Additionally, some apps support sponsorship and exhibitor features, enabling businesses to showcase their products or services directly within the app.

    Read more →
  • CN2 algorithm

    CN2 algorithm

    The CN2 induction algorithm is a learning algorithm for rule induction. It is designed to work even when the training data is imperfect. It is based on ideas from the AQ algorithm and the ID3 algorithm. As a consequence it creates a rule set like that created by AQ but is able to handle noisy data like ID3. == Description of algorithm == The algorithm must be given a set of examples, TrainingSet, which have already been classified in order to generate a list of classification rules. A set of conditions, SimpleConditionSet, which can be applied, alone or in combination, to any set of examples is predefined to be used for the classification. routine CN2(TrainingSet) let the ClassificationRuleList be empty repeat let the BestConditionExpression be Find_BestConditionExpression(TrainingSet) if the BestConditionExpression is not nil then let the TrainingSubset be the examples covered by the BestConditionExpression remove from the TrainingSet the examples in the TrainingSubset let the MostCommonClass be the most common class of examples in the TrainingSubset append to the ClassificationRuleList the rule 'if ' the BestConditionExpression ' then the class is ' the MostCommonClass until the TrainingSet is empty or the BestConditionExpression is nil return the ClassificationRuleList routine Find_BestConditionExpression(TrainingSet) let the ConditionalExpressionSet be empty let the BestConditionExpression be nil repeat let the TrialConditionalExpressionSet be the set of conditional expressions, {x and y where x belongs to the ConditionalExpressionSet and y belongs to the SimpleConditionSet}. remove all formulae in the TrialConditionalExpressionSet that are either in the ConditionalExpressionSet (i.e., the unspecialized ones) or null (e.g., big = y and big = n) for every expression, F, in the TrialConditionalExpressionSet if F is statistically significant and F is better than the BestConditionExpression by user-defined criteria when tested on the TrainingSet then replace the current value of the BestConditionExpression by F while the number of expressions in the TrialConditionalExpressionSet > user-defined maximum remove the worst expression from the TrialConditionalExpressionSet let the ConditionalExpressionSet be the TrialConditionalExpressionSet until the ConditionalExpressionSet is empty return the BestConditionExpression

    Read more →
  • Q-learning

    Q-learning

    Q-learning is a reinforcement learning algorithm that trains an agent to assign values to its possible actions based on its current state, without requiring a model of the environment (model-free). It can handle problems with stochastic transitions and rewards without requiring adaptations. For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, Q-learning might assign a higher value to moving right than left if right gets to the exit faster, improving this choice by trying both directions over time. For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given finite Markov decision process, given infinite exploration time and a partly random policy. "Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state. == Reinforcement learning == Reinforcement learning involves an agent, a set of states S {\displaystyle {\mathcal {S}}} , and a set A {\displaystyle {\mathcal {A}}} of actions per state. By performing an action a ∈ A {\displaystyle a\in {\mathcal {A}}} , the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state. As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as they open, minimizing the initial wait time for yourself. If the train is crowded, however, then you will have a slow entry after the initial action of entering the door as people are fighting you to depart the train as you attempt to board. The total boarding time, or cost, is then: 0 seconds wait time + 15 seconds fight time On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now: 5 second wait time + 0 second fight time Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy. == Algorithm == After Δ t {\displaystyle \Delta t} steps into the future the agent will decide some next step. The weight for this step is calculated as γ Δ t {\displaystyle \gamma ^{\Delta t}} , where γ {\displaystyle \gamma } (the discount factor) is a number between 0 and 1 ( 0 ≤ γ ≤ 1 {\displaystyle 0\leq \gamma \leq 1} ). Assuming γ < 1 {\displaystyle \gamma <1} , it has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). γ {\displaystyle \gamma } may also be interpreted as the probability to succeed (or survive) at every step Δ t {\displaystyle \Delta t} . The algorithm, therefore, has a function that calculates the quality of a state–action combination: Q : S × A → R {\displaystyle Q:{\mathcal {S}}\times {\mathcal {A}}\to \mathbb {R} } . Before learning begins, ⁠ Q {\displaystyle Q} ⁠ is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t {\displaystyle t} the agent selects an action A t {\displaystyle A_{t}} , observes a reward R t + 1 {\displaystyle R_{t+1}} , enters a new state S t + 1 {\displaystyle S_{t+1}} (that may depend on both the previous state S t {\displaystyle S_{t}} and the selected action), and Q {\displaystyle Q} is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information: Q n e w ( S t , A t ) ← ( 1 − α ⏟ learning rate ) ⋅ Q ( S t , A t ) ⏟ current value + α ⏟ learning rate ⋅ ( R t + 1 ⏟ reward + γ ⏟ discount factor ⋅ max a Q ( S t + 1 , a ) ⏟ estimate of optimal future value ⏟ new value (temporal difference target) ) {\displaystyle Q^{new}(S_{t},A_{t})\leftarrow (1-\underbrace {\alpha } _{\text{learning rate}})\cdot \underbrace {Q(S_{t},A_{t})} _{\text{current value}}+\underbrace {\alpha } _{\text{learning rate}}\cdot {\bigg (}\underbrace {\underbrace {R_{t+1}} _{\text{reward}}+\underbrace {\gamma } _{\text{discount factor}}\cdot \underbrace {\max _{a}Q(S_{t+1},a)} _{\text{estimate of optimal future value}}} _{\text{new value (temporal difference target)}}{\bigg )}} where R t + 1 {\displaystyle R_{t+1}} is the reward received when moving from the state S t {\displaystyle S_{t}} to the state S t + 1 {\displaystyle S_{t+1}} , and α {\displaystyle \alpha } is the learning rate ( 0 < α ≤ 1 ) {\displaystyle (0<\alpha \leq 1)} . Note that Q n e w ( S t , A t ) {\displaystyle Q^{new}(S_{t},A_{t})} is the sum of three terms: ( 1 − α ) Q ( S t , A t ) {\displaystyle (1-\alpha )Q(S_{t},A_{t})} : the current value (weighted by one minus the learning rate) α R t + 1 {\displaystyle \alpha \,R_{t+1}} : the reward R t + 1 {\displaystyle R_{t+1}} to obtain if action A t {\displaystyle A_{t}} is taken when in state S t {\displaystyle S_{t}} (weighted by learning rate) α γ max a Q ( S t + 1 , a ) {\displaystyle \alpha \gamma \max _{a}Q(S_{t+1},a)} : the maximum reward that can be obtained from state S t + 1 {\displaystyle S_{t+1}} (weighted by learning rate and discount factor) An episode of the algorithm ends when state S t + 1 {\displaystyle S_{t+1}} is a final or terminal state. However, Q-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops or paths. For all final states s f {\displaystyle s_{f}} , Q ( s f , a ) {\displaystyle Q(s_{f},a)} is never updated, but is set to the reward value r {\displaystyle r} observed for state s f {\displaystyle s_{f}} . In most cases, Q ( s f , a ) {\displaystyle Q(s_{f},a)} can be taken to equal zero. == Influence of variables == === Learning rate === The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of α t = 1 {\displaystyle \alpha _{t}=1} is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as α t = 0.1 {\displaystyle \alpha _{t}=0.1} for all t {\displaystyle t} . === Discount factor === The discount factor ⁠ γ {\displaystyle \gamma } ⁠ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. r t {\displaystyle r_{t}} (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For ⁠ γ = 1 {\displaystyle \gamma =1} ⁠, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning. === Initial conditions (Q0) === Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternative, thus increasing their choice probability. The first reward r {\displaystyle r} can be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value

    Read more →