Best AI for Resume

Best AI for Resume — hands-on reviews, top picks, pricing, pros and cons and a practical how-to guide on Aizhi.

  • Artificial Inventor Project

    Artificial Inventor Project

    The Artificial Inventor Project (AIP) is a global legal initiative headed by Professor Ryan Abbott dedicated to pursuing intellectual property (IP) rights for inventions and creative works generated autonomously by artificial intelligence (AI) systems without traditional human inventorship or authorship. The project coordinates a series of pro bono test cases worldwide, aiming to prompt law reform and public debate on how IP law should accommodate non-human creators. == History == In 2019, AIP filed patent applications in multiple jurisdictions, including the United States, United Kingdom, European Patent Office, Australia, Switzerland, and South Africa, naming the AI system DABUS (Device for the Autonomous Bootstrapping of Unified Sentience), created by Stephen Thaler, as the inventor. The aim was to challenge legal norms that require inventors to be natural persons and highlight pressing policy questions about AI-generated innovation and IP regimes. == Legal proceedings by jurisdiction == === Australia === In July 2021, a Federal Court of Australia judge (Beach J) ruled that AI can be considered an inventor under the Patents Act 1990, ordering IP Australia to reinstate the relevant patent. However, the full court then overturned this ruling on appeal and denied further review. === European Patent Office === The EPO Board of Appeal determined in 2022 that only a human inventor may be named, rendering DABUS‑based applications unacceptable. === South Africa === In 2021, a patent was granted listing DABUS as the inventor. As South Africa’s procedural system does not involve substantive inventorship review, the grant proceeded on formal grounds alone. === Switzerland === On 26 June 2025, the Swiss Federal Administrative Court ruled that artificial intelligence systems such as DABUS cannot be listed as inventors on patent applications. The court upheld the existing practice of the Swiss Federal Institute of Intellectual Property (IPI), affirming that only natural persons may be recognized as inventors under Swiss patent law. === United Kingdom === In December 2023, the UK Supreme Court unanimously held that AI systems cannot be legally recognized as inventors, affirming that "an inventor must be a person" under current British law. === United States === In Thaler v. Hirshfeld (2021), a U.S. federal court agreed with the USPTO that inventors must be natural persons, rejecting the DABUS application and setting a precedent consistent with existing statute and administrative policy. == Criticism and impact == The project has fueled substantial discourse. Critics caution that allowing AI inventorship may complicate notions of accountability and ownership. Proponents argue that legal recognition must evolve to avoid disincentivizing innovation produced by AI and to maintain honesty about the true source of invention.

    Read more →
  • Cellular evolutionary algorithm

    Cellular evolutionary algorithm

    A cellular evolutionary algorithm (cEA) is a kind of evolutionary algorithm (EA) in which individuals cannot mate arbitrarily, but every one interacts with its closer neighbors on which a basic EA is applied (selection, variation, replacement). The cellular model simulates natural evolution from the point of view of the individual, which encodes a tentative optimization, learning, or search problem solution. The essential idea of this model is to provide the EA population with a special structure defined as a connected graph, in which each vertex is an individual who communicates with his nearest neighbors. Particularly, individuals are conceptually set in a toroidal mesh, and are only allowed to recombine with close individuals. This leads to a kind of locality known as "isolation by distance". The set of potential mates of an individual is called its "neighborhood". It is known that, in this kind of algorithm, similar individuals tend to cluster creating niches, and these groups operate as if they were separate sub-populations (islands). There is no clear borderline between adjacent groups, and close niches could be easily colonized by competitive niches and potentially merge solution contents during the process. Simultaneously, farther niches can be affected more slowly. == Introduction == A cellular evolutionary algorithm (cEA) usually evolves a structured bidimensional grid of individuals, although other topologies are also possible. In this grid, clusters of similar individuals are naturally created during evolution, promoting exploration in their boundaries, while exploitation is mainly performed by direct competition and merging inside them. The grid is usually 2D toroidal structure, although the number of dimensions can be easily extended (to 3D) or reduced (to 1D, e.g. a ring). The neighborhood of a particular point of the grid (where an individual is placed) is defined in terms of the Manhattan distance from it to others in the population. Each point of the grid has a neighborhood that overlaps the neighborhoods of nearby individuals. In the basic algorithm, all the neighborhoods have the same size and identical shapes. The two most commonly used neighborhoods are L5, also called the Von Neumann or NEWS (North, East, West and South) neighborhood, and C9, also known as the Moore neighborhood. Here, L stands for "linear" while C stands for "compact". In cEAs, the individuals can only interact with their neighbors in the reproductive cycle where the variation operators are applied. This reproductive cycle is executed inside the neighborhood of each individual and, generally, consists in selecting two parents among its neighbors according to a certain criterion, applying the variation operators to them (recombination and mutation for example), and replacing the considered individual by the recently created offspring following a given criterion, for instance, replace if the offspring represents a better solution than the considered individual. == Synchronous versus asynchronous == In a regular synchronous cEA, the algorithm proceeds from the very first top left individual to the right and then to the several rows by using the information in the population to create a new temporary population. After finishing with the bottom-right last individual the temporary population is full with the newly computed individuals, and the replacement step starts. In it, the old population is completely and synchronously replaced with the newly computed one according to some criterion. Usually, the replacement keeps the best individual in the same position of both populations, that is, elitism is used. According to the update policy of the population used, an asynchronous cEA may also be defined and is a well-known issue in cellular automata. In asynchronous cEAs the order in which the individuals in the grid are update changes depending on the choice of criterion: line sweep, fixed random sweep, new random sweep, and uniform choice. All four proceed using the newly computed individual (or the original if better) for the computations of its neighbors. The overlap of the neighborhoods provides an implicit mechanism of solution migration to the cEA. Since the best solutions spread smoothly through the whole population, genetic diversity in the population is preserved longer than in non structured EAs. This soft dispersion of the best solutions through the population is one of the main issues of the good tradeoff between exploration and exploitation that cEAs perform during the search. This tradeoff can be tuned (and by extension the genetic diversity level along the evolution) by modifying (for instance) the size of the neighborhood used, as the overlap degree between the neighborhoods grows according to the size of the neighborhood. A cEA can be seen as a cellular automaton (CA) with probabilistic rewritable rules, where the alphabet of the CA is equivalent to the potential number of solutions of the problem. Hence, knowledge from research in CAs can be applied to cEAs. == Parallelism == Cellular EAs are very amenable to parallelism, thus usually found in the literature of parallel metaheuristics. In particular, fine grain parallelism can be used to assign independent threads of execution to every individual, thus allowing the whole cEA to run on a concurrent or actually parallel hardware platform. In this way, large time reductions can be obtained when running cEAs on FPGAs or GPUs. However, it is important to stress that cEAs are a model of search, in many senses different from traditional EAs. Also, they can be run in sequential and parallel platforms, reinforcing the fact that the model and the implementation are two different concepts. See here for a complete description on the fundamentals for the understanding, design, and application of cEAs.

    Read more →
  • Averaged one-dependence estimators

    Averaged one-dependence estimators

    Averaged one-dependence estimators (AODE) is a probabilistic classification learning technique. It was developed to address the attribute-independence problem of the popular naive Bayes classifier. It frequently develops substantially more accurate classifiers than naive Bayes at the cost of a modest increase in the amount of computation. == The AODE classifier == AODE seeks to estimate the probability of each class y given a specified set of features x1, ... xn, P(y | x1, ... xn). To do so it uses the formula P ^ ( y ∣ x 1 , … x n ) = ∑ i : 1 ≤ i ≤ n ∧ F ( x i ) ≥ m P ^ ( y , x i ) ∏ j = 1 n P ^ ( x j ∣ y , x i ) ∑ y ′ ∈ Y ∑ i : 1 ≤ i ≤ n ∧ F ( x i ) ≥ m P ^ ( y ′ , x i ) ∏ j = 1 n P ^ ( x j ∣ y ′ , x i ) {\displaystyle {\hat {P}}(y\mid x_{1},\ldots x_{n})={\frac {\sum _{i:1\leq i\leq n\wedge F(x_{i})\geq m}{\hat {P}}(y,x_{i})\prod _{j=1}^{n}{\hat {P}}(x_{j}\mid y,x_{i})}{\sum _{y^{\prime }\in Y}\sum _{i:1\leq i\leq n\wedge F(x_{i})\geq m}{\hat {P}}(y^{\prime },x_{i})\prod _{j=1}^{n}{\hat {P}}(x_{j}\mid y^{\prime },x_{i})}}} where P ^ ( ⋅ ) {\displaystyle {\hat {P}}(\cdot )} denotes an estimate of P ( ⋅ ) {\displaystyle P(\cdot )} , F ( ⋅ ) {\displaystyle F(\cdot )} is the frequency with which the argument appears in the sample data and m is a user specified minimum frequency with which a term must appear in order to be used in the outer summation. In recent practice m is usually set at 1. == Derivation of the AODE classifier == We seek to estimate P(y | x1, ... xn). By the definition of conditional probability P ( y ∣ x 1 , … x n ) = P ( y , x 1 , … x n ) P ( x 1 , … x n ) . {\displaystyle P(y\mid x_{1},\ldots x_{n})={\frac {P(y,x_{1},\ldots x_{n})}{P(x_{1},\ldots x_{n})}}.} For any 1 ≤ i ≤ n {\displaystyle 1\leq i\leq n} , P ( y , x 1 , … x n ) = P ( y , x i ) P ( x 1 , … x n ∣ y , x i ) . {\displaystyle P(y,x_{1},\ldots x_{n})=P(y,x_{i})P(x_{1},\ldots x_{n}\mid y,x_{i}).} Under an assumption that x1, ... xn are independent given y and xi, it follows that P ( y , x 1 , … x n ) = P ( y , x i ) ∏ j = 1 n P ( x j ∣ y , x i ) . {\displaystyle P(y,x_{1},\ldots x_{n})=P(y,x_{i})\prod _{j=1}^{n}P(x_{j}\mid y,x_{i}).} This formula defines a special form of One Dependence Estimator (ODE), a variant of the naive Bayes classifier that makes the above independence assumption that is weaker (and hence potentially less harmful) than the naive Bayes' independence assumption. In consequence, each ODE should create a less biased estimator than naive Bayes. However, because the base probability estimates are each conditioned by two variables rather than one, they are formed from less data (the training examples that satisfy both variables) and hence are likely to have more variance. AODE reduces this variance by averaging the estimates of all such ODEs. == Features of the AODE classifier == Like naive Bayes, AODE does not perform model selection and does not use tuneable parameters. As a result, it has low variance. It supports incremental learning whereby the classifier can be updated efficiently with information from new examples as they become available. It predicts class probabilities rather than simply predicting a single class, allowing the user to determine the confidence with which each classification can be made. Its probabilistic model can directly handle situations where some data are missing. AODE has computational complexity O ( l n 2 ) {\displaystyle O(ln^{2})} at training time and O ( k n 2 ) {\displaystyle O(kn^{2})} at classification time, where n is the number of features, l is the number of training examples and k is the number of classes. This makes it infeasible for application to high-dimensional data. However, within that limitation, it is linear with respect to the number of training examples and hence can efficiently process large numbers of training examples. == Implementations == The free Weka machine learning suite includes an implementation of AODE.

    Read more →
  • Soft independent modelling of class analogies

    Soft independent modelling of class analogies

    Soft independent modelling by class analogy (SIMCA) is a statistical method for supervised classification of data. The method requires a training data set consisting of samples (or objects) with a set of attributes and their class membership. The term soft refers to the fact the classifier can identify samples as belonging to multiple classes and not necessarily producing a classification of samples into non-overlapping classes. == Method == In order to build the classification models, the samples belonging to each class need to be analysed using principal component analysis (PCA); only the significant components are retained. For a given class, the resulting model then describes either a line (for one Principal Component or PC), plane (for two PCs) or hyper-plane (for more than two PCs). For each modelled class, the mean orthogonal distance of training data samples from the line, plane, or hyper-plane (calculated as the residual standard deviation) is used to determine a critical distance for classification. This critical distance is based on the F-distribution and is usually calculated using 95% or 99% confidence intervals. New observations are projected into each PC model and the residual distances calculated. An observation is assigned to the model class when its residual distance from the model is below the statistical limit for the class. The observation may be found to belong to multiple classes and a measure of goodness of the model can be found from the number of cases where the observations are classified into multiple classes. The classification efficiency is usually indicated by Receiver operating characteristics. In the original SIMCA method, the ends of the hyper-plane of each class are closed off by setting statistical control limits along the retained principal components axes (i.e., score value between plus and minus 0.5 times score standard deviation). More recent adaptations of the SIMCA method close off the hyper-plane by construction of ellipsoids (e.g. Hotelling's T2 or Mahalanobis distance). With such modified SIMCA methods, classification of an object requires both that its orthogonal distance from the model and its projection within the model (i.e. score value within the region defined by the ellipsoid) are not significant. == Application == SIMCA as a method of classification has gained widespread use especially in applied statistical fields such as chemometrics and spectroscopic data analysis.

    Read more →
  • Inpainting

    Inpainting

    Inpainting is a conservation process where damaged, deteriorated, or missing parts of an artwork are filled in to present a complete image. This process is commonly used in image restoration. It can be applied to both physical and digital art mediums such as oil or acrylic paintings, chemical photographic prints, sculptures, or digital images and video. With its roots in physical artwork, such as painting and sculpture, traditional inpainting is performed by a trained art conservator who has carefully studied the artwork to determine the mediums and techniques used in the piece, potential risks of treatments, and ethical appropriateness of treatment. == History == The modern use of inpainting can be traced back to Pietro Edwards (1744–1821), Director of the Restoration of the Public Pictures in Venice, Italy. Using a scientific approach, Edwards focused his restoration efforts on the intentions of the artist. It was during the 1930 International Conference for the Study of Scientific Methods for the Examination and Preservation of Works of Art, that the modern approach to inpainting was established. Helmut Ruhemann (1891–1973), a German restorer and conservator, led the discussions on the use of inpainting in conservation. Helmut Ruhemann was a leading figure in modernizing restoration and conservation. His greatest contribution to the field of conservation "was his insistence on following the methods of the original painter exactly, and on understanding the painter's artistic intention". After his career of over 40 years as a conservator, Ruhemann published his treatise The Cleaning of Paintings: Problems & Potentialities in 1968. In describing his method, Ruhemann states that "The surface [of the fill] should be slightly lower than that of the surrounding paint to allow for the thickness of the inpainting...Inpainting medium should look and behave like the original medium, but must not darken with age." Cesare Brandi (1906–1988) developed the teoria del restauro, the inpainting approach combining aesthetics and psychology. However, this approach was used primarily by Italian restorers and conservators, with the terminology becoming widespread in the 1990s. Technological advancements led to new applications of inpainting. Widespread use of digital techniques range from entirely automatic computerized inpainting to tools used to simulate the process manually. Since the mid-1990s, the process of inpainting has evolved to include digital media. More commonly known as image or video interpolation, a form of estimation, digital inpainting includes the use of computer software that relies on sophisticated algorithms to replace lost or corrupted parts of the image data. == Ethics == In order to preserve the integrity of an original artwork, any inpainting technique or treatment applied to physical or digital work should be reversible or distinguishable from the original content of the artwork. Prior to any treatments, conservators proceed according to the American Institute of Conservation of Historical and Artistic Works. There are several ethic considerations before Inpainting can be justified. Various deliberation decisions over the ethical appropriateness of the amount and type of inpainting done, resides on many factors. As most conservation treatments, inpainting's ethical questions rest mainly with authenticity, reversibility and documentation.Any intervention to compensate for loss should be documented in treatment records and reports and should be detectable by common examination methods. Such compensation should be reversible and should not falsely modify the known aesthetic, conceptual, and physical characteristics of the cultural property, especially by removing or obscuring original material.New technologies and the aesthetic demand for perfect images without imperfections challenge conservators' ethical practices to protect the integrity of originals. == Methods == Inpainting methods and techniques depend on the desired goal and type of image being treated. Treatments to fill in the gaps are different between physical and digital art. In inpainting, detailed records of the initial state of the images can help with the treatment and replicate the original closer. === Physical inpainting === Inpainting is rooted in the conservation and restoration of paintings. Inpainting can aim to make a visual improvement to the artwork as a whole by repairing missing or damaged parts using methods and materials equivalent to the original artist's work. ==== Application techniques ==== By studying the painting methods of various artists and the composition of paints used historically, conservators are able to restore works very closely to their original visual appearance. The picture as a whole determines how to fill in the gap. Helmut Ruhemann's inpainting techniques by Jessell have procedures to "preserve" the quality of oil and tempera paintings. === Digital inpainting === Many programs are able to reconstruct missing or damaged areas of digital photographs and videos. Most widely known for use with digital images is Adobe Photoshop. Given the various abilities of the digital camera and the digitization of old photos, inpainting has become an automatic process that can be performed on digital images. The inpainting techniques can be applied to object removal, text removal, and other automatic modifications of images and videos. In video special effects, inpainting is usually performed after video matting. They can also be observed in applications like image compression and super-resolution. In photography and cinema, it is used for film restoration to reverse, repair, or mitigate deterioration (e.g., physical damage such as cracks in photographs, scratches and dust spots in film, or chemical damage resulting in image loss; performed infrared cleaning). It can also be used for removing red-eye, the stamped date from photographs, and objects for creative effect. This technique can be used to replace any lost blocks in the coding and transmission of images, for example, in a streaming video. It can also be used to remove logos or watermarks in videos. Deep learning neural network-based inpainting can be used for decensoring images. Deep image prior-based techniques can be used for digital image inpainting, where a trained deep learning model is either unavailable or infeasible. Deep models for visual content generation, like text-to-image or text-to-video, learn complex priors over the distribution of visual content, and can be used to inpaint missing parts. For example, videos can be separated into layers, using a technique called omnimatte, which either pretrain an omnimatte model or without any training using an omnimatte-zero model. Three main groups of 2D image-inpainting algorithms can be found in the literature. The first one to be noted is structural (or geometric) inpainting, the second one is texture inpainting, the last one is a combination of these two techniques. They use the information of the known or non-destroyed image areas in order to fill the gap, similar to how physical images are restored. ==== Structural ==== Structural or geometric inpainting is used for smooth images that have strong, defined borders. There are many different approaches to geometric inpainting, but they all come from the idea that geometry can be recovered from similar areas or domains. Bertalmio proposed a method of structural inpainting that mimics how conservators address painting restoration. Bertalmio proposed that by progressively transferring similar information from the borders of an inpainting domain inwards, the gap can be filled. ==== Textural ==== While structural/geometric inpainting works to repair smooth images, textural inpainting works best with images that are heavily textured. Texture has a repetitive pattern which means that a missing portion cannot be restored by continuing the level lines into the gap; level lines provide a complete, stable representation of an image. To repair texture in an image, one can combine frequency and spatial domain information to fill in a selected area with a desired texture. This method, while the most simple and very effective, works well when selecting a texture to be in-painted. For a texture that covers a wider area or a larger frame one would have to go through the image segmenting the areas to be in-painted and selecting the corresponding textures from throughout the image; there are programs that can help find the corresponding areas that work in a similar way as 'find and replace' works in a word processor. ==== Combined structural and textural ==== Combined structural and textural inpainting approaches simultaneously try to perform texture- and structure-filling in regions of missing image information. Most parts of an image consist of texture and structure and the boundaries between image regions contain a large amount of structural information. This is the result when blending differ

    Read more →
  • Memtransistor

    Memtransistor

    The memtransistor (a blend word from Memory Transfer Resistor) is an experimental multi-terminal passive electronic component that might be used in the construction of artificial neural networks. It is a combination of the memristor and transistor technology. This technology is different from the 1T-1R approach since the devices are merged into one single entity. Multiple memristors can be embedded with a single transistor, enabling it to more accurately model a neuron with its multiple synaptic connections. A neural network produced from these would provide hardware-based artificial intelligence with a good foundation. == Applications == These types of devices would allow for a synapse model that could realise a learning rule, by which the synaptic efficacy is altered by voltages applied to the terminals of the device. An example of such a learning rule is spike-timing-dependant-plasticty by which the weight of the synapse, in this case the conductivity, could be modulated based on the timing of pre and post synaptic spikes arriving at each terminal. The advantage of this approach over two terminal memristive devices is that read and write protocols have the possibility to occur simultaneously and distinctly.

    Read more →
  • Q-learning

    Q-learning

    Q-learning is a reinforcement learning algorithm that trains an agent to assign values to its possible actions based on its current state, without requiring a model of the environment (model-free). It can handle problems with stochastic transitions and rewards without requiring adaptations. For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, Q-learning might assign a higher value to moving right than left if right gets to the exit faster, improving this choice by trying both directions over time. For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given finite Markov decision process, given infinite exploration time and a partly random policy. "Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state. == Reinforcement learning == Reinforcement learning involves an agent, a set of states S {\displaystyle {\mathcal {S}}} , and a set A {\displaystyle {\mathcal {A}}} of actions per state. By performing an action a ∈ A {\displaystyle a\in {\mathcal {A}}} , the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state. As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as they open, minimizing the initial wait time for yourself. If the train is crowded, however, then you will have a slow entry after the initial action of entering the door as people are fighting you to depart the train as you attempt to board. The total boarding time, or cost, is then: 0 seconds wait time + 15 seconds fight time On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now: 5 second wait time + 0 second fight time Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy. == Algorithm == After Δ t {\displaystyle \Delta t} steps into the future the agent will decide some next step. The weight for this step is calculated as γ Δ t {\displaystyle \gamma ^{\Delta t}} , where γ {\displaystyle \gamma } (the discount factor) is a number between 0 and 1 ( 0 ≤ γ ≤ 1 {\displaystyle 0\leq \gamma \leq 1} ). Assuming γ < 1 {\displaystyle \gamma <1} , it has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). γ {\displaystyle \gamma } may also be interpreted as the probability to succeed (or survive) at every step Δ t {\displaystyle \Delta t} . The algorithm, therefore, has a function that calculates the quality of a state–action combination: Q : S × A → R {\displaystyle Q:{\mathcal {S}}\times {\mathcal {A}}\to \mathbb {R} } . Before learning begins, ⁠ Q {\displaystyle Q} ⁠ is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t {\displaystyle t} the agent selects an action A t {\displaystyle A_{t}} , observes a reward R t + 1 {\displaystyle R_{t+1}} , enters a new state S t + 1 {\displaystyle S_{t+1}} (that may depend on both the previous state S t {\displaystyle S_{t}} and the selected action), and Q {\displaystyle Q} is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information: Q n e w ( S t , A t ) ← ( 1 − α ⏟ learning rate ) ⋅ Q ( S t , A t ) ⏟ current value + α ⏟ learning rate ⋅ ( R t + 1 ⏟ reward + γ ⏟ discount factor ⋅ max a Q ( S t + 1 , a ) ⏟ estimate of optimal future value ⏟ new value (temporal difference target) ) {\displaystyle Q^{new}(S_{t},A_{t})\leftarrow (1-\underbrace {\alpha } _{\text{learning rate}})\cdot \underbrace {Q(S_{t},A_{t})} _{\text{current value}}+\underbrace {\alpha } _{\text{learning rate}}\cdot {\bigg (}\underbrace {\underbrace {R_{t+1}} _{\text{reward}}+\underbrace {\gamma } _{\text{discount factor}}\cdot \underbrace {\max _{a}Q(S_{t+1},a)} _{\text{estimate of optimal future value}}} _{\text{new value (temporal difference target)}}{\bigg )}} where R t + 1 {\displaystyle R_{t+1}} is the reward received when moving from the state S t {\displaystyle S_{t}} to the state S t + 1 {\displaystyle S_{t+1}} , and α {\displaystyle \alpha } is the learning rate ( 0 < α ≤ 1 ) {\displaystyle (0<\alpha \leq 1)} . Note that Q n e w ( S t , A t ) {\displaystyle Q^{new}(S_{t},A_{t})} is the sum of three terms: ( 1 − α ) Q ( S t , A t ) {\displaystyle (1-\alpha )Q(S_{t},A_{t})} : the current value (weighted by one minus the learning rate) α R t + 1 {\displaystyle \alpha \,R_{t+1}} : the reward R t + 1 {\displaystyle R_{t+1}} to obtain if action A t {\displaystyle A_{t}} is taken when in state S t {\displaystyle S_{t}} (weighted by learning rate) α γ max a Q ( S t + 1 , a ) {\displaystyle \alpha \gamma \max _{a}Q(S_{t+1},a)} : the maximum reward that can be obtained from state S t + 1 {\displaystyle S_{t+1}} (weighted by learning rate and discount factor) An episode of the algorithm ends when state S t + 1 {\displaystyle S_{t+1}} is a final or terminal state. However, Q-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops or paths. For all final states s f {\displaystyle s_{f}} , Q ( s f , a ) {\displaystyle Q(s_{f},a)} is never updated, but is set to the reward value r {\displaystyle r} observed for state s f {\displaystyle s_{f}} . In most cases, Q ( s f , a ) {\displaystyle Q(s_{f},a)} can be taken to equal zero. == Influence of variables == === Learning rate === The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of α t = 1 {\displaystyle \alpha _{t}=1} is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as α t = 0.1 {\displaystyle \alpha _{t}=0.1} for all t {\displaystyle t} . === Discount factor === The discount factor ⁠ γ {\displaystyle \gamma } ⁠ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. r t {\displaystyle r_{t}} (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For ⁠ γ = 1 {\displaystyle \gamma =1} ⁠, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning. === Initial conditions (Q0) === Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternative, thus increasing their choice probability. The first reward r {\displaystyle r} can be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value

    Read more →
  • Evolutionary attractor

    Evolutionary attractor

    An evolutionary attractor is a point in an evolutionary space where a selection process will always drive trait values towards that point from the region around it. Because of the importance of evolution through natural selection, often such an evolutionary space will be defined by genetic or phenotypic traits, or possibly both. In this case the selection process will be a form of natural selection. The existence of an evolutionary attractor in a biological evolutionary space does not always imply that it can be reached from all points in that evolutionary space, nor does it identify what will happen when the evolutionary attractor is reached. While an evolutionary attractor may represent a point in evolutionary space that is resistant to further selection, such as an evolutionarily stable strategy, other possibilities are available. Because identification of an evolutionary attractor on its own does not describe everything about the evolutionary space in which it lies, this has led to interest in the evolutionary dynamics surrounding evolutionary attractors and in evolutionary spaces in general. (Theoretical biologists and mathematicians working in the area may prefer the terms adaptive dynamics or evolutionary invasion analysis to evolutionary dynamics.) These fields use differential equations which allows a more complete understanding of the dynamics in evolutionary spaces including the existence or otherwise of evolutionary attractors. Advances in the study of molecular evolution have also led to the identification of evolutionary attractors at a molecular level. Because biological evolutionary processes have been studied using evolutionary game theory, a technique inspired by game theory originally derived to address economic problems, not only can evolutionary attractors be found in biology but economists studying evolutionary economic models have also identified evolutionary attractors. Evolution in biology has also inspired evolutionary computation in computer science. Many algorithms in this field use a form of selection inspired by natural selection to generate results through evolutionary algorithms. This is therefore another area in which evolutionary attractors have been identified. == Evolutionary attractors in biology == It is not probably not surprising that biology is the field where most examples of evolutionary attractors have been identified, given the importance of evolution through natural selection. === Evolutionary attractors in adaptive landscapes === An evolutionary attractor is a point in genetic and/or phenotypic trait space, that evolution will always drive trait values towards via a selection process. The concept of an evolutionary attractor arose in population genetics following the origin of the adaptive landscape originally proposed by Sewall Wright in 1932. The height of a point in an adaptive landscape is a measure of evolutionary fitness. If a point in an adaptive landscape is a peak, then selection will always drive traits towards it and it will be an evolutionary attractor. While population genetics deals with discrete genetic traits, quantitative genetics extended such concepts to deal with continuous genetic traits, where the concept of evolutionary attractor is also valid. === Evolutionary attractors in evolutionary game models === Evolutionary game theory introduced into evolutionary biology concepts originally used in economics, with the advantage that evolution could be studied in relation to strategic choices made in animal conflicts. This is of particular interest because of the concept of the evolutionarily stable strategy or ESS, a strategy that once established is resistant to invasion by other strategies. ESSs will not always be evolutionary attractors, but if they are they will persist over evolutionary time. === Dynamics around evolutionary attractors in biology === Evolutionary attractors in biology do not exist in isolation. By definition they must exist in an evolutionary trait space where selection drives all traits towards them from a region immediately around them. That is, they must be convergence stable. Eshel (1983) modified the definition of an ESS by considering individually advantageous reduction from a majority deviation: he created the term continuous stability. A continuously stable ESS can be shown to be convergence stable, therefore it will act as an evolutionary attractor. But the nature of evolutionary trait spaces in biology means that it is not possible to guarantee that the region of convergence to the evolutionary attractor covers the whole of the trait space, nor that there is only one evolutionary attractor in a particular trait space. These issues have led to the emergence of the related fields of evolutionary dynamics, adaptive dynamics and evolutionary invasion analysis, all of which use differential equations to understand the dynamics in evolutionary trait spaces. Hence, if one or more evolutionary attractor exists in an evolutionary trait space, they provide techniques to understand the dynamics in that trait space around the evolutionary attractor. === Evolutionary attractors in an ecological context === Evolution in biology does not take place in single species in isolation. Ecological interaction of species leads to coevolution. Important examples of this are host-parasite or host-pathogen interaction, which can make both the dynamics around evolutionary attractors more complex, and the occurrence and number of evolutionary attractors more diverse. Evolutionary attractors have been identified in the analysis of evolutionary epidemiology of plant pathogens. In the above study working on plant populations the authors were able to identify evolutionary attractors using methods from adaptive dynamics. A model applied to the analysis of a maize (Zea mays L.) virus identified convergence stable equilibria through simulation modelling. A related model identified evolutionary attractors in the interaction of plants with fungal pathogens. === Evolutionary attractors in molecular genetics === As mentioned above much of the consideration of evolutionary attractors in biology has been through investigation of selection at a genetic or phenotypic level or both, in a single species or in coevolving species. Advances in the study of molecular genetics now allow the study of evolutionary attractors to be taken to a molecular genetic level. Wilson et. al (2019) studied the evolution of gene regulatory networks and identified the emergence of evolutionary attractors. == Evolutionary attractors in economics == Evolutionary game theory as applied in biology was inspired by game theory originally devised for applications in economics. Game theory remains an active field of research outside of biology, and thus it is not surprising that researchers in evolutionary economics use evolutionary game theory. Evolutionary attractors have been demonstrated by economists studying the evolutionary dynamics of market entry with market dynamics based on the replicator dynamics of biological evolutionary games. == Evolutionary attractors in computing == Evolutionary computation is a branch of computer science inspired by biological evolution. Many algorithms in evolutionary computation use a form of selection. Thus evolutionary attractors have been identified in computer science as well as in biology and economics. Evolutionary algorithms have generated evolutionary attractors, probably because of the similarity between adaptive hill-climbing in evolutionary heuristics and the adaptive landscape originated to explain evolution through natural selection.

    Read more →
  • Content as a service

    Content as a service

    Content as a service (CaaS) or managed content as a service (MCaaS) is a service-oriented model, where the service provider delivers the content on demand to the service consumer via web services that are licensed under subscription. The content is hosted by the service provider centrally in the cloud and offered to a number of consumers that need the content delivered into any applications or system, hence content can be demanded by the consumers as and when required. Content as a Service is a way to provide raw content (in other words, without the need for a specific human compatible representation, such as HTML) in a way that other systems can make use of it. Content as a Service is not meant for direct human consumption, but rather for other platforms to consume and make use of the content according to their particular needs. This happens usually on the cloud, with a centralized platform which can be globally accessible and provides a standard format for your content. With Content as a Service, you centralize your content into a single repository, where you can manage it, categorize it, make it available to others, search for it, or do whatever you wish with it. == Overview == The content delivered typically could be one or more of the following The technical terminology related to equipment or spares that is required to procure or design the materials The industrial terminology of the equipment or spares Technical values pertaining to various types, specifications, applications, characteristics of equipment or spares Sourcing information which will help in procurement or supply-chain management of equipment or spares Descriptive specifications of equipment or spares based on the product reference number or identifier UNSPSC codes or industry practiced classifications ISO, IEC compliant terminology Ontology or Technical Dictionary of products & services Predefined content for specific business needs The term "Content as a service" (CaaS) is considered to be part of the nomenclature of cloud computing service models & Service-oriented architecture along with Software as a service (SaaS), Infrastructure as a service (IaaS), and Platform as a service (PaaS).

    Read more →
  • Fitness approximation

    Fitness approximation

    Fitness approximation aims to approximate the objective or fitness functions in evolutionary optimization by building up machine learning models based on data collected from numerical simulations or physical experiments. The machine learning models for fitness approximation are also known as meta-models or surrogates, and evolutionary optimization based on approximated fitness evaluations are also known as surrogate-assisted evolutionary approximation. Fitness approximation in evolutionary optimization can be seen as a sub-area of data-driven evolutionary optimization. == Approximate models in function optimization == === Motivation === In many real-world optimization problems including engineering problems, the number of fitness function evaluations needed to obtain a good solution dominates the optimization cost. In order to obtain efficient optimization algorithms, it is crucial to use prior information gained during the optimization process. Conceptually, a natural approach to utilizing the known prior information is building a model of the fitness function to assist in the selection of candidate solutions for evaluation. A variety of techniques for constructing such a model, often also referred to as surrogates, metamodels or approximation models – for computationally expensive optimization problems have been considered. === Approaches === Common approaches to constructing approximate models based on learning and interpolation from known fitness values of a small population include: Low-degree polynomials and regression models Fourier surrogate modeling Artificial neural networks including Multilayer perceptrons Radial basis function network Support vector machines Due to the limited number of training samples and high dimensionality encountered in engineering design optimization, constructing a globally valid approximate model remains difficult. As a result, evolutionary algorithms using such approximate fitness functions may converge to local optima. Therefore, it can be beneficial to selectively use the original fitness function together with the approximate model.

    Read more →
  • Stress majorization

    Stress majorization

    Stress majorization is an optimization strategy used in multidimensional scaling (MDS) where, for a set of n {\displaystyle n} m {\displaystyle m} -dimensional data items, a configuration X {\displaystyle X} of n {\displaystyle n} points in r {\displaystyle r} ( ≪ m ) {\displaystyle (\ll m)} -dimensional space is sought that minimizes the so-called stress function σ ( X ) {\displaystyle \sigma (X)} . Usually r {\displaystyle r} is 2 {\displaystyle 2} or 3 {\displaystyle 3} , i.e. the ( n × r ) {\displaystyle (n\times r)} matrix X {\displaystyle X} lists points in 2 − {\displaystyle 2-} or 3 − {\displaystyle 3-} dimensional Euclidean space so that the result may be visualised (i.e. an MDS plot). The function σ {\displaystyle \sigma } is a cost or loss function that measures the squared differences between ideal ( m {\displaystyle m} -dimensional) distances and actual distances in r-dimensional space. It is defined as: σ ( X ) = ∑ i < j ≤ n w i j ( d i j ( X ) − δ i j ) 2 {\displaystyle \sigma (X)=\sum _{i Read more →

  • Canonical correspondence analysis

    Canonical correspondence analysis

    In multivariate analysis, canonical correspondence analysis (CCA) is an ordination technique that determines axes from the response data as a unimodal combination of measured predictors. CCA is commonly used in ecology in order to extract gradients that drive the composition of ecological communities. CCA extends correspondence analysis (CA) with regression, in order to incorporate predictor variables. == History == CCA was developed in 1986 by Cajo ter Braak and implemented in the program CANOCO, an extension of DECORANA. To date, CCA is one of the most popular multivariate methods in ecology, despite the availability of contemporary alternatives. CCA was originally derived and implemented using an algorithm of weighted averaging, though Legendre & Legendre (1998) derived an alternative algorithm. == Assumptions == The requirements of a CCA are that the samples are random and independent. Also, the data are categorical and that the independent variables are consistent within the sample site and error-free. The original publication states the need for equal species tolerances, equal species maxima, and equispaced or uniformly distributed species optima and site scores.

    Read more →
  • Bigram

    Bigram

    A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, and speech recognition. Gappy bigrams or skipping bigrams are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar). == Applications == Bigrams, along with other n-grams, are used in most successful language models for speech recognition. Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis. Bigram frequency is one approach to statistical language identification. Some activities in logology or recreational linguistics involve bigrams. These include attempts to find English words beginning with every possible bigram, or words containing a string of repeated bigrams, such as logogogue. == Bigram frequency in the English language == The frequency of the most common letter bigrams in a large English corpus is: th 3.56% of 1.17% io 0.83% he 3.07% ed 1.17% le 0.83% in 2.43% is 1.13% ve 0.83% er 2.05% it 1.12% co 0.79% an 1.99% al 1.09% me 0.79% re 1.85% ar 1.07% de 0.76% on 1.76% st 1.05% hi 0.76% at 1.49% to 1.05% ri 0.73% en 1.45% nt 1.04% ro 0.73% nd 1.35% ng 0.95% ic 0.70% ti 1.34% se 0.93% ne 0.69% es 1.34% ha 0.93% ea 0.69% or 1.28% as 0.87% ra 0.69% te 1.20% ou 0.87% ce 0.65%

    Read more →
  • Implicit blockmodeling

    Implicit blockmodeling

    Implicit blockmodeling is an approach in blockmodeling, similar to a valued and homogeneity blockmodeling, where initially an additional normalization is used and then while specifying the parameter of the relevant link is replaced by the block maximum. This approach was first proposed by Batagelj and Ferligoj in 2000, and developed by Aleš Žiberna in 2007/08. Comparing with homogeneity, the implicit blockmodeling will perform similarly with max-regular equivalence, but slightly worse in other settings. It will perform worse than valued and homogeneity blockmodeling with a pre-specified blockmodel.

    Read more →
  • Cross-entropy

    Cross-entropy

    In information theory, the cross-entropy between two probability distributions p {\displaystyle p} and q {\displaystyle q} , over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution q {\displaystyle q} , rather than the true distribution p {\displaystyle p} . == Definition == The cross-entropy of the distribution q {\displaystyle q} relative to a distribution p {\displaystyle p} over a given set is defined as follows: H ( p , q ) = − E p ⁡ [ log ⁡ q ] , {\displaystyle H(p,q)=-\operatorname {E} _{p}[\log q],} where E p ⁡ [ ⋅ ] {\displaystyle \operatorname {E} _{p}[\cdot ]} is the expected value operator with respect to the distribution p {\displaystyle p} . The definition may be formulated using the Kullback–Leibler divergence D K L ( p ∥ q ) {\displaystyle D_{\mathrm {KL} }(p\parallel q)} , divergence of p {\displaystyle p} from q {\displaystyle q} (also known as the relative entropy of p {\displaystyle p} with respect to q {\displaystyle q} ). H ( p , q ) = H ( p ) + D K L ( p ∥ q ) , {\displaystyle H(p,q)=H(p)+D_{\mathrm {KL} }(p\parallel q),} where H ( p ) {\displaystyle H(p)} is the entropy of p {\displaystyle p} . For discrete probability distributions p {\displaystyle p} and q {\displaystyle q} with the same support X {\displaystyle {\mathcal {X}}} , this means The situation for continuous distributions is analogous. We have to assume that p {\displaystyle p} and q {\displaystyle q} are absolutely continuous with respect to some reference measure r {\displaystyle r} (usually r {\displaystyle r} is a Lebesgue measure on a Borel σ-algebra). Let P {\displaystyle P} and Q {\displaystyle Q} be probability density functions of p {\displaystyle p} and q {\displaystyle q} with respect to r {\displaystyle r} . Then − ∫ X P ( x ) log ⁡ Q ( x ) d x = E p ⁡ [ − log ⁡ Q ] , {\displaystyle -\int _{\mathcal {X}}P(x)\,\log Q(x)\,\mathrm {d} x=\operatorname {E} _{p}[-\log Q],} and therefore NB: The notation H ( p , q ) {\displaystyle H(p,q)} is also used for a different concept, the joint entropy of p {\displaystyle p} and q {\displaystyle q} . == Motivation == In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x i {\displaystyle x_{i}} out of a set of possibilities { x 1 , … , x n } {\displaystyle \{x_{1},\ldots ,x_{n}\}} can be seen as representing an implicit probability distribution q ( x i ) = ( 1 2 ) ℓ i {\displaystyle q(x_{i})=\left({\frac {1}{2}}\right)^{\ell _{i}}} over { x 1 , … , x n } {\displaystyle \{x_{1},\ldots ,x_{n}\}} , where ℓ i {\displaystyle \ell _{i}} is the length of the code for x i {\displaystyle x_{i}} in bits. Therefore, cross-entropy can be interpreted as the expected message-length per datum when a wrong distribution q {\displaystyle q} is assumed while the data actually follows a distribution p {\displaystyle p} . That is why the expectation is taken over the true probability distribution p {\displaystyle p} and not q . {\displaystyle q.} Indeed the expected message-length under the true distribution p {\displaystyle p} is E p ⁡ [ ℓ ] = − E p ⁡ [ ln ⁡ q ( x ) ln ⁡ ( 2 ) ] = − E p ⁡ [ log 2 ⁡ q ( x ) ] = − ∑ x i p ( x i ) log 2 ⁡ q ( x i ) = − ∑ x p ( x ) log 2 ⁡ q ( x ) = H ( p , q ) . {\displaystyle {\begin{aligned}\operatorname {E} _{p}[\ell ]&=-\operatorname {E} _{p}\left[{\frac {\ln {q(x)}}{\ln(2)}}\right]\\[1ex]&=-\operatorname {E} _{p}\left[\log _{2}{q(x)}\right]\\[1ex]&=-\sum _{x_{i}}p(x_{i})\,\log _{2}q(x_{i})\\[1ex]&=-\sum _{x}p(x)\,\log _{2}q(x)=H(p,q).\end{aligned}}} == Estimation == There are many situations where cross-entropy needs to be measured but the distribution of p {\displaystyle p} is unknown. An example is language modeling, where a model is created based on a training set T {\displaystyle T} , and then its cross-entropy is measured on a test set to assess how accurate the model is in predicting the test data. In this example, p {\displaystyle p} is the true distribution of words in any corpus, and q {\displaystyle q} is the distribution of words as predicted by the model. Since the true distribution is unknown, cross-entropy cannot be directly calculated. In these cases, an estimate of cross-entropy is calculated using the following formula: H ( T , q ) = − ∑ i = 1 N 1 N log 2 ⁡ q ( x i ) {\displaystyle H(T,q)=-\sum _{i=1}^{N}{\frac {1}{N}}\log _{2}q(x_{i})} where N {\displaystyle N} is the size of the test set, and q ( x ) {\displaystyle q(x)} is the probability of event x {\displaystyle x} estimated from the training set. In other words, q ( x i ) {\displaystyle q(x_{i})} is the probability estimate of the model that the i-th word of the text is x i {\displaystyle x_{i}} . The sum is averaged over the N {\displaystyle N} words of the test. This is a Monte Carlo estimate of the true cross-entropy, where the test set is treated as samples from p ( x ) {\displaystyle p(x)} . == Relation to maximum likelihood == The cross entropy arises in classification problems when introducing a logarithm in the guise of the log-likelihood function. This section concerns the estimation of the probabilities of different discrete outcomes. To this end, denote a parametrized family of distributions by q θ {\displaystyle q_{\theta }} , with θ {\displaystyle \theta } subject to the optimization effort. Consider a given finite sequence of N {\displaystyle N} values x i {\displaystyle x_{i}} from a training set, obtained from conditionally independent sampling. The likelihood assigned to any considered parameter θ {\displaystyle \theta } of the model is then given by the product over all probabilities q θ ( X = x i ) {\displaystyle q_{\theta }(X=x_{i})} . Repeated occurrences are possible, leading to equal factors in the product. If the count of occurrences of the value equal to x {\displaystyle x} is denoted by # x {\displaystyle \#x} , then the frequency of that value equals # x / N {\displaystyle \#x/N} . If p ( X = x ) {\displaystyle p(X=x)} is the underlying probability distribution, for large N {\displaystyle N} we expect p ( X = x ) ≈ # x / N {\displaystyle p(X=x)\approx \#x/N} , by the law of large numbers. Writing our likelihood function as the product of observations from the distribution q θ {\displaystyle q_{\theta }} : L ( θ ; x ) = ∏ i q θ ( X = x i ) = ∏ x q θ ( X = x ) # x ≈ ∏ x q θ ( X = x ) N ⋅ p ( X = x ) = exp ⁡ log ⁡ [ ∏ x q θ ( X = x ) N ⋅ p ( X = x ) ] = exp ⁡ ( ∑ x N ⋅ p ( X = x ) log ⁡ q θ ( X = x ) ) , {\displaystyle {\begin{aligned}{\mathcal {L}}(\theta ;{\mathbf {x} })&=\prod _{i}q_{\theta }(X=x_{i})=\prod _{x}q_{\theta }(X=x)^{\#x}\\&\approx \prod _{x}q_{\theta }(X=x)^{N\cdot p(X=x)}=\exp \log \left[\prod _{x}q_{\theta }(X=x)^{N\cdot p(X=x)}\right]\\&=\exp \left(\sum _{x}N\cdot p(X=x)\log q_{\theta }(X=x)^{}\right),\end{aligned}}} where we have used the calculation rules for the logarithm in the final line. Notice how the exponent contains a − H ( p , q θ ) {\displaystyle -H(p,q_{\theta })} term. Taking the logarithm of both sides gives: log ⁡ L ( θ ; x ) = − N ⋅ H ( p , q θ ) . {\displaystyle \log {\mathcal {L}}(\theta ;{\mathbf {x} })=-N\cdot H(p,q_{\theta }).} Since the logarithm is a monotonically increasing function, the maximizing value of θ {\displaystyle \theta } is unaffected by this final step. Similarly, the maximizing value of θ {\displaystyle \theta } is unaffected by the factor of N {\displaystyle N} . So we observe that the likelihood maximization amounts to minimization of the cross-entropy. == Cross-entropy minimization == Cross-entropy minimization is frequently used in optimization and rare-event probability estimation. When comparing a distribution q {\displaystyle q} against a fixed reference distribution p {\displaystyle p} , cross-entropy and KL divergence are identical up to an additive constant (since p {\displaystyle p} is fixed): According to the Gibbs' inequality, both take on their minimal values when p = q {\displaystyle p=q} , which is 0 {\displaystyle 0} for KL divergence, and H ( p ) {\displaystyle \mathrm {H} (p)} for cross-entropy. In the engineering literature, the principle of minimizing KL divergence (Kullback's "Principle of Minimum Discrimination Information") is often called the Principle of Minimum Cross-Entropy (MCE), or Minxent. However, as discussed in the article Kullback–Leibler divergence, sometimes the distribution q {\displaystyle q} is the fixed prior reference distribution, and the distribution p {\displaystyle p} is optimized to be as close to q {\displaystyle q} as possible, subject to some constraint. In this case the two minimizations are not equivalent. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by restating cross-entropy to be D K L ( p ∥ q ) {\displaystyle D_{\mathrm {KL} }(p\parallel q)} , rather than H (

    Read more →