AI Essay Writer

AI Essay Writer — hands-on reviews, top picks, pricing, pros and cons and a practical how-to guide on Aizhi.

  • Group of Governmental Experts on Lethal Autonomous Weapons Systems

    The Group of Governmental Experts on Lethal Autonomous Weapons Systems, commonly known as the GGE on LAWS, refers to a group of governmental experts established under the framework of the Convention on Certain Conventional Weapons (CCW), a United Nations arms control framework. The group examines legal, ethical, societal and moral questions that arise from the increased use of autonomous robots to carry weapons and to be programmed to engage in combat in various situations that might arise, including battles between countries, or in patrolling border areas or sensitive areas, or other similar roles. As of 18 March 2025, the Convention on Certain Conventional Weapons had 128 High Contracting Parties. In the Geneva Conventions, the term "High Contracting Parties" refers to the states that have joined the conventions and are therefore bound to uphold them. Among the countries that have joined are states with tense relations or ongoing armed conflict with one another, including Russia and Ukraine, Israel and the State of Palestine, and Pakistan and Afghanistan. == Background == In 2013, the Meeting of State Parties to the Convention on Certain Conventional Weapons agreed on a mandate on lethal autonomous weapon systems and tasked its chairperson with convening an informal Meeting of Experts to discuss issues related to emerging technologies in the area of LAWS. Those informal Meetings of Experts were then held in 2014, 2015 and 2016, and their reports fed into subsequent meetings of the High Contracting Parties. At the Fifth CCW Review Conference in 2016, the High Contracting Parties decided to establish an open-ended Group of Governmental Experts on emerging technologies in the area of LAWS, building on the earlier expert meetings. Since then, the group has been reconvened annually. In 2023, the Meeting of the High Contracting Parties to the CCW decided that the GGE on LAWS would continue its work in 2024 and 2025. 
The group was tasked with developing, by consensus, elements of a possible instrument, without predetermining its form, as well as other measures addressing lethal autonomous weapon systems, drawing on existing CCW protocols, earlier recommendations, state proposals, and legal, military, and technological expertise. == 2024 == In 2024, the GGE met twice, and the group was chaired by Robert in den Bosch, the Netherlands' disarmament ambassador. The 2024 Meeting of the High Contracting Parties decided that the group would meet for 10 days in 2025, in two five-day sessions, and reaffirmed its mandate to continue work by consensus on possible elements of an instrument and other measures addressing lethal autonomous weapon systems. == 2025 == At its first 2025 session, held in Geneva from 3 to 7 March 2025, the Group of Governmental Experts on Lethal Autonomous Weapon Systems discussed revisions to the chair's rolling text. The text was structured into five sections, or "boxes", though delegates held differing views on whether headings were useful or appropriate. Broadly, the discussions covered the characterization of lethal autonomous weapon systems, the application of international humanitarian law, possible prohibitions and regulations, legal review, and questions of accountability and responsibility. At its second session, held from 1 to 5 September 2025, delegations continued work on the chair's rolling text, which set out elements of a possible instrument and was organized into five thematic "boxes". == 2026 == === Developments before the 2026 session === A few weeks before the meeting, autonomous weapons drew renewed attention when the United States pressured Anthropic to revise the terms of use for its AI model Claude. Anthropic prohibited the model's use for mass domestic surveillance and for fully autonomous weapons operating without human oversight, while reports also emerged that OpenAI had reached an agreement with the U.S. 
Department of War for the use of its AI models, reportedly stipulating that they would not independently direct autonomous weapons where human control was required. The U.S. military nevertheless continued to use Claude during its war on Iran, and there was increasing alarm about the use of AI-assisted semi-autonomous weapons in conflicts including those in Ukraine, Sudan, Gaza, and Iran. Before the start of the sessions, Robert in den Bosch, as chair, warned that progress was urgent because technological developments were moving quickly. At the same time, although states agreed that international humanitarian law applied to LAWS, specific internationally binding standards governing such systems remained largely absent. A key divide before the session was that Russia and the United States opposed new legally binding instruments, while other states argued that new rules were necessary. According to Robert in den Bosch, the talks could lead to new rules, amendments to an existing convention, or a new treaty. === First session === From 2 to 6 March 2026, the group held its penultimate session under the group's three-year mandate. Delegations discussed the chair's rolling draft text, circulated in December 2025, on elements of a possible instrument or other measures concerning lethal autonomous weapon systems. In revised text circulated by the chair on 5 March 2026, a lethal autonomous weapon system was characterized as "a functionally integrated combination of one or more weapons and technological components, that can identify, select, and engage a target, without intervention by a human operator in the execution of these tasks". The text was divided into five boxes to structure discussion. During the session, delegates conducted a first reading of the draft text, and the chair later circulated revised language for several sections. Informal consultations were also held. 
According to campaign groups and participating observers, support grew during the week for moving to negotiations on the basis of the rolling text, with more than 70 states said to support that step by the end of the session, though some participants warned that attempts to bridge differences risked blurring the group's core purpose. The International Committee of the Red Cross argued that the text should not only restate existing international humanitarian law, but also clarify how those rules apply to autonomous weapons and set out additional measures tailored to the specific challenges such systems raise. Stop Killer Robots likewise emphasized the need to preserve meaningful human judgment and control over increasingly autonomous systems. During the discussions, the U.S. delegation opposed the term "human control" and reportedly proposed the alternative phrase "good faith human judgment and care". Other delegations rejected that wording as too weak, while many states continued to insist that meaningful human control over weapon systems remained essential.

  • Policy gradient method

    Policy gradient methods are a class of reinforcement learning algorithms and a sub-class of policy optimization methods. Unlike value-based methods, which learn a value function and derive a policy from it, policy optimization methods directly learn a policy function $\pi$ that selects actions without consulting a value function. For policy gradient to apply, the policy function $\pi_\theta$ must be differentiable with respect to its parameter $\theta$.

== Overview ==

In policy-based RL, the actor is a parameterized policy function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$. If the action space is discrete, then $\sum_a \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_a \pi_\theta(a \mid s)\,\mathrm{d}a = 1$.

The goal of policy optimization is to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} R_t \,\Big|\, S_0 = s_0\right]$$

where $\gamma$ is the discount factor, $R_t$ is the reward at step $t$, $s_0$ is the starting state, and $T$ is the time-horizon (which can be infinite).

The policy gradient is defined as $\nabla_\theta J(\theta)$. Different policy gradient methods stochastically estimate the policy gradient in different ways. The goal of any policy gradient method is to iteratively maximize $J(\theta)$ by gradient ascent.
Since the key part of any policy gradient method is the stochastic estimation of the policy gradient, these methods are also studied under the title of "Monte Carlo gradient estimation".

== REINFORCE ==

=== Policy gradient ===

The REINFORCE algorithm, introduced by Ronald J. Williams in 1992, was the first policy gradient method. It is based on the identity for the policy gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \sum_{\tau=0}^{T} \gamma^{\tau} R_\tau \,\Big|\, S_0 = s_0\right]$$

which can be improved via the "causality trick", in which the action at step $t$ is credited only with rewards from step $t$ onward:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \sum_{\tau=t}^{T} \gamma^{\tau} R_\tau \,\Big|\, S_0 = s_0\right]$$

Thus, we have an unbiased estimator of the policy gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_{t,n} \mid S_{t,n}) \sum_{\tau=t}^{T} \gamma^{\tau} R_{\tau,n}\right]$$

where the index $n$ ranges over $N$ rollout trajectories using the policy $\pi_\theta$.

The score function $\nabla_\theta \ln \pi_\theta(A_t \mid S_t)$ can be interpreted as the direction in the parameter space that increases the probability of taking action $A_t$ in state $S_t$.
The policy gradient, then, is a weighted average of all possible directions to increase the probability of taking any action in any state, weighted by reward signals: if taking a certain action in a certain state is associated with high reward, then that direction is strongly reinforced, and vice versa.

=== Algorithm ===

The REINFORCE algorithm is a loop:

1. Roll out $N$ trajectories in the environment, using $\pi_{\theta_i}$ as the policy function.
2. Compute the policy gradient estimate: $$g_i \leftarrow \frac{1}{N} \sum_{n=1}^{N} \left[\sum_{t=0}^{T} \nabla_{\theta_i} \ln \pi_{\theta_i}(A_{t,n} \mid S_{t,n}) \sum_{\tau=t}^{T} \gamma^{\tau} R_{\tau,n}\right]$$
3. Update the policy by gradient ascent: $$\theta_{i+1} \leftarrow \theta_i + \alpha_i g_i$$

Here, $\alpha_i$ is the learning rate at update step $i$.

== Variance reduction ==

REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy $\pi_\theta$. This can lead to high variance in the updates, as the total return can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of variance reduction.
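Before turning to those variants, the basic REINFORCE loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the three-state corridor environment, the tabular softmax policy, the step cost of 1, and every name (`rollout`, `reinforce_gradient`, `train`, the hyperparameter defaults) are hypothetical choices made for the example.

```python
import numpy as np

N_STATES = 3   # assumed toy corridor: states 0, 1, 2; reaching state 3 ends the episode
HORIZON = 10   # time-horizon T (cap on episode length)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout(theta, rng):
    """Sample one trajectory under the tabular softmax policy pi_theta."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(HORIZON):
        a = int(rng.random() < softmax(theta[s])[1])  # 1 = move right, 0 = move left
        states.append(s)
        actions.append(a)
        rewards.append(-1.0)  # assumed reward: every step costs 1
        s = s + 1 if a == 1 else max(s - 1, 0)
        if s == N_STATES:
            break
    return states, actions, rewards

def reinforce_gradient(theta, episodes, gamma):
    """Average over rollouts of sum_t score_t * sum_{tau >= t} gamma^tau * R_tau."""
    g = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        discounted = np.array([gamma**t * r for t, r in enumerate(rewards)])
        reward_to_go = np.cumsum(discounted[::-1])[::-1]
        for t, (s, a) in enumerate(zip(states, actions)):
            # Score of a softmax policy w.r.t. its logits: onehot(a) - pi(. | s)
            score = -softmax(theta[s])
            score[a] += 1.0
            g[s] += score * reward_to_go[t]
    return g / len(episodes)

def train(n_iters=300, n_rollouts=16, alpha=0.2, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))  # one pair of action logits per state
    for _ in range(n_iters):
        episodes = [rollout(theta, rng) for _ in range(n_rollouts)]
        theta = theta + alpha * reinforce_gradient(theta, episodes, gamma)  # gradient ascent
    return theta
```

Calling `train()` returns the learned logits; applying `softmax` to each row gives the action probabilities per state, which in this toy problem should come to favor moving right. No baseline is used here, so the estimate is noisy, which is the issue the variance reduction techniques address.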
=== REINFORCE with baseline ===

A common way to reduce variance is the REINFORCE with baseline algorithm, based on the following identity:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \left(\sum_{\tau=t}^{T} \gamma^{\tau} R_\tau - b(S_t)\right) \Big|\, S_0 = s_0\right]$$

for any function $b : \text{States} \to \mathbb{R}$. The identity holds because the score function has zero mean under the policy, $\mathbb{E}_{A \sim \pi_\theta(\cdot \mid s)}[\nabla_\theta \ln \pi_\theta(A \mid s)] = 0$, so the baseline term vanishes in expectation. The algorithm uses the modified gradient estimator

$$g_i \leftarrow \frac{1}{N} \sum_{n=1}^{N} \left[\sum_{t=0}^{T} \nabla_{\theta_i} \ln \pi_{\theta_i}(A_{t,n} \mid S_{t,n}) \left(\sum_{\tau=t}^{T} \gamma^{\tau} R_{\tau,n} - b_i(S_{t,n})\right)\right]$$

and the original REINFORCE algorithm is the special case where $b_i \equiv 0$.

=== Actor-critic methods ===

If $b_i$ is chosen well, such that $b_i(S_t) \approx \mathbb{E}_{\pi_{\theta_i}}\left[\sum_{\tau=t}^{T} \gamma^{\tau} R_\tau \,\Big|\, S_t\right] = \gamma^{t} V^{\pi_{\theta_i}}(S_t)$, this could significantly decrease variance in the gradient estimation.
That is, the baseline should be as close to the value function $V^{\pi_{\theta_i}}(S_t)$ as possible, approaching the ideal of:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \left(\sum_{\tau=t}^{T} \gamma^{\tau} R_\tau - \gamma^{t} V^{\pi_\theta}(S_t)\right) \Big|\, S_0 = s_0\right]$$

Note that, as the policy $\pi_{\theta_i}$ updates, the value function $V^{\pi_{\theta_i}}(S_t)$ updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the actor-critic methods, where the policy function is the actor and the value function is the critic.

The Q-function $Q^{\pi}$ can also be used as the critic, since

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \gamma^{t} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot Q^{\pi_\theta}(S_t, A_t) \,\Big|\, S_0 = s_0\right]$$

by a similar argument using the tower law.
Subtracting the value function as a baseline, we find that the advantage function $A^{\pi}(S, A) = Q^{\pi}(S, A) - V^{\pi}(S)$ can be used as the critic as well:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \gamma^{t} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot A^{\pi_\theta}(S_t, A_t) \,\Big|\, S_0 = s_0\right]$$

In summary, there are many unbiased estimators for $\nabla_\theta J(\theta)$, all in the form of:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot \Psi_t \,\Big|\, S_0 = s_0\right]$$
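The claim that a baseline changes the variance but not the mean of the estimator can be checked exactly in the simplest case. The sketch below assumes a one-step problem (a two-armed bandit, so $T = 0$) with a softmax policy over logits and deterministic rewards; the reward values and function names are hypothetical. Because there are only two actions, the expectation and variance are computed by enumeration rather than by sampling.

```python
import numpy as np

def estimator_per_action(pi, rewards, baseline):
    """Row a holds score(a) * (R_a - b): the gradient estimate obtained
    when action a is sampled, taken w.r.t. the policy logits."""
    scores = np.eye(len(pi)) - pi          # row a: onehot(a) - pi
    return scores * (rewards - baseline)[:, None]

def mean_and_variance(pi, rewards, baseline=0.0):
    g = estimator_per_action(pi, rewards, baseline)
    mean = pi @ g                          # exact expectation over actions
    var = pi @ (g - mean) ** 2             # exact per-coordinate variance
    return mean, var

pi = np.array([0.5, 0.5])                  # uniform policy over two arms
rewards = np.array([10.0, 11.0])           # assumed deterministic rewards
value = pi @ rewards                       # V = expected reward = 10.5

mean_plain, var_plain = mean_and_variance(pi, rewards)
mean_base, var_base = mean_and_variance(pi, rewards, baseline=value)
```

Both calls give the same mean gradient, while the per-coordinate variance drops when the value is subtracted. In this deliberately extreme example the variance falls to zero because the rewards are deterministic and symmetric about the baseline; in general a baseline reduces variance without eliminating it.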

  • Promoter based genetic algorithm

    The promoter based genetic algorithm (PBGA) is a genetic algorithm for neuroevolution developed by F. Bellas and R.J. Duro in the Integrated Group for Engineering Research (GII) at the University of Coruña, in Spain. It evolves variable-size feedforward artificial neural networks (ANN) that are encoded into sequences of genes for constructing a basic ANN unit. Each of these blocks is preceded by a gene promoter acting as an on/off switch that determines whether that particular unit will be expressed.

== PBGA basics ==

The basic unit in the PBGA is a neuron with all of its inbound connections. The genotype of a basic unit is a set of real-valued weights followed by the parameters of the neuron and preceded by an integer-valued field that determines the promoter gene value and, consequently, the expression of the unit. By concatenating units of this type we can construct the whole network. With this encoding, information that is not expressed is still carried by the genotype during evolution but is shielded from direct selective pressure. This maintains diversity in the population, which has been a design premise for this algorithm. A clear difference is therefore established between the search space and the solution space, permitting information learned and encoded into the genotypic representation to be preserved by disabling promoter genes.

== Results ==

The PBGA was originally presented within the field of autonomous robotics, in particular for the real-time learning of environment models by the robot. It has been used inside the Multilevel Darwinist Brain (MDB) cognitive mechanism developed in the GII for on-line learning in real robots. Another paper shows that applying the PBGA together with an external memory that stores the successfully obtained world models is an optimal strategy for adaptation in dynamic environments.
Recently, the PBGA has provided results that outperform other neuroevolutionary algorithms in non-stationary problems, where the fitness function varies in time.
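The encoding described under "PBGA basics" can be illustrated with a small sketch. It is not the authors' implementation: the field names, the use of 1/0 as promoter values, a single bias as the neuron's parameter, the `tanh` activation, and the restriction to one layer of units are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Unit:
    promoter: int    # assumed encoding: 1 = expressed, 0 = silenced
    weights: tuple   # real-valued weights of the inbound connections
    bias: float      # hypothetical stand-in for "the parameters of the neuron"

def expressed(genotype):
    """Units whose promoter is on; silenced units stay in the genotype."""
    return [u for u in genotype if u.promoter == 1]

def activations(genotype, inputs):
    """Outputs of the expressed units only (phenotype of a one-layer network)."""
    return [
        math.tanh(sum(w * x for w, x in zip(u.weights, inputs)) + u.bias)
        for u in expressed(genotype)
    ]

def toggle_promoter(genotype, index):
    """Mutation that flips one promoter; the unit's weights are carried unchanged."""
    unit = genotype[index]
    flipped = replace(unit, promoter=1 - unit.promoter)
    return genotype[:index] + [flipped] + genotype[index + 1:]

genotype = [
    Unit(promoter=1, weights=(0.5, -0.5), bias=0.0),
    Unit(promoter=0, weights=(1.0, 1.0), bias=0.1),   # carried but not expressed
]
reactivated = toggle_promoter(genotype, 1)
```

Here the second unit contributes nothing to `activations(genotype, ...)`, yet its weights survive in the genotype and become active again once its promoter is flipped, which is the mechanism the text credits with preserving previously learned information.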

  • Naive Bayes classifier

    In statistics, naive (sometimes simple or idiot's) Bayes classifiers are a family of "probabilistic classifiers" which assume that the features are conditionally independent, given the target class. In other words, a naive Bayes model assumes the information about the class provided by each variable is unrelated to the information from the others, with no information shared between the predictors. The highly unrealistic nature of this assumption, called the naive independence assumption, is what gives the classifier its name. These classifiers are some of the simplest Bayesian network models. Naive Bayes classifiers generally perform worse than more advanced models like logistic regressions, especially at quantifying uncertainty (with naive Bayes models often producing wildly overconfident probabilities). However, they are highly scalable, requiring only one parameter for each feature or predictor in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression (simply by counting observations in each group), rather than the expensive iterative approximation algorithms required by most other models. Despite the use of Bayes' theorem in the classifier's decision rule, naive Bayes is not (necessarily) a Bayesian method, and naive Bayes models can be fit to data using either Bayesian or frequentist methods. == Introduction == Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. 
A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.

Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.

== Probabilistic model ==

Abstractly, naive Bayes is a conditional probability model: it assigns probabilities $p(C_k \mid x_1, \ldots, x_n)$ for each of the $K$ possible outcomes or classes $C_k$ given a problem instance to be classified, represented by a vector $\mathbf{x} = (x_1, \ldots, x_n)$ encoding some $n$ features (independent variables).

The problem with the above formulation is that if the number of features $n$ is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. The model must therefore be reformulated to make it more tractable.
Using Bayes' theorem, the conditional probability can be decomposed as:

$$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}$$

In plain English, using Bayesian probability terminology, the above equation can be written as

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features $x_i$ are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

$$p(C_k, x_1, \ldots, x_n)$$

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$$\begin{aligned}
p(C_k, x_1, \ldots, x_n) &= p(x_1, \ldots, x_n, C_k) \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2, \ldots, x_n, C_k) \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2 \mid x_3, \ldots, x_n, C_k)\, p(x_3, \ldots, x_n, C_k) \\
&= \cdots \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2 \mid x_3, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k)
\end{aligned}$$

Now the "naive" conditional independence assumptions come into play: assume that all features in $\mathbf{x}$ are mutually independent, conditional on the category $C_k$.
Under this assumption,

$$p(x_i \mid x_{i+1}, \ldots, x_n, C_k) = p(x_i \mid C_k).$$

Thus, the joint model can be expressed as

$$\begin{aligned}
p(C_k \mid x_1, \ldots, x_n) &\propto p(C_k, x_1, \ldots, x_n) \\
&= p(C_k)\, p(x_1 \mid C_k)\, p(x_2 \mid C_k)\, p(x_3 \mid C_k) \cdots \\
&= p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),
\end{aligned}$$

where $\propto$ denotes proportionality since the denominator $p(\mathbf{x})$ is omitted.

This means that under the above independence assumptions, the conditional distribution over the class variable $C$ is:

$$p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$

where the evidence $Z = p(\mathbf{x}) = \sum_k p(C_k)\, p(\mathbf{x} \mid C_k)$ is a scaling factor dependent only on $x_1, \ldots, x_n$, that is, a constant if the values of the feature variables are known.

Often, it is only necessary to discriminate between classes.
In that case, the scaling factor is irrelevant, and it is sufficient to calculate the log-probability up to an additive constant:

$$\ln p(C_k \mid x_1, \ldots, x_n) = \ln p(C_k) + \sum_{i=1}^{n} \ln p(x_i \mid C_k) \underbrace{-\ln Z}_{\text{irrelevant}}$$

The scaling factor is irrelevant, since discrimination subtracts it away:

$$\ln \frac{p(C_k \mid x_1, \ldots, x_n)}{p(C_l \mid x_1, \ldots, x_n)} = \left(\ln p(C_k) + \sum_{i=1}^{n} \ln p(x_i \mid C_k)\right) - \left(\ln p(C_l) + \sum_{i=1}^{n} \ln p(x_i \mid C_l)\right)$$

There are two benefits of using log-probability. One is that it allows an interpretation in information theory, where negative log-probabilities measure information in nats. Another is that it avoids arithmetic underflow.

=== Constructing a classifier from the probability model ===

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable so as to minimize the probability of misclassification; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\hat{y} = C_k$ for some $k$ as follows:

$$\hat{y} = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}}\ p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k).$$
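The MAP rule and the log-probability form can be sketched for categorical features as follows. This is an illustrative sketch rather than a canonical implementation: the toy fruit data, the function names, and the add-one (Laplace) smoothing controlled by `alpha` are assumptions that do not come from the text; smoothing is included only so that an unseen feature value does not produce `log(0)`.

```python
import math
from collections import Counter

def fit(X, y, alpha=1.0):
    """Estimate the model by counting observations in each class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    class_counts = Counter(y)
    # value_counts[c][i][v] = number of class-c rows whose feature i equals v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in classes}
    n_values = [len({row[i] for row in X}) for i in range(n_features)]
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
    return {"classes": classes, "class_counts": class_counts,
            "value_counts": value_counts, "n_values": n_values,
            "n": len(y), "alpha": alpha}

def log_scores(model, x):
    """ln p(C_k) + sum_i ln p(x_i | C_k) per class; the -ln Z term is omitted."""
    scores = {}
    for c in model["classes"]:
        n_c = model["class_counts"][c]
        score = math.log(n_c / model["n"])
        for i, v in enumerate(x):
            num = model["value_counts"][c][i][v] + model["alpha"]  # smoothed count
            den = n_c + model["alpha"] * model["n_values"][i]
            score += math.log(num / den)
        scores[c] = score
    return scores

def predict(model, x):
    """MAP decision rule: the class with the largest log-score."""
    scores = log_scores(model, x)
    return max(scores, key=scores.get)

# Hypothetical training data: (color, shape) -> fruit
X = [("red", "round"), ("red", "round"), ("green", "round"),
     ("yellow", "long"), ("yellow", "long"), ("green", "long")]
y = ["apple", "apple", "apple", "banana", "banana", "banana"]
model = fit(X, y)
label = predict(model, ("red", "round"))

# Why logs: a product of many small probabilities underflows, the sum of logs does not.
underflowed = math.prod([0.1] * 2000)
log_sum = sum(math.log(0.1) for _ in range(2000))
```

On this toy data `label` is `"apple"`. The last two lines illustrate the underflow point: the direct product collapses to `0.0` in double precision, whereas the log-sum stays finite.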

  • Evaluation of binary classifiers

    Evaluation of a binary classifier typically assigns a numerical value, or values, to a classifier that represent its accuracy. An example is error rate, which measures how frequently the classifier makes a mistake. There are many metrics that can be used; different fields have different preferences. For example, in medicine sensitivity and specificity are often used, while in computer science precision and recall are preferred. An important distinction is between metrics that are independent of the prevalence or skew (how often each class occurs in the population), and metrics that depend on the prevalence – both types are useful, but they have very different properties. Often, evaluation is used to compare two methods of classification, so that one can be adopted and the other discarded. Such comparisons are more directly achieved by a form of evaluation that results in a single unitary metric rather than a pair of metrics. == Contingency table == Given a data set, a classification (the output of a classifier on that set) gives two numbers: the number of positives and the number of negatives, which add up to the total size of the set. To evaluate a classifier, one compares its output to another reference classification – ideally a perfect classification, but in practice the output of another gold standard test – and cross tabulates the data into a 2×2 contingency table, comparing the two classifications. One then evaluates the classifier relative to the gold standard by computing summary statistics of these 4 numbers. Generally these statistics will be scale invariant (scaling all the numbers by the same factor does not change the output), to make them independent of population size, which is achieved by using ratios of homogeneous functions, most simply homogeneous linear or homogeneous quadratic functions. Say we test some people for the presence of a disease. Some of these people have the disease, and our test correctly says they are positive. 
They are called true positives (TP). Some have the disease, but the test incorrectly claims they don't. They are called false negatives (FN). Some don't have the disease, and the test says they don't: true negatives (TN). Finally, there might be healthy people who have a positive test result: false positives (FP). These can be arranged into a 2×2 contingency table (confusion matrix), here with the actual condition in the rows and the test result in the columns.

These numbers can then be totaled, yielding both a grand total and marginal totals. Totaling the entire table, the numbers of true positives, false negatives, true negatives, and false positives add up to 100% of the set. Totaling the columns (adding vertically), the numbers of true positives and false positives add up to 100% of the test positives, and likewise for negatives. Totaling the rows (adding horizontally), the numbers of true positives and false negatives add up to 100% of the condition positives (conversely for negatives).

The basic marginal ratio statistics are obtained by dividing the 2×2=4 values in the table by the marginal totals (either rows or columns), yielding 2 auxiliary 2×2 tables, for a total of 8 ratios. These ratios come in 4 complementary pairs, each pair summing to 1, and so each of these derived 2×2 tables can be summarized as a pair of 2 numbers, together with their complements. Further statistics can be obtained by taking ratios of these ratios, or more complicated functions.

The contingency table and the most common derived ratios are summarized below; see sequel for details. Note that the rows correspond to the condition actually being positive or negative (or classified as such by the gold standard), as indicated by the color-coding, and the associated statistics are prevalence-independent, while the columns correspond to the test being positive or negative, and the associated statistics are prevalence-dependent.
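The eight marginal ratios described above, in their four complementary pairs, can be computed directly from the four counts. The sketch below is one possible arrangement; the class name, property names, and the example counts are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # condition positive, test positive
    fn: int  # condition positive, test negative
    fp: int  # condition negative, test positive
    tn: int  # condition negative, test negative

    # Ratios over row totals (actual condition): prevalence-independent
    @property
    def sensitivity(self):          # true positive rate
        return self.tp / (self.tp + self.fn)

    @property
    def false_negative_rate(self):  # complement of sensitivity
        return self.fn / (self.tp + self.fn)

    @property
    def specificity(self):          # true negative rate
        return self.tn / (self.tn + self.fp)

    @property
    def false_positive_rate(self):  # complement of specificity
        return self.fp / (self.tn + self.fp)

    # Ratios over column totals (test result): prevalence-dependent
    @property
    def ppv(self):                  # positive predictive value (precision)
        return self.tp / (self.tp + self.fp)

    @property
    def false_discovery_rate(self):  # complement of PPV
        return self.fp / (self.tp + self.fp)

    @property
    def npv(self):                  # negative predictive value
        return self.tn / (self.tn + self.fn)

    @property
    def false_omission_rate(self):  # complement of NPV
        return self.fn / (self.tn + self.fn)

    @property
    def prevalence(self):
        return (self.tp + self.fn) / (self.tp + self.fn + self.fp + self.tn)

# Illustrative counts for 2000 tested people
cm = ConfusionMatrix(tp=99, fn=1, fp=19, tn=1881)
```

With these counts, sensitivity and specificity are both 0.99, yet `cm.ppv` is only about 0.84 because `cm.prevalence` is 0.05, which shows how the column-based ratios depend on prevalence while the row-based ones do not.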
There are analogous likelihood ratios for prediction values, but these are less commonly used, and not depicted above. == Pairs of metrics == Often accuracy is evaluated with a pair of metrics composed in a standard pattern. === Sensitivity and specificity === The fundamental prevalence-independent statistics are sensitivity and specificity. Sensitivity or True Positive Rate (TPR), also known as recall, is the proportion of people that tested positive and are positive (True Positive, TP) of all the people that actually are positive (Condition Positive, CP = TP + FN). It can be seen as the probability that the test is positive given that the patient is sick. With higher sensitivity, fewer actual cases of disease go undetected (or, in the case of the factory quality control, fewer faulty products go to the market). Specificity (SPC) or True Negative Rate (TNR) is the proportion of people that tested negative and are negative (True Negative, TN) of all the people that actually are negative (Condition Negative, CN = TN + FP). As with sensitivity, it can be looked at as the probability that the test result is negative given that the patient is not sick. With higher specificity, fewer healthy people are labeled as sick (or, in the factory case, fewer good products are discarded). The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the Receiver Operating Characteristic (ROC) curve. In theory, sensitivity and specificity are independent in the sense that it is possible to achieve 100% in both (such as in the red/blue ball example given above). In more practical, less contrived instances, however, there is usually a trade-off, such that they are inversely proportional to one another to some extent. This is because we rarely measure the actual thing we would like to classify; rather, we generally measure an indicator of the thing we would like to classify, referred to as a surrogate marker. 
The reason 100% is achievable in the ball example is that redness and blueness are determined by directly detecting redness and blueness. However, indicators are sometimes compromised, such as when non-indicators mimic indicators or when indicators are time-dependent, only becoming evident after a certain lag time. The following example of a pregnancy test makes use of such an indicator. Modern pregnancy tests do not use the pregnancy itself to determine pregnancy status; rather, they use human chorionic gonadotropin (hCG), present in the urine of gravid females, as a surrogate marker to indicate that a woman is pregnant. Because hCG can also be produced by a tumor, the specificity of modern pregnancy tests cannot be 100% (because false positives are possible). Also, because hCG is present in the urine in such small concentrations after fertilization and early embryogenesis, the sensitivity of modern pregnancy tests cannot be 100% (because false negatives are possible). === Positive and negative predictive values === In addition to sensitivity and specificity, the performance of a binary classification test can be measured with positive predictive value (PPV), also known as precision, and negative predictive value (NPV). The positive predictive value answers the question "If the test result is positive, how well does that predict an actual presence of disease?". It is calculated as TP/(TP + FP); that is, it is the proportion of true positives out of all positive results. The negative predictive value is the analogous quantity for negative results, TN/(TN + FN). ==== Impact of prevalence on predictive values ==== Prevalence has a significant impact on predictive values. As an example, suppose there is a test for a disease with 99% sensitivity and 99% specificity. If 2000 people are tested and the prevalence (in the sample) is 50%, 1000 of them are sick and 1000 of them are healthy. 
Thus about 990 true positives and 990 true negatives are likely, with 10 false positives and 10 false negatives. The positive and negative predictive values would be 99%, so there can be high confidence in the result. However, if the prevalence is only 5%, so that of the 2000 people only 100 are really sick, then the predictive values change significantly. The likely result is 99 true positives, 1 false negative, 1881 true negatives and 19 false positives. Of the 19+99 people who tested positive, only 99 really have the disease – that means, intuitively, that given that a patient's test result is positive, there is only an 84% chance that they really have the disease. On the other hand, given that the patient's test result is negative, there is only 1 chance in 1882, or 0.05% probability, that the patient has the disease despite the test result. === Precision and recall === Precision and recall can be interpreted as (estimated) conditional probabilities: precision is given by $P(C=P \mid \hat{C}=P)$ while recall is given by $P(\hat{C}=P \mid C=P)$, where $\hat{C}$ …
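As a rough numerical check of the prevalence example above, the following Python sketch rebuilds the expected 2×2 counts from an assumed sensitivity, specificity and prevalence, and then derives the ratios discussed in this section. The helper names `expected_counts` and `metrics` are illustrative only and do not come from any library.

```python
def expected_counts(n, prevalence, sensitivity, specificity):
    """Expected (TP, FN, TN, FP) counts for n tested people (hypothetical helper)."""
    sick = n * prevalence
    healthy = n - sick
    tp = sensitivity * sick      # sick people correctly flagged
    fn = sick - tp               # sick people missed by the test
    tn = specificity * healthy   # healthy people correctly cleared
    fp = healthy - tn            # healthy people wrongly flagged
    return tp, fn, tn, fp


def metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV and NPV from the four cell counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


if __name__ == "__main__":
    # Same test (99% sensitivity and specificity) at two prevalences.
    for prevalence in (0.5, 0.05):
        m = metrics(*expected_counts(2000, prevalence, 0.99, 0.99))
        print(f"prevalence={prevalence}: PPV={m['ppv']:.3f}, NPV={m['npv']:.4f}")
```

Sensitivity and specificity stay at 99% in both runs, while the positive predictive value falls from 99% to roughly 84% at the lower prevalence, matching the figures in the text.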

    Read more →
  • Wolfram Mathematica

    Wolfram Mathematica

    Wolfram Mathematica (also known as Mathematica) is a software system with built-in libraries for several areas of technical computing that supports machine learning, statistics, symbolic computation, data manipulation, network analysis, time series analysis, NLP, optimization, plotting of functions and various types of data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other programming languages. It was conceived by Stephen Wolfram, and is developed by Wolfram Research of Champaign, Illinois. The Wolfram Language is the programming language used in Mathematica. Mathematica 1.0 was released on June 23, 1988 in Champaign, Illinois and Santa Clara, California. Mathematica's Wolfram Language is fundamentally based on Lisp; for example, the Mathematica command Most is equivalent to the Lisp command butlast. == Notebook interface == Mathematica is split into two parts: the kernel and the front end. The kernel interprets expressions (Wolfram Language code) and returns result expressions, which can then be displayed by the front end. The original front end, designed by Theodore Gray in 1988, consists of a notebook interface and allows the creation and editing of notebook documents that can contain code, plaintext, images, and graphics. Code development is also supported in a range of standard integrated development environments (IDEs) and tools, including Eclipse, IntelliJ IDEA, Atom, Vim, Visual Studio Code and Git. The Mathematica Kernel also includes a command line front end. Other interfaces include JMath, based on GNU Readline, and WolframScript, which runs self-contained Mathematica programs (with arguments) from the UNIX command line. 
== High-performance computing == Capabilities for high-performance computing were extended with the introduction of packed arrays in version 4 (1999) and sparse matrices (version 5, 2003), and by adopting the GNU Multiple Precision Arithmetic Library to evaluate high-precision arithmetic. Version 5.2 (2005) added automatic multi-threading when computations are performed on multi-core computers. This release included CPU-specific optimized libraries. In addition Mathematica is supported by third party specialist acceleration hardware such as ClearSpeed. In 2002, gridMathematica was introduced to allow user level parallel programming on heterogeneous clusters and multiprocessor systems and in 2008 parallel computing technology was included in all Mathematica licenses including support for grid technology such as Windows HPC Server 2008, Microsoft Compute Cluster Server and Sun Grid. Support for CUDA and OpenCL GPU hardware was added in 2010. == Extensions == As of Version 14, there are 6,602 built-in functions and symbols in the Wolfram Language. Stephen Wolfram announced the launch of the Wolfram Function Repository in June 2019 as a way for the public Wolfram community to contribute functionality to the Wolfram Language. There are currently more than 3000 functions contributed as Resource Functions. In addition to the Wolfram Function Repository, there is a Wolfram Data Repository with computable data and the Wolfram Neural Net Repository for machine learning. Wolfram Mathematica is the basis of the Combinatorica package, which adds discrete mathematics functionality in combinatorics and graph theory to the program. == Connections to other applications, programming languages, and services == Communication with other applications can be done using a protocol called Wolfram Symbolic Transfer Protocol (WSTP). It allows communication between the Wolfram Mathematica kernel and the front end and provides a general interface between the kernel and other applications. 
Wolfram Research freely distributes a developer kit for linking applications written in the programming language C to the Mathematica kernel through WSTP. Using J/Link, a Java program can ask Mathematica to perform computations. Similar functionality is achieved with .NET/Link, but with .NET programs instead of Java programs. Other languages that connect to Mathematica include Haskell, AppleScript, Racket, Visual Basic, Python, and Clojure. Mathematica supports the generation and execution of Modelica models for systems modeling and connects with Wolfram System Modeler. Links are also available to many third-party software packages and APIs. Mathematica can also capture real-time data from a variety of sources and can read and write to public blockchains (Bitcoin, Ethereum, and ARK). It supports import and export of over 220 data, image, video, sound, computer-aided design (CAD), geographic information systems (GIS), document, and biomedical formats. In 2019, support was added for compiling Wolfram Language code to LLVM. Version 12.3 of the Wolfram Language added support for Arduino. == Computable data == Mathematica is also integrated with Wolfram Alpha, an online answer engine that provides additional data, some of which is kept updated in real time, for users who use Mathematica with an internet connection. Some of the data sets include astronomical, chemical, geopolitical, language, biomedical, airplane, and weather data, in addition to mathematical data (such as knots and polyhedra). == Reception == BYTE in 1989 listed Mathematica as among the "Distinction" winners of the BYTE Awards, stating that it "is another breakthrough Macintosh application ... it could enable you to absorb the algebra and calculus that seemed impossible to comprehend from a textbook". Mathematica has been criticized for being closed source. Wolfram Research claims keeping Mathematica closed source is central to its business model and the continuity of the software.

    Read more →
  • Sharpness aware minimization

    Sharpness aware minimization

    Sharpness Aware Minimization (SAM) is an optimization algorithm used in machine learning that aims to improve model generalization. The method seeks to find model parameters that are located in regions of the loss landscape with uniformly low loss values, rather than parameters that only achieve a minimal loss value at a single point. This approach is described as finding "flat" minima instead of "sharp" ones. The rationale is that models trained this way are less sensitive to variations between training and test data, which can lead to better performance on unseen data. The algorithm was introduced in a 2020 paper by a team of researchers including Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. == Underlying Principle == SAM modifies the standard training objective by minimizing a "sharpness-aware" loss. This is formulated as a minimax problem where the inner objective seeks to find the highest loss value in the immediate neighborhood of the current model weights, and the outer objective minimizes this value: $\min_{w} \max_{\|\epsilon\|_{p} \leq \rho} L_{\text{train}}(w+\epsilon) + \lambda \|w\|_{2}^{2}$ In this formulation: $w$ represents the model's parameters (weights). $L_{\text{train}}$ is the loss calculated on the training data. $\epsilon$ is a perturbation applied to the weights. $\rho$ is a hyperparameter that defines the radius of the neighborhood (an $L_{p}$ ball) to search for the highest loss. An optional L2 regularization term, scaled by $\lambda$, can be included. A direct solution to the inner maximization problem is computationally expensive. SAM approximates it by taking a single gradient ascent step to find the perturbation $\epsilon$. 
This is calculated as: $\epsilon(w) = \rho \frac{\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_{2}}$ The optimization process for each training step involves two stages. First, an "ascent step" computes a perturbed set of weights, $w_{\text{adv}} = w + \epsilon(w)$, by moving towards the direction of the highest local loss. Second, a "descent step" updates the original weights $w$ using the gradient calculated at these perturbed weights, $\nabla L_{\text{train}}(w_{\text{adv}})$. This update is typically performed using a standard optimizer like SGD or Adam. == Application and Performance == SAM has been applied in various machine learning contexts, primarily in computer vision. Research has shown it can improve generalization performance in models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on image datasets including ImageNet, CIFAR-10, and CIFAR-100. The algorithm has also been found to be effective in training models with noisy labels, where it performs comparably to methods designed specifically for this problem. Some studies indicate that SAM and its variants can improve out-of-distribution (OOD) generalization, which is a model's ability to perform well on data from distributions not seen during training. Other areas where it has been applied include gradual domain adaptation and mitigating overfitting in scenarios with repeated exposure to training examples. == Limitations == A primary limitation of SAM is its computational cost. By requiring two gradient computations (one for the ascent and one for the descent) per optimization step, it approximately doubles the training time compared to standard optimizers. The theoretical convergence properties of SAM are still under investigation. 
Some research suggests that with a constant step size, SAM may not converge to a stationary point. The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process. The effectiveness of SAM can also be domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as GPT-style language models where each training example is seen only once, has been reported as limited in some studies. Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization. The algorithm also introduces the neighborhood size ρ {\displaystyle \rho } as a new hyperparameter, which requires tuning. == Research, Variants, and Enhancements == Active research on SAM focuses on reducing its computational overhead and improving its performance. Several variants have been proposed to make the algorithm more efficient. These include methods that attempt to parallelize the two gradient computations, apply the perturbation to only a subset of parameters, or reduce the number of computation steps required. Other approaches use historical gradient information or apply SAM steps intermittently to lower the computational burden. To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM) or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM). Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing. Theoretical work continues to analyze the algorithm's behavior, including its implicit bias towards flatter minima and the development of broader frameworks for sharpness-aware optimization that use different measures of sharpness.
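The two-stage update described under Underlying Principle can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy quadratic loss, the `CURVATURE` constants, the function name `sam_step`, the choice of the L2 norm for the perturbation, and plain gradient descent as the base optimizer are all choices made here, not details taken from the original paper.

```python
import math

CURVATURE = [1.0, 10.0]  # assumed toy problem: L(w) = 0.5 * sum(a_i * w_i^2)


def loss(w):
    return 0.5 * sum(a * wi * wi for a, wi in zip(CURVATURE, w))


def grad(w):
    return [a * wi for a, wi in zip(CURVATURE, w)]


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM update: ascent to w_adv, then descent using the gradient at w_adv."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    # epsilon(w) = rho * g / ||g||_2 ; no perturbation if the gradient vanishes
    eps = [0.0] * len(w) if norm == 0.0 else [rho * gi / norm for gi in g]
    w_adv = [wi + ei for wi, ei in zip(w, eps)]   # ascent step
    g_adv = grad_fn(w_adv)                        # second gradient computation
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]  # descent step on original w


if __name__ == "__main__":
    w = [1.0, 1.0]
    print("initial loss:", loss(w))
    for _ in range(100):
        w = sam_step(w, grad)
    print("final loss:", loss(w))
```

Each call evaluates the gradient twice, which is the source of the roughly doubled cost noted under Limitations.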

    Read more →
  • AdaBoost

    AdaBoost

    AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values. AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown to also effectively combine strong base learners (such as deeper decision trees), producing an even more accurate model. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples. == Training == AdaBoost refers to a particular method of training a boosted classifier. 
A boosted classifier is a classifier of the form $F_{T}(x)=\sum_{t=1}^{T}f_{t}(x)$ where each $f_{t}$ is a weak learner that takes an object $x$ as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Each weak learner produces an output hypothesis $h$ which fixes a prediction $h(x_{i})$ for each sample in the training set. At each iteration $t$, a weak learner is selected and assigned a coefficient $\alpha_{t}$ such that the total training error $E_{t}$ of the resulting $t$-stage boosted classifier is minimized. $E_{t}=\sum_{i}E[F_{t-1}(x_{i})+\alpha_{t}h(x_{i})]$ Here $F_{t-1}(x)$ is the boosted classifier that has been built up to the previous stage of training and $f_{t}(x)=\alpha_{t}h(x)$ is the weak learner that is being considered for addition to the final classifier. === Weighting === At each iteration of the training process, a weight $w_{i,t}$ is assigned to each sample in the training set equal to the current error $E(F_{t-1}(x_{i}))$ on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights. 
== Derivation == This derivation follows Rojas (2009): Suppose we have a data set $\{(x_{1},y_{1}),\ldots,(x_{N},y_{N})\}$ where each item $x_{i}$ has an associated class $y_{i}\in\{-1,1\}$, and a set of weak classifiers $\{k_{1},\ldots,k_{L}\}$ each of which outputs a classification $k_{j}(x_{i})\in\{-1,1\}$ for each item. After the $(m-1)$-th iteration our boosted classifier is a linear combination of the weak classifiers of the form: $C_{(m-1)}(x_{i})=\alpha_{1}k_{1}(x_{i})+\cdots+\alpha_{m-1}k_{m-1}(x_{i}),$ where the class will be the sign of $C_{(m-1)}(x_{i})$. At the $m$-th iteration we want to extend this to a better boosted classifier by adding another weak classifier $k_{m}$, with another weight $\alpha_{m}$: $C_{m}(x_{i})=C_{(m-1)}(x_{i})+\alpha_{m}k_{m}(x_{i})$ So it remains to determine which weak classifier is the best choice for $k_{m}$, and what its weight $\alpha_{m}$ should be. 
We define the total error $E$ of $C_{m}$ as the sum of its exponential loss on each data point, given as follows: $E=\sum_{i=1}^{N}e^{-y_{i}C_{m}(x_{i})}=\sum_{i=1}^{N}e^{-y_{i}C_{(m-1)}(x_{i})}e^{-y_{i}\alpha_{m}k_{m}(x_{i})}$ Letting $w_{i}^{(1)}=1$ and $w_{i}^{(m)}=e^{-y_{i}C_{(m-1)}(x_{i})}$ for $m>1$, we have: $E=\sum_{i=1}^{N}w_{i}^{(m)}e^{-y_{i}\alpha_{m}k_{m}(x_{i})}$ We can split this summation between those data points that are correctly classified by $k_{m}$ (so $y_{i}k_{m}(x_{i})=1$) and those that are misclassified (so $y_{i}k_{m}(x_{i})=-1$): $E=\sum_{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha_{m}}+\sum_{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha_{m}}=\sum_{i=1}^{N}w_{i}^{(m)}e^{-\alpha_{m}}+\sum_{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}\left(e^{\alpha_{m}}-e^{-\alpha_{m}}\right)$ Since the only part of the right-hand side of this equation that depends on $k_{m}$ is $\sum_{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}$, we see that the $k_{m}$ that minimizes $E$ is the one in the set $\{k_{1},\ldots,k_{L}\}$ that minimizes $\sum_{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}$ (assuming that $\alpha_{m}>0$), i.e. the weak classifier with the lowest weighted error (with weights $w_{i}^{(m)}=e^{-y_{i}C_{(m-1)}(x_{i})}$). To determine the desired weight $\alpha_{m}$ that minimizes $E$ with the $k_{m}$ that we just determined, we differentiate: $\frac{dE}{d\alpha_{m}}=\frac{d\left(\sum_{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha_{m}}+\sum_{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha_{m}}\right)}{d\alpha_{m}}$ Setting this derivative to zero and solving for $\alpha_{m}$ gives the value that minimizes $E$: $\alpha_{m}=\frac{1}{2}\ln\left(\frac{\sum_{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}}{\sum_{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}\right)$ We calculate the weighted error rate of the weak classifier to be $\epsilon_{m}=\frac{\sum_{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}{\sum_{i=1}^{N}w_{i}^{(m)}}$, so it follows that: $\alpha_{m}=\frac{1}{2}\ln\left(\frac{1-\epsilon_{m}}{\epsilon_{m}}\right)$ which is the negative logit function multiplied by 0.5. Due to the convexity of $E$ as a function of $\alpha_{m}$, this new expression for $\alpha_{m}$ gives the global minimum of the loss function. 
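The derivation above can be turned into a short Python sketch. It is a minimal illustration, assuming one-dimensional inputs and threshold "stumps" as the set of weak classifiers; the function names (`stump`, `adaboost`, `predict`), the candidate thresholds, and the clamping of the error rate away from 0 and 1 are choices made here for the example, not part of the derivation.

```python
import math


def stump(threshold, polarity):
    """Weak classifier k(x): returns polarity if x < threshold, else -polarity."""
    return lambda x: polarity if x < threshold else -polarity


def adaboost(xs, ys, rounds):
    """Greedy AdaBoost with exponential loss; returns a list of (alpha, threshold, polarity)."""
    points = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(points, points[1:])]
    candidates = [(t, s) for t in thresholds for s in (1, -1)]
    weights = [1.0] * len(xs)  # w_i^(1) = 1
    ensemble = []
    for _ in range(rounds):
        total = sum(weights)
        best = None
        for t, s in candidates:
            k = stump(t, s)
            # weighted error rate epsilon_m of this candidate
            err = sum(w for x, y, w in zip(xs, ys, weights) if k(x) != y) / total
            if best is None or err < best[0]:
                best = (err, t, s)
        err, t, s = best
        err = min(max(err, 1e-12), 1 - 1e-12)    # guard against log(0); an assumption
        alpha = 0.5 * math.log((1 - err) / err)  # alpha_m = 1/2 ln((1 - eps_m) / eps_m)
        k = stump(t, s)
        # w_i^(m+1) = w_i^(m) * exp(-y_i * alpha_m * k_m(x_i))
        weights = [w * math.exp(-alpha * y * k(x)) for x, y, w in zip(xs, ys, weights)]
        ensemble.append((alpha, t, s))
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum C_m(x)."""
    score = sum(alpha * stump(t, s)(x) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    xs = list(range(10))
    ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]  # not separable by any single stump
    model = adaboost(xs, ys, rounds=3)
    print(model)
    print([predict(model, x) for x in xs])
```

The weights are left unnormalized, as in the derivation; normalizing them each round would change neither the selected classifier nor $\alpha_{m}$, because $\epsilon_{m}$ is already a ratio of weight sums.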
Note: This derivation only applies when $k_{m}(x_{i})\in\{-1,1\}$, though it can be a good starting guess in other cases, such as when the weak learner is biased ($k_{m}(x)\in\{a,b\}, a\neq -b$), has multiple leaves ($k_{m}(x)\in\{a,b,\dots,n\}$) or is some other function $k_{m}(x)\in\mathbb{R}$. Thus we have derived the AdaBoost algorithm: At each

    Read more →
  • UpScrolled

    UpScrolled

    UpScrolled is an Australian social media platform for microblogging and short-form online video sharing that was launched in June 2025 by Recursive Methods Pty Ltd. It was founded by Issam Hijazi. == History == UpScrolled was launched in June 2025 by Recursive Methods Pty Ltd. It was founded by Issam Hijazi, a Palestinian-Australian app developer. UpScrolled is backed by the Tech for Palestine incubator. In January 2026, UpScrolled saw increased attention and number of downloads after the acquisition of TikTok by a group of pro-Donald Trump US investors, including Larry Ellison, which led to calls to boycott TikTok and migrate to other apps. TikTok was alleged to be suppressing pro-Palestinian content, as well as news surrounding the killing of Alex Pretti in Minneapolis on the platform. UpScrolled subsequently climbed to the top 10 of Apple's App Store list of free apps. The app saw a reported 2,850% increase in downloads between 22 and 24 January 2026. As of 27 January 2026, UpScrolled "had been downloaded about 400,000 times in the US and 700,000 globally since launching in June 2025". The app became the most downloaded app in the Apple App store on 29 January 2026, following allegations that TikTok was suppressing videos and content opposed to Immigration and Customs Enforcement (ICE) under its new ownership. By 2 February 2026, UpScrolled had reached 2.5 million users. According to the Google Play Store and the Apple App Store, it has become the most downloaded social media app in the United States and Canada, with rising interest in the United Kingdom, France, Germany and Italy. On 14 February, UpScrolled was suspended from the Google Play Store; the suspension was reverted by 15 February. == Founder == Hijazi was born in Jordan. His parents and grandparents are from Safad, a northern Israeli city near the Lebanese border. He worked for IBM and Oracle prior to starting UpScrolled. 
Hijazi told Rest of World that he launched UpScrolled in response to Israel's genocide in Gaza which followed the October 7 attacks. He said, "I couldn't take it anymore. I lost family members in Gaza, and I didn't want to be complicit. So I was like, I'm done with this, I want to feel useful. I found this gap in the market, with a lot of people asking why there is no alternative to the Big Tech platforms for their content, which was getting censored." Hijazi also alleges that social media accounts that were posting pro-Palestinian content were getting shadow banned on larger platforms, and alleges that even his account was not exempt from being targeted by censors. Hijazi has further elaborated on the importance of social media independence to further the Palestinian cause. In January 2026, Web Summit Qatar announced that Hijazi would be an opening night speaker. Following the announcement, there was a surge in ticket sales for the summit. Hijazi lives in Sydney with his wife and daughter. He lost 60 family members during the Gaza war. == Features == UpScrolled's algorithm allows users to discover posts based on likes, comments, and shares with time decay and some randomness, all chronologically, with "no manipulation" according to the app's website. UpScrolled has an interface resembling a mix of Instagram and Twitter, allowing users to post and view text posts, photos, and videos. It also lets users send private messages to each other. The app is currently available for iOS and Android devices, with plans to upscale. UpScrolled does not include Israel as an option in its location selection menu. Cities such as Tel Aviv are included under "Occupied Territories of Palestine", and Palestine can also be set as the location. UpScrolled says that it is against censorship and shadow banning, and describes itself as "belong[ing] to the people who use it — not to hidden algorithms or outside agendas". Hijazi said, "The other platforms claim to be free speech platforms. 
But when it comes to anything on Palestine, that's a different story." UpScrolled states that it "does not tolerate hate speech, propaganda, or bad-faith behaviour, but it also refuses to silence voices quietly or without explanation". == User base and content == Al Jazeera reported that posts expressing pro-Palestinian sentiment or depicting the continued suffering in the Gaza Strip were "flooding" the app. Political and global issues such as the Gaza war are prominent. Content includes updates from the Gaza Freedom Flotilla, posts by doctors working in Gaza, video essays about Palantir’s influence within the military and calls for boycotts of Israel. It has been used by Gazans to crowdfund and record daily life. Celebrity users of UpScrolled include American labour activist Chris Smalls and actor Jacob Berger, both of whom were on the July 2025 Gaza Freedom Flotilla. Political figures have also joined UpScrolled, such as South African politician and Economic Freedom Fighters leader Julius Malema, and Islamic Revolutionary Guard Corps commander Esmail Qaani. One user said that most early users were attracted to the platform for the opportunity to criticize Zionism. The Jewish Telegraphic Agency (JTA) reported that UpScrolled was observed to be "flooded" with antisemitic and anti-Israel content, including Holocaust denial and accusations that Israel carried out the 9/11 attacks. In a statement, UpScrolled said, "Our content moderation hasn't been able to keep up with the massive rise of users this week. We're working with digital rights experts to grow our Trust & Safety team and are beefing up our content moderation to prevent this. We apologise to all impacted users, thank you for being part of Upscrolled." 
The Times reported in February 2026 that UpScrolled was hosting content that could potentially breach UK law, including antisemitic content and posts promoting Hamas, Hezbollah, Islamic State and Al-Qaeda, as well as footage of the 2019 Christchurch mosque shootings and content praising the perpetrators of the 2019 Halle synagogue shooting and 2018 Pittsburgh synagogue shooting. Antisemitic influencers Lucas Gage, Jake Shields, Stew Peters and Anastasia Maria Loupis have accounts on UpScrolled. UpScrolled’s policies prohibit threats, glorification of harm or support for terrorist or violent groups. Hijazi said harmful content was being uploaded to UpScrolled and the company had expanded its content moderation team and upgraded its technology infrastructure to deal with the issue. In May 2026, Moment magazine said that users had identified some antisemitic content, pornography and extremist videos on the platform. The magazine said there were gaps in content moderation due to the small size of the developer team. == Reception == In January 2026, the Council on American–Islamic Relations (CAIR) praised UpScrolled for "pledging to protect the free flow of ideas on its platform, including both support for and opposition to the Israeli government's human rights abuses." Guy Christensen, a pro-Palestinian social media celebrity, has encouraged his audience to download UpScrolled. Christensen characterized UpScrolled as having "no censorship, no ownership by billionaires who put their interests and biases onto you to control you". He compared the platform to others like TikTok, saying that Israel is behind censorship that wouldn't happen on UpScrolled. Jaigris Hodson, an associate professor of Interdisciplinary Studies at Royal Roads University in Canada, has argued that "Network effects mean that unless UpScrolled continues its explosive growth, people are unlikely to continue to choose it over the more established TikTok. 
At best, we might see a Twitter/X effect, which is where TikTok will host more pro-U.S. government content creators and those people who want to follow them, and UpScrolled will host more critical content creators and their followers."

    Read more →
  • Constructing skill trees

    Constructing skill trees

    Constructing skill trees (CST) is a hierarchical reinforcement learning algorithm which can build skill trees from a set of sample solution trajectories obtained from demonstration. CST uses an incremental MAP (maximum a posteriori) change point detection algorithm to segment each demonstration trajectory into skills and integrate the results into a skill tree. CST was introduced by George Konidaris, Scott Kuindersma, Andrew Barto and Roderic Grupen in 2010. == Algorithm == CST consists of mainly three parts;change point detection, alignment and merging. The main focus of CST is online change-point detection. The change-point detection algorithm is used to segment data into skills and uses the sum of discounted reward R t {\displaystyle R_{t}} as the target regression variable. Each skill is assigned an appropriate abstraction. A particle filter is used to control the computational complexity of CST. The change point detection algorithm is implemented as follows. The data for times t ∈ T {\displaystyle t\in T} and models Q with prior p ( q ∈ Q ) {\displaystyle p(q\in Q)} are given. The algorithm is assumed to be able to fit a segment from time j + 1 {\displaystyle j+1} to t using model q with the fit probability P ( j , t , q ) {\displaystyle P(j,t,q)_{}^{}} . A linear regression model with Gaussian noise is used to compute P ( j , t , q ) {\displaystyle P(j,t,q)} . The Gaussian noise prior has mean zero, and variance which follows I n v e r s e G a m m a ( v 2 , u 2 ) {\displaystyle \mathrm {InverseGamma} \left({\frac {v}{2}},{\frac {u}{2}}\right)} . The prior for each weight follows N o r m a l ( 0 , σ 2 δ ) {\displaystyle \mathrm {Normal} (0,\sigma ^{2}\delta )} . The fit probability P ( j , t , q ) {\displaystyle P(j,t,q)} is computed by the following equation. 
$$P(j,t,q)=\frac{\pi^{-\frac{n}{2}}}{\delta^{m}}\left|(A+D)^{-1}\right|^{\frac{1}{2}}\frac{u^{\frac{v}{2}}}{(y+u)^{\frac{u+v}{2}}}\frac{\Gamma\left(\frac{n+v}{2}\right)}{\Gamma\left(\frac{v}{2}\right)}$$
Then, CST computes the probability of the changepoint at time $j$ with model $q$, $P_t(j,q)$, and $P_j^{\text{MAP}}$ using a Viterbi algorithm.
$$P_t(j,q)=(1-G(t-j-1))\,P(j,t,q)\,p(q)\,P_j^{\text{MAP}}$$
$$P_j^{\text{MAP}}=\max_{i,q}\frac{P_j(i,q)\,g(j-i)}{1-G(j-i-1)},\quad\forall j<t$$

    Read more →

  • Loss function

    Loss function

    In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy. In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century. In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s. In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss. == Examples == === Regret === Leonard J. Savage argued that when using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made had the underlying circumstances been known and the decision that was in fact taken before they were known. === Quadratic loss function === The use of a quadratic loss function is common, for example when using least squares techniques.
It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is $t$, then a quadratic loss function is $\lambda(x)=C(t-x)^{2}$ for some constant $C$; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1. This is also known as the squared error loss (SEL). Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function. The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used. Because errors are squared, the quadratic loss gives large deviations disproportionate weight, so alternatives like the Huber, log-cosh and SMAE losses are used when the data has many large outliers. === 0-1 loss function === In statistics and decision theory, a frequently used loss function is the 0-1 loss function $$L(\hat{y},y)=\begin{cases}0&\text{if }y=\hat{y}\\1&\text{if }y\neq\hat{y}\end{cases}$$ In information theory, this loss function is known as Hamming distortion.
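The two example losses above can be written down directly. The following is a minimal Python sketch; the function names and the sample values are illustrative assumptions, not part of the original text.

```python
def quadratic_loss(x, t, c=1.0):
    """Quadratic (squared error) loss C * (t - x)**2 around target t."""
    return c * (t - x) ** 2


def zero_one_loss(y_hat, y):
    """0-1 loss: 0 for a correct prediction, 1 for an incorrect one."""
    return 0 if y_hat == y else 1


# Symmetry: overshooting and undershooting the target by 2 cost the same.
print(quadratic_loss(7.0, t=5.0), quadratic_loss(3.0, t=5.0))

# Squaring makes one large error dominate the total loss.
errors = [1.0, 1.0, 10.0]
print([quadratic_loss(e, t=0.0) for e in errors])

# Averaging the 0-1 loss over examples gives the misclassification rate.
predicted = ["a", "b", "b", "a"]
actual = ["a", "b", "a", "a"]
error_rate = sum(zero_one_loss(p, a) for p, a in zip(predicted, actual)) / len(actual)
print(error_rate)
```

Setting `c` to any other positive constant rescales every loss equally, which is why it does not change which decision is best.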
== Constructing loss and objective functions == In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker’s preference must be elicited and represented by a scalar-valued function (also called a utility function) in a form suitable for optimization, a problem that Ragnar Frisch highlighted in his Nobel Prize lecture. The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences. In particular, Andranik Tangian showed that the most usable objective functions, quadratic and additive, are determined by a few indifference points. He used this property in the models for constructing these objective functions from either ordinal or cardinal data that were elicited through computer-assisted interviews with decision makers. Among other things, he constructed objective functions to optimally distribute budgets for 16 Westphalian universities and the European subsidies for equalizing unemployment rates among 271 German regions. == Expected loss == In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable $X$. === Statistics === Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms. ==== Frequentist expected loss ==== We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, $P_{\theta}$, of the observed data, $X$. This is also referred to as the risk function of the decision rule $\delta$ and the parameter $\theta$. Here the decision rule depends on the outcome of $X$.
The risk function is given by: $$R(\theta,\delta)=\operatorname{E}_{\theta}L\big(\theta,\delta(X)\big)=\int_{X}L\big(\theta,\delta(x)\big)\,\mathrm{d}P_{\theta}(x).$$ Here, $\theta$ is a fixed but possibly unknown state of nature, $X$ is a vector of observations stochastically drawn from a population, $\operatorname{E}_{\theta}$ is the expectation over all population values of $X$, $\mathrm{d}P_{\theta}$ is a probability measure over the event space of $X$ (parametrized by $\theta$) and the integral is evaluated over the entire support of $X$. ==== Bayes Risk ==== In a Bayesian approach, the expectation is calculated using the prior distribution $\pi^{*}$ of the parameter $\theta$: $$\rho(\pi^{*},a)=\int_{\Theta}\int_{\mathbf{X}}L(\theta,a(\mathbf{x}))\,\mathrm{d}P(\mathbf{x}\vert\theta)\,\mathrm{d}\pi^{*}(\theta)=\int_{\mathbf{X}}\int_{\Theta}L(\theta,a(\mathbf{x}))\,\mathrm{d}\pi^{*}(\theta\vert\mathbf{x})\,\mathrm{d}M(\mathbf{x})$$ where $M(\mathbf{x})$ is known as the predictive likelihood wherein $\theta$ has been "integrated out," $\pi^{*}(\theta\vert\mathbf{x})$ is the posterior distribution, and the order of integration has been changed. One then should choose the action $a^{*}$ which minimises this expected loss, which is referred to as Bayes Risk.
In the latter equation, the integrand of the outer integral over $\mathbf{x}$ (that is, the inner integral over $\theta$) is known as the Posterior Risk, and minimising it with respect to decision $a$ also minimises the overall Bayes Risk. This optimal decision, $a^{*}$, is known as the Bayes (decision) Rule: it minimises the average loss over all possible states of nature $\theta$, over all possible (probability-weighted) data outcomes. One advantage of the Bayesian approach is that one need only choose the optimal action under the actual observed data to obtain a uniformly optimal one, whereas choosing the actual frequentist optimal decision rule as a function of all possible observations is a much more difficult problem. Of equal importance though, the Bayes Rule reflects consideration of loss outcomes under different states of nature, $\theta$. ==== Examples in statistics ==== For a scalar parameter $\theta$, a decision function whose output $\hat{\theta}$ is an estimate of $\theta$, and a quadratic loss function (squared error loss) $$L(\theta,\hat{\theta})=(\theta-\hat{\theta})^{2},$$ the risk function becomes the mean squared error of the estimate, $$R(\theta,\hat{\theta})=\operatorname{E}_{\theta}\left[(\theta-\hat{\theta})^{2}\right].$$
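As an illustration of the frequentist risk under squared error loss, the expectation can be approximated by simulation. The sketch below assumes a setting not given in the text: the data are `n` independent normal draws with known standard deviation `sigma`, and the decision rule is the sample mean, for which the exact risk is `sigma**2 / n`. The function name and all parameter values are hypothetical.

```python
import random


def estimate_risk(theta, n, trials=20000, sigma=1.0, seed=0):
    """Monte Carlo estimate of E_theta[(theta - theta_hat)^2] for the sample mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw one data set X from P_theta and apply the decision rule.
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        theta_hat = sum(sample) / n
        # Accumulate the squared error loss for this data set.
        total += (theta - theta_hat) ** 2
    return total / trials


risk = estimate_risk(theta=2.0, n=10)
print(risk)  # should be close to sigma**2 / n = 0.1
```

Averaging the loss over many simulated data sets, with the parameter held fixed, mirrors the integral over the distribution of the data in the definition of the risk function.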

    Read more →
  • Relief (feature selection)

    Relief (feature selection)

    Relief is an algorithm developed by Kenji Kira and Larry Rendell in 1992 that takes a filter-method approach to feature selection that is notably sensitive to feature interactions. It was originally designed for application to binary classification problems with discrete or numerical features. Relief calculates a feature score for each feature which can then be applied to rank and select top scoring features for feature selection. Alternatively, these scores may be applied as feature weights to guide downstream modeling. Relief feature scoring is based on the identification of feature value differences between nearest neighbor instance pairs. If a feature value difference is observed in a neighboring instance pair with the same class (a 'hit'), the feature score decreases. Alternatively, if a feature value difference is observed in a neighboring instance pair with different class values (a 'miss'), the feature score increases. The original Relief algorithm has since inspired a family of Relief-based feature selection algorithms (RBAs), including the ReliefF algorithm. Beyond the original Relief algorithm, RBAs have been adapted to (1) perform more reliably in noisy problems, (2) generalize to multi-class problems, (3) generalize to numerical outcome (i.e. regression) problems, and (4) be robust to incomplete (i.e. missing) data. To date, the development of RBA variants and extensions has focused on four areas: (1) improving performance of the 'core' Relief algorithm, i.e. examining strategies for neighbor selection and instance weighting, (2) improving scalability of the 'core' Relief algorithm to larger feature spaces through iterative approaches, (3) methods for flexibly adapting Relief to different data types, and (4) improving Relief run efficiency.
Their strengths are that they are not dependent on heuristics, they run in low-order polynomial time, and they are noise-tolerant and robust to feature interactions, as well as being applicable for binary or continuous data; however, they do not discriminate between redundant features, and a low number of training instances can mislead the algorithm. == Relief Algorithm == Take a data set with n instances of p features, belonging to two known classes. Within the data set, each feature should be scaled to the interval [0, 1] (binary data should remain as 0 and 1). The algorithm will be repeated m times. Start with a p-long weight vector (W) of zeros. At each iteration, take the feature vector (X) belonging to one random instance, and the feature vectors of the instance closest to X (by Euclidean distance) from each class. The closest same-class instance is called 'near-hit', and the closest different-class instance is called 'near-miss'. Update the weight vector such that $$W_{i}=W_{i}-(x_{i}-\mathrm{nearHit}_{i})^{2}+(x_{i}-\mathrm{nearMiss}_{i})^{2},$$ where $i$ indexes the components and runs from 1 to p. Thus the weight of any given feature decreases if it differs from that feature in nearby instances of the same class more than nearby instances of the other class, and increases in the reverse case. After m iterations, divide each element of the weight vector by m. This becomes the relevance vector. Features are selected if their relevance is greater than a threshold τ. Kira and Rendell's experiments showed a clear contrast between relevant and irrelevant features, allowing τ to be determined by inspection. However, it can also be determined by Chebyshev's inequality for a given confidence level (α) that $\tau = 1/\sqrt{\alpha m}$ is good enough to make the probability of a Type I error less than α, although it is stated that τ can be much smaller than that.
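The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the toy data set and the fixed random seed are hypothetical, features are assumed to be already scaled to [0, 1], and each class is assumed to contain at least two instances.

```python
import math
import random


def relief(X, y, m, seed=0):
    """Basic two-class Relief; returns the relevance vector (weights divided by m)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    w = [0.0] * p  # weight vector W, initialised to zeros

    def sq_dist(a, b):
        # Squared Euclidean distance; it has the same nearest neighbour as the distance itself.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(m):
        i = rng.randrange(n)  # pick one random instance
        x = X[i]
        hits = [X[j] for j in range(n) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(n) if y[j] != y[i]]
        near_hit = min(hits, key=lambda h: sq_dist(x, h))
        near_miss = min(misses, key=lambda h: sq_dist(x, h))
        for k in range(p):
            # W_k = W_k - (x_k - nearHit_k)^2 + (x_k - nearMiss_k)^2
            w[k] += -(x[k] - near_hit[k]) ** 2 + (x[k] - near_miss[k]) ** 2
    return [wk / m for wk in w]


def relief_threshold(alpha, m):
    """Chebyshev-based threshold tau = 1 / sqrt(alpha * m)."""
    return 1.0 / math.sqrt(alpha * m)


# Toy data: feature 0 separates the classes, feature 1 is unrelated to the class.
X = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.5], [0.8, 0.1], [0.9, 0.8], [1.0, 0.4]]
y = [0, 0, 0, 1, 1, 1]
relevance = relief(X, y, m=20)
print(relevance)
print(relief_threshold(alpha=0.05, m=100))
```

On this toy data the first feature receives a clearly higher relevance than the second, which is the kind of contrast that lets τ be chosen by inspection.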
Relief was also described as generalizable to multinomial classification by decomposition into a number of binary problems. == ReliefF Algorithm == Kononenko et al. propose a number of updates to Relief. Firstly, they find the near-hit and near-miss instances using the Manhattan (L1) norm rather than the Euclidean (L2) norm, although the rationale is not specified. Furthermore, they found taking the absolute differences between $x_i$ and $\mathrm{nearHit}_i$, and between $x_i$ and $\mathrm{nearMiss}_i$, to be sufficient when updating the weight vector (rather than the square of those differences). === Reliable probability estimation === Rather than repeating the algorithm m times, implement it exhaustively (i.e. n times, once for each instance) for relatively small n (up to one thousand). Furthermore, rather than finding the single nearest hit and single nearest miss, which may cause redundant and noisy attributes to affect the selection of the nearest neighbors, ReliefF searches for k nearest hits and misses and averages their contribution to the weights of each feature. k can be tuned for any individual problem. === Incomplete data === In ReliefF, the contribution of missing values to the feature weight is determined using the conditional probability that two values should be the same or different, approximated with relative frequencies from the data set. This can be calculated if one or both features are missing. === Multi-class problems === Rather than use Kira and Rendell's proposed decomposition of a multinomial classification into a number of binomial problems, ReliefF searches for k near misses from each different class and averages their contributions for updating W, weighted with the prior probability of each class. == Other Relief-based Algorithm Extensions/Derivatives == The following RBAs are arranged chronologically from oldest to most recent.
They include methods for improving (1) the core Relief algorithm concept, (2) iterative approaches for scalability, (3) adaptations to different data types, (4) strategies for computational efficiency, or (5) some combination of these goals. For more on RBAs see these book chapters or this most recent review paper. === RRELIEFF === Robnik-Šikonja and Kononenko propose further updates to ReliefF, making it appropriate for regression. === Relieved-F === Introduced deterministic neighbor selection approach and a new approach for incomplete data handling. === Iterative Relief === Implemented method to address bias against non-monotonic features. Introduced the first iterative Relief approach. For the first time, neighbors were uniquely determined by a radius threshold and instances were weighted by their distance from the target instance. === I-RELIEF === Introduced sigmoidal weighting based on distance from target instance. All instance pairs (not just a defined subset of neighbors) contributed to score updates. Proposed an on-line learning variant of Relief. Extended the iterative Relief concept. Introduced local-learning updates between iterations for improved convergence. === TuRF (a.k.a. Tuned ReliefF) === Specifically sought to address noise in large feature spaces through the recursive elimination of features and the iterative application of ReliefF. === Evaporative Cooling ReliefF === Similarly seeking to address noise in large feature spaces. Utilized an iterative `evaporative' removal of lowest quality features using ReliefF scores in association with mutual information. === EReliefF (a.k.a. Extended ReliefF) === Addressing issues related to incomplete and multi-class data. === VLSReliefF (a.k.a. Very Large Scale ReliefF) === Dramatically improves the efficiency of detecting 2-way feature interactions in very large feature spaces by scoring random feature subsets rather than the entire feature space. 
=== ReliefMSS === Introduced calculation of feature weights relative to average feature 'diff' between instance pairs. === SURF === SURF identifies nearest neighbors (both hits and misses) based on a distance threshold from the target instance defined by the average distance between all pairs of instances in the training data. Results suggest improved power to detect 2-way epistatic interactions over ReliefF. === SURF* (a.k.a. SURFStar) === SURF* extends the SURF algorithm to utilize not only 'near' neighbors in scoring updates, but 'far' instances as well, employing inverted scoring updates for 'far' instance pairs. Results suggest improved power to detect 2-way epistatic interactions over SURF, but an inability to detect simple main effects (i.e. univariate associations). === SWRF* === SWRF* extends the SURF* algorithm adopting sigmoid weighting to take distance from the threshold into account. Also introduced a modular framework for further developing RBAs called MoRF. === MultiSURF* (a.k.a. MultiSURFStar) === MultiSURF* extends the SURF* algorithm adapting the near/far neighborhood boundaries based on the average and standard deviation of distances from the target instance to all others. MultiSURF* uses the standard deviation to define a dead-band zone where 'middle-distance' instances do not contribute to scoring. Evidence suggests MultiSURF* performs best in detecting pure 2-way feature interactions. === Reli

    Read more →
  • Too Good To Go

    Too Good To Go

    Too Good To Go is a service with a mobile application that connects customers to restaurants and stores that have surplus unsold food. The service covers major European cities, and in October 2020 started operations in North America. As part of the initiatives taken on the International Day of Awareness of Food Loss and Waste to reduce food loss and waste, the app is suggested alongside OLIO among many others. In 2023 Too Good To Go was the fastest-growing sustainable food app startup by number of downloads. As of August 2023, it claimed that 164,000 businesses, serving 62 million users, had saved 155 million bags of food. As of March 2023, it claimed to have saved over 200 million meals. == History == The company was created in 2015 in Denmark by Thomas Bjørn Momsen, Klaus Bagge Pedersen, Adam Sigbrand and Brian Christensen. In 2017, Mette Lykke (co-founder of Endomondo) joined as CEO. In February 2019, the company raised an additional 6 million euros in a new round of investment. In August 2019, Too Good To Go was re-launched in Austria. In September 2019, Too Good To Go acquired the Spanish startup weSAVEeat and merged it into its own brand. In November 2019, Too Good To Go extended its offer to plants through a partnership with the French plant retailer Jardiland. In December 2019, Too Good To Go partnered with the French grocery retailer Intermarché, and donated 60,000 euros to the French charity Restaurants du Cœur. In October 2021, Bonnie Wright teamed up with Too Good To Go to drive the initiative to reduce food waste. == Corporate affairs == The key trends for the Danish entity Too Good To Go ApS are (as of the financial year ending December 31): == International expansion == As of March 2026 the company serves the European countries Austria, Belgium, Czechia, Denmark, the Faroe Islands, France, Germany, Ireland, Italy, the Netherlands, Norway, Poland, Portugal, Spain, Sweden, Switzerland and the United Kingdom.
Outside of Europe the service is available in Australia, Canada, Japan, New Zealand and the United States. == Purpose == The purpose of Too Good To Go is to reduce food waste worldwide. It developed a mobile application that connects restaurants and stores that have unsold, surplus food, with customers who can then buy whatever food the outlet considers surplus to requirements—without being able to choose—at a much lower price than normal. The food on the app is priced at one-third its original price. The company claims this reduces the waste of food that would otherwise be discarded; food waste is a global problem that affects the environment. In three years active, the app reached more than 9.5 million users. As of 2022, more than 57.7 million users and 154,000 establishments have signed up, and 139 million meals have been collected. In 2019, the company had 350 employees in Europe. As of June 2023 the company was estimated to have 1,289 employees. == Use == Food outlets must notify the TGTG company about what they have available on each day, stating what sort of food they have (baked foods, meals, produce, vegan food), and the price for a 'surprise bag', whose contents they determine; the user cannot choose, but the original prices will be three or more times the TGTG price. Notification is made early based upon the quantity predicted to be left over, not at the end of a selling period. Users must register to use the service. A mobile phone with an Internet connection running Android or iOS is needed. The user runs the TGTG app, which lists outlets available within a chosen distance and time range. The customer can then order and pay for a 'surprise bag'. The supplier can cancel an order at any time if the expected surplus is not available—the purchaser is notified by text message—and the purchaser can cancel with two hours' notice. 
The phone must be taken to the food supplier in a specified pickup time window, often 30 or 60 minutes long, and the transaction is finalised by swiping the app—connected to the Internet—to confirm collection.

    Read more →
  • SqueezeNet

    SqueezeNet

    SqueezeNet is a deep neural network for image classification released in 2016. SqueezeNet was developed by researchers at DeepScale, University of California, Berkeley, and Stanford University. In designing SqueezeNet, the authors' goal was to create a smaller neural network with fewer parameters while achieving competitive accuracy. Their best-performing model achieved the same accuracy as AlexNet on ImageNet classification while being 510 times smaller. == Version history == SqueezeNet was originally released on February 22, 2016. This original version of SqueezeNet was implemented on top of the Caffe deep learning software framework. Shortly thereafter, the open-source research community ported SqueezeNet to a number of other deep learning frameworks. On February 26, 2016, Eddie Bell released a port of SqueezeNet for the Chainer deep learning framework. On March 2, 2016, Guo Haria released a port of SqueezeNet for the Apache MXNet framework. On June 3, 2016, Tammy Yang released a port of SqueezeNet for the Keras framework. In 2017, companies including Baidu, Xilinx, Imagination Technologies, and Synopsys demonstrated SqueezeNet running on low-power processing platforms such as smartphones, FPGAs, and custom processors. As of 2018, SqueezeNet ships "natively" as part of the source code of a number of deep learning frameworks such as PyTorch, Apache MXNet, and Apple CoreML. In addition, third party developers have created implementations of SqueezeNet that are compatible with frameworks such as TensorFlow. Below is a summary of frameworks that support SqueezeNet. == Relationship to other networks == === AlexNet === SqueezeNet was originally described in "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size". AlexNet is a deep neural network that has 240 MB of parameters, and SqueezeNet has just 5 MB of parameters.
This small model size can more easily fit into computer memory and can more easily be transmitted over a computer network. However, it's important to note that SqueezeNet is not a "squeezed version of AlexNet." Rather, SqueezeNet is an entirely different DNN architecture than AlexNet. What SqueezeNet and AlexNet have in common is that both of them achieve approximately the same level of accuracy when evaluated on the ImageNet image classification validation dataset. === Model compression === Model compression (e.g. quantization and pruning of model parameters) can be applied to a deep neural network after it has been trained. In the SqueezeNet paper, the authors demonstrated that a model compression technique called Deep Compression can be applied to SqueezeNet to further reduce the size of the parameter file from 5 MB to 500 KB. Deep Compression has also been applied to other DNNs, such as AlexNet and VGG. == Variants == Some of the members of the original SqueezeNet team have continued to develop resource-efficient deep neural networks for a variety of applications. A few of these works are noted in the following table. As with the original SqueezeNet model, the open-source research community has ported and adapted these newer "squeeze"-family models for compatibility with multiple deep learning frameworks. In addition, the open-source research community has extended SqueezeNet to other applications, including semantic segmentation of images and style transfer.
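The size figures quoted above can be checked with simple arithmetic. The sketch below uses only the rounded sizes stated in the text (240 MB, 5 MB and 500 KB, taken here as 0.5 MB); the variable names are illustrative, and the assumption that the headline "50x" and "510x" figures come from unrounded sizes is a guess rather than something the text confirms.

```python
# Rounded model sizes quoted in the text, in megabytes.
ALEXNET_MB = 240.0
SQUEEZENET_MB = 5.0
SQUEEZENET_COMPRESSED_MB = 0.5  # 500 KB after Deep Compression

# How many times smaller SqueezeNet is than AlexNet, before and after compression.
ratio_uncompressed = ALEXNET_MB / SQUEEZENET_MB
ratio_compressed = ALEXNET_MB / SQUEEZENET_COMPRESSED_MB

print(f"uncompressed: about {ratio_uncompressed:.0f}x smaller")
print(f"compressed: about {ratio_compressed:.0f}x smaller")
```

With the rounded inputs the ratios come out at roughly 48 and 480, the same order as the 50x and 510x figures cited for the model.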

    Read more →
  • Iris flower data set

    Iris flower data set

    The Iris flower data set or Fisher's Iris data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus". The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish each species. Fisher's paper was published in the Annals of Eugenics (today the Annals of Human Genetics). == Use of the data set == Originally used as an example data set on which Fisher's linear discriminant analysis was applied, it became a typical test case for many statistical classification techniques in machine learning such as support vector machines. The use of this data set in cluster analysis however is not common, since the data set only contains two clusters with rather obvious separation. One of the clusters contains Iris setosa, while the other cluster contains both Iris virginica and Iris versicolor and is not separable without the species information Fisher used. This makes the data set a good example to explain the difference between supervised and unsupervised techniques in data mining: Fisher's linear discriminant model can only be obtained when the object species are known: class labels and clusters are not necessarily the same. 
Nevertheless, all three species of Iris are separable in the projection on the nonlinear and branching principal component. The data set is approximated by the closest tree with some penalty for the excessive number of nodes, bending and stretching. Then the so-called "metro map" is constructed. The data points are projected into the closest node. For each node the pie diagram of the projected points is prepared. The area of the pie is proportional to the number of the projected points. It is clear from the diagram (left) that the absolute majority of the samples of the different Iris species belong to the different nodes. Only a small fraction of Iris-virginica is mixed with Iris-versicolor (the mixed blue-green nodes in the diagram). Therefore, the three species of Iris (Iris setosa, Iris virginica and Iris versicolor) are separable by the unsupervised procedures of nonlinear principal component analysis. To discriminate them, it is sufficient just to select the corresponding nodes on the principal tree. == Data set == The data set contains a set of 150 records under five attributes: sepal length, sepal width, petal length, petal width and species. The iris data set is widely used as a beginner's data set for machine learning purposes. The data set is included in R base and Python in the machine learning library scikit-learn, so that users can access it without having to find a source for it. Several versions of the data set have been published. === R code illustrating usage === The example R code shown below reproduces the scatterplot displayed at the top of this article: === Python code illustrating usage === This code gives:
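A minimal Python sketch of accessing the data set through scikit-learn, assuming the library is installed. It is an illustration written for this page rather than the article's original listing, and the variable names are arbitrary.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Load the bundled copy of the data set; no download is needed.
iris = load_iris()
X, y = iris.data, iris.target

# 150 records with four numeric features each.
print(X.shape)
print(iris.feature_names)

# Count the samples per species by mapping integer labels back to names.
species_counts = Counter(str(iris.target_names[label]) for label in y)
print(dict(species_counts))
```

The species is stored as an integer label in `iris.target`, with the corresponding names in `iris.target_names`, so the fifth attribute mentioned above is kept separate from the four measurements.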

    Read more →