Information gain (decision tree)

Information gain (decision tree)

In the context of decision trees in information theory and machine learning, information gain refers to the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one. (In broader contexts, information gain can also be used as a synonym for either Kullback–Leibler divergence or mutual information, but the focus of this article is on the more narrow meaning below.) Explicitly, the information gain of a random variable X {\displaystyle X} obtained from an observation of a random variable A {\displaystyle A} taking value a {\displaystyle a} is defined as: I G ( X , a ) = D KL ( P X ∣ a ∥ P X ) {\displaystyle {\mathit {IG}}(X,a)=D_{\text{KL}}{\bigl (}P_{X\mid a}\parallel P_{X}{\bigr )}} In other words, it is the Kullback–Leibler divergence of P X ( x ) {\displaystyle P_{X}(x)} (the prior distribution for X {\displaystyle X} ) from P X ∣ a ( x ) {\displaystyle P_{X\mid a}(x)} (the posterior distribution for X {\displaystyle X} given A = a {\displaystyle A=a} ). The expected value of the information gain is the mutual information I ( X ; A ) {\displaystyle I(X;A)} : E A ⁡ [ I G ( X , A ) ] = I ( X ; A ) {\displaystyle \operatorname {E} _{A}[{\mathit {IG}}(X,A)]=I(X;A)} i.e. the reduction in the entropy of X {\displaystyle X} achieved by learning the state of the random variable A {\displaystyle A} . In machine learning, this concept can be used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree, and when applied in the area of machine learning is known as decision tree learning. Usually an attribute with high mutual information should be preferred to other attributes. == General definition == In general terms, the expected information gain is the reduction in information entropy Η from a prior state to a state that takes some information as given: I G ( T , a ) = H ( T ) − H ( T | a ) , {\displaystyle IG(T,a)=\mathrm {H} {(T)}-\mathrm {H} {(T|a)},} where H ( T | a ) {\displaystyle \mathrm {H} {(T|a)}} is the conditional entropy of T {\displaystyle T} given the value of attribute a {\displaystyle a} . This is intuitively plausible when interpreting entropy Η as a measure of uncertainty of a random variable T {\displaystyle T} : by learning (or assuming) a {\displaystyle a} about T {\displaystyle T} , our uncertainty about T {\displaystyle T} is reduced (i.e. I G ( T , a ) {\displaystyle IG(T,a)} is positive), unless of course T {\displaystyle T} is independent of a {\displaystyle a} , in which case H ( T | a ) = H ( T ) {\displaystyle \mathrm {H} (T|a)=\mathrm {H} (T)} , meaning I G ( T , a ) = 0 {\displaystyle IG(T,a)=0} . == Formal definition == Let T denote a set of training examples, each of the form ( x , y ) = ( x 1 , x 2 , x 3 , . . . , x k , y ) {\displaystyle ({\textbf {x}},y)=(x_{1},x_{2},x_{3},...,x_{k},y)} where x a ∈ v a l s ( a ) {\displaystyle x_{a}\in \mathrm {vals} (a)} is the value of the a th {\displaystyle a^{\text{th}}} attribute or feature of example x {\displaystyle {\textbf {x}}} and y is the corresponding class label. The information gain for an attribute a is defined in terms of Shannon entropy H ( − ) {\displaystyle \mathrm {H} (-)} as follows. For a value v taken by attribute a, let S a ( v ) = { x ∈ T | x a = v } {\displaystyle S_{a}{(v)}=\{{\textbf {x}}\in T|x_{a}=v\}} be defined as the set of training inputs of T for which attribute a is equal to v. Then the information gain of T for attribute a is the difference between the a priori Shannon entropy H ( T ) {\displaystyle \mathrm {H} (T)} of the training set and the conditional entropy H ( T | a ) {\displaystyle \mathrm {H} {(T|a)}} . H ( T | a ) = ∑ v ∈ v a l s ( a ) | S a ( v ) | | T | ⋅ H ( S a ( v ) ) . {\displaystyle \mathrm {H} (T|a)=\sum _{v\in \mathrm {vals} (a)}{{\frac {|S_{a}{(v)}|}{|T|}}\cdot \mathrm {H} \left(S_{a}{\left(v\right)}\right)}.} I G ( T , a ) = H ( T ) − H ( T | a ) {\displaystyle IG(T,a)=\mathrm {H} (T)-\mathrm {H} (T|a)} The mutual information is equal to the total entropy for an attribute if for each of the attribute values a unique classification can be made for the result attribute. In this case, the relative entropies subtracted from the total entropy are 0. In particular, the values v ∈ v a l s ( a ) {\displaystyle v\in vals(a)} defines a partition of the training set data T into mutually exclusive and all-inclusive subsets, inducing a categorical probability distribution P a ( v ) {\textstyle P_{a}{(v)}} on the values v ∈ v a l s ( a ) {\textstyle v\in vals(a)} of attribute a. The distribution is given P a ( v ) := | S a ( v ) | | T | {\textstyle P_{a}{(v)}:={\frac {|S_{a}{(v)}|}{|T|}}} . In this representation, the information gain of T given a can be defined as the difference between the unconditional Shannon entropy of T and the expected entropy of T conditioned on a, where the expectation value is taken with respect to the induced distribution on the values of a. I G ( T , a ) = H ( T ) − ∑ v ∈ v a l s ( a ) P a ( v ) H ( S a ( v ) ) = H ( T ) − E P a [ H ( S a ( v ) ) ] = H ( T ) − H ( T | a ) . {\displaystyle {\begin{alignedat}{2}IG(T,a)&=\mathrm {H} (T)-\sum _{v\in \mathrm {vals} (a)}{P_{a}{(v)}\mathrm {H} \left(S_{a}{(v)}\right)}\\&=\mathrm {H} (T)-\mathbb {E} _{P_{a}}{\left[\mathrm {H} {(S_{a}{(v)})}\right]}\\&=\mathrm {H} (T)-\mathrm {H} {(T|a)}.\end{alignedat}}} == Example == In engineering applications, information is analogous to signal, and entropy is analogous to noise. It determines how a decision tree chooses to split data. The leftmost figure below is very impure and has high entropy corresponding to higher disorder and lower information value. As we go to the right, the entropy decreases, and the information value increases. Now, it is clear that information gain is the measure of how much information a feature provides about a class. Let's visualize information gain in a decision tree as shown in the right: The node t is the parent node, and the sub-nodes tL and tR are child nodes. In this case, the parent node t has a collection of cancer and non-cancer samples denoted as C and NC respectively. We can use information gain to determine how good the splitting of nodes is in a decision tree. In terms of entropy, information gain is defined as: To understand this idea, let's start by an example in which we create a simple dataset and want to see if gene mutations could be related to patients with cancer. Given four different gene mutations, as well as seven samples, the training set for a decision can be created as follows: In this dataset, a 1 means the sample has the mutation (True), while a 0 means the sample does not (False). A sample with C denotes that it has been confirmed to be cancerous, while NC means it is non-cancerous. Using this data, a decision tree can be created with information gain used to determine the candidate splits for each node. For the next step, the entropy at parent node t of the above simple decision tree is computed as:H(t) = −[pC,t log2(pC,t) + pNC,t log2(pNC,t)] where, probability of selecting a class ‘C’ sample at node t, pC,t = n(t, C) / n(t), probability of selecting a class ‘NC’ sample at node t, pNC,t = n(t, NC) / n(t), n(t), n(t, C), and n(t, NC) are the number of total samples, ‘C’ samples and ‘NC’ samples at node t respectively.Using this with the example training set, the process for finding information gain beginning with H ( t ) {\displaystyle \mathrm {H} {(t)}} for Mutation 1 is as follows: pC, t = 4/7 pNC, t = 3/7 H ( t ) {\displaystyle \mathrm {H} {(t)}} = −(4/7 × log2(4/7) + 3/7 × log2(3/7)) = 0.985 Note: H ( t ) {\displaystyle \mathrm {H} {(t)}} will be the same for all mutations at the root. The relatively high value of entropy H ( t ) = 0.985 {\displaystyle \mathrm {H} {(t)}=0.985} (1 is the optimal value) suggests that the root node is highly impure and the constituents of the input at the root node would look like the leftmost figure in the above Entropy Diagram. However, such a set of data is good for learning the attributes of the mutations used to split the node. At a certain node, when the homogeneity of the constituents of the input occurs (as shown in the rightmost figure in the above Entropy Diagram), the dataset would no longer be good for learning. Moving on, the entropy at left and right child nodes of the above decision tree is computed using the formulae:H(tL) = −[pC,L log2(pC,L) + pNC,L log2(pNC,L)]H(tR) = −[pC,R log2(pC,R) + pNC,R log2(pNC,R)]where, probability of selecting a class ‘C’ sample at the left child node, pC,L = n(tL, C) / n(tL), probability of selecting a class ‘NC’ sample at the left child node, pNC,L = n(tL, NC) / n(tL), probability of selecting a class ‘C’ sample at the right child node, pC,R = n(tR, C) / n(tR), prob

Misskey

Misskey (Japanese: ミスキー, romanized: Misukī) is an open source, federated, social networking service created in 2014 by Japanese software engineer Eiji "syuilo" Shinoda. Misskey uses the ActivityPub protocol for federation, allowing users to interact between independent Misskey instances, and other ActivityPub compatible platforms. Misskey is generally considered to be part of the Fediverse. Despite being a decentralized service, Misskey is not philosophically opposed to centralization. The name Misskey comes from the lyrics of Brain Diver, a song by the Japanese singer May'n. == History == Misskey was initially developed as a BBS-style internet forum by high school student Eiji Shinoda in 2014. After introducing a timeline feature, Misskey gained popularity as the microblogging platform it is today. In 2018, Misskey added support for ActivityPub, becoming a federated social media platform. The flagship Misskey server, Misskey.io, was started on April 15, 2019. Misskey, alongside Mastodon and Bluesky, has received attention as a potential replacement for Twitter following Twitter's acquisition by Elon Musk in 2022. On April 8, 2023, Misskey.io incorporated as MisskeyHQ K.K. As of February 2024, over 450,000 users were registered, making it the largest instance of Misskey. Misskey.io is crowdfunded. The administrator of Misskey.io is Japanese system administrator Yoshiki Eto, who operates under the alias Murakami-san. Eiji Shinoda serves as director. In July 2023, Twitter introduced extreme restrictions on their API in order to combat scraping from bots. Some users were critical of the changes, and as a result migrated to other social networks. The number of users registering on Misskey.io, Misskey's official instance and the largest one, increased rapidly, with other Misskey instances also receiving a spike in signups. In response to this trend, Skeb, a platform for sharing art, announced on July 14, 2023 that it would sponsor the Misskey development team. In early 2024, Misskey was targeted by a spam attack from Japan. The cause of the attack is believed to be a dispute between rival groups on a Japanese hacker forum and a DDoS attack on a Discord bot. Mastodon instances with open registration were used in the attack. In November 2025, Eto announced intentions to replace ActivityPub with Misskey's own low-overhead federation system in "a few years". Shinoda later said that this was "fake news". == Development == Misskey is open source software and is licensed under the AGPLv3. The Misskey API is publicly available and is documented using the OpenAPI Specification, which allows users to build automated accounts and use it on any Misskey instance. The service is translated using Crowdin. Misskey is developed using Node.js. TypeScript is used on both the frontend and backend. PostgreSQL is used as its database. Vue.js is used for the frontend. == Functionality == Posts on Misskey are called "notes". Notes are limited to a maximum of 3,000 characters (a limit which can be customized by instances), and can be accompanied by any file, including polls, images, videos, and audio. Notes can be reposted, either by themselves or with another "quote" note. Misskey comes with multiple timelines to sort through the notes that an instance has available, and are displayed in reverse chronological order. The Home timeline shows notes from users that you follow, the Local timeline shows all notes from the instance in use, the Social timeline shows both the Home and Local timeline, and the Global timeline shows every public note that the instance knows about. Notes have customizable privacy settings to control what users can see a note, similar to Mastodon's post visibility ranges. Public notes show up on all timelines, while Home notes only show on a user's Home timeline. Notes can also be set to be available only for followers. Direct messages using notes can be sent to users.

Structural risk minimization

Structural risk minimization (SRM) is an inductive principle of use in machine learning. Commonly in machine learning, a generalized model must be selected from a finite data set, with the consequent problem of overfitting – the model becoming too strongly tailored to the particularities of the training set and generalizing poorly to new data. The SRM principle addresses this problem by balancing the model's complexity against its success at fitting the training data. This principle was first set out in a 1974 book by Vladimir Vapnik and Alexey Chervonenkis and uses the VC dimension. In practical terms, Structural Risk Minimization is implemented by minimizing E t r a i n + β H ( W ) {\displaystyle E_{train}+\beta H(W)} , where E t r a i n {\displaystyle E_{train}} is the train error, the function H ( W ) {\displaystyle H(W)} is called a regularization function, and β {\displaystyle \beta } is a constant. H ( W ) {\displaystyle H(W)} is chosen such that it takes large values on parameters W {\displaystyle W} that belong to high-capacity subsets of the parameter space. Minimizing H ( W ) {\displaystyle H(W)} in effect limits the capacity of the accessible subsets of the parameter space, thereby controlling the trade-off between minimizing the training error and minimizing the expected gap between the training error and test error. The SRM problem can be formulated in terms of data. Given n data points consisting of data x and labels y, the objective J ( θ ) {\displaystyle J(\theta )} is often expressed in the following manner: J ( θ ) = 1 2 n ∑ i = 1 n ( h θ ( x i ) − y i ) 2 + λ 2 ∑ j = 1 d θ j 2 {\displaystyle J(\theta )={\frac {1}{2n}}\sum _{i=1}^{n}(h_{\theta }(x^{i})-y^{i})^{2}+{\frac {\lambda }{2}}\sum _{j=1}^{d}\theta _{j}^{2}} The first term is the mean squared error (MSE) term between the value of the learned model, h θ {\displaystyle h_{\theta }} , and the given labels y {\displaystyle y} . This term is the training error, E t r a i n {\displaystyle E_{train}} , that was discussed earlier. The second term, places a prior over the weights, to favor sparsity and penalize larger weights. The trade-off coefficient, λ {\displaystyle \lambda } , is a hyperparameter that places more or less importance on the regularization term. Larger λ {\displaystyle \lambda } encourages sparser weights at the expense of a more optimal MSE, and smaller λ {\displaystyle \lambda } relaxes regularization allowing the model to fit to data. Note that as λ → ∞ {\displaystyle \lambda \to \infty } the weights become zero, and as λ → 0 {\displaystyle \lambda \to 0} , the model typically suffers from overfitting.

POP-11

POP-11 is a reflective, incrementally compiled programming language with many of the features of an interpreted language. It is the core language of the Poplog programming environment developed originally by the University of Sussex, and recently in the School of Computer Science at the University of Birmingham, which hosts the main Poplog website. POP-11 is an evolution of the language POP-2, developed in Edinburgh University, and features an open stack model (like Forth, among others). It is mainly procedural, but supports declarative language constructs, including a pattern matcher, and is mostly used for research and teaching in artificial intelligence, although it has features sufficient for many other classes of problems. It is often used to introduce symbolic programming techniques to programmers of more conventional languages like Pascal, who find POP syntax more familiar than that of Lisp. One of POP-11's features is that it supports first-class functions. POP-11 is the core language of the Poplog system. The availability of the compiler and compiler subroutines at run-time (a requirement for incremental compiling) gives it the ability to support a far wider range of extensions (including run-time extensions, such as adding new data-types) than would be possible using only a macro facility. This made it possible for (optional) incremental compilers to be added for Prolog, Common Lisp and Standard ML, which could be added as required to support either mixed language development or development in the second language without using any POP-11 constructs. This made it possible for Poplog to be used by teachers, researchers, and developers who were interested in only one of the languages. The most successful product developed in POP-11 was the Clementine data mining system, developed by ISL. After SPSS bought ISL, they renamed Clementine to SPSS Modeler and decided to port it to C++ and Java, and eventually succeeded with great effort, and perhaps some loss of the flexibility provided by the use of an AI language. POP-11 was for a time available only as part of an expensive commercial package (Poplog), but since about 1999 it has been freely available as part of the open-source software version of Poplog, including various added packages and teaching libraries. An online version of ELIZA using POP-11 is available at Birmingham. At the University of Sussex, David Young used POP-11 in combination with C and Fortran to develop a suite of teaching and interactive development tools for image processing and vision, and has made them available in the Popvision extension to Poplog. == Simple code examples == Here is an example of a simple POP-11 program: define Double(Source) -> Result; Source2 -> Result; enddefine; Double(123) => That prints out: 246 This one includes some list processing: define RemoveElementsMatching(Element, Source) -> Result; lvars Index; [[% for Index in Source do unless Index = Element or Index matches Element then Index; endunless; endfor; %]] -> Result; enddefine; RemoveElementsMatching("the", [[the cat sat on the mat]]) => ;;; outputs [[cat sat on mat]] RemoveElementsMatching("the", [[the cat] [sat on] the mat]) => ;;; outputs [[the cat] [sat on] mat] RemoveElementsMatching([[= cat]], [[the cat]] is a [[big cat]]) => ;;; outputs [[is a]] Examples using the POP-11 pattern matcher, which makes it relatively easy for students to learn to develop sophisticated list-processing programs without having to treat patterns as tree structures accessed by 'head' and 'tail' functions (CAR and CDR in Lisp), can be found in the online introductory tutorial. The matcher is at the heart of the SimAgent (sim_agent) toolkit. Some of the powerful features of the toolkit, such as linking pattern variables to inline code variables, would have been very difficult to implement without the incremental compiler facilities.

Hybrid intelligent system

Hybrid intelligent system denotes a software system which employs, in parallel, a combination of methods and techniques from artificial intelligence subfields, such as: Neuro-symbolic systems Neuro-fuzzy systems Hybrid connectionist-symbolic models Fuzzy expert systems Connectionist expert systems Evolutionary neural networks Genetic fuzzy systems Rough fuzzy hybridization Reinforcement learning with fuzzy, neural, or evolutionary methods as well as symbolic reasoning methods. From the cognitive science perspective, every natural intelligent system is hybrid because it performs mental operations on both the symbolic and subsymbolic levels. For the past few years, there has been an increasing discussion of the importance of A.I. Systems Integration. Based on notions that there have already been created simple and specific AI systems (such as systems for computer vision, speech synthesis, etc., or software that employs some of the models mentioned above) and now is the time for integration to create broad AI systems. Proponents of this approach are researchers such as Marvin Minsky, Ron Sun, Aaron Sloman, Angelo Dalli and Michael A. Arbib. An example hybrid is a hierarchical control system in which the lowest, reactive layers are sub-symbolic. The higher layers, having relaxed time constraints, are capable of reasoning from an abstract world model and performing planning (even by hybrid wisdom). Intelligent systems usually rely on hybrid reasoning processes, which include induction, deduction, abduction and reasoning by analogy.

2024 National Public Data breach

In August 2024, three class-action lawsuits were filed against National Public Data along with over 14 complaints filed in federal court, claiming that the company permitted hackers to steal sensitive private information covering millions of individuals. The theft was alleged to have occurred in April 2024. One of the lawsuits specifically claims that in April, a hacker going by the moniker "USDoD" posted a notice on the dark web, offering the data for sale at the price of US$3.5 million. The information stolen is alleged to include 2.9 billion records containing full names, current and past addresses, Social Security numbers, dates of birth, and telephone numbers. The stolen data contains records for people in the US, UK, and Canada. National Public Data confirmed on August 16, 2024, there was a breach originating from someone trying to breach their systems since December 2023, with the breach occurring from April 2024 and over the next few months. The company also confirmed that 2.9 billion records were obtained, though they were still working to determine how many people were affected by the breach, and were working with law enforcement to identify the hacker. == Jerico Pictures == Jerico Pictures, Inc., doing business as National Public Data, was a data broker company that performed employee background checks. Their primary service was collecting information from public data sources, including criminal records, addresses, and employment history, and offering that information for sale. On October 2, 2024, Jerico Pictures filed for Chapter 11 bankruptcy as it currently faces over a dozen lawsuits over the breach, and is potentially liable "for credit monitoring for hundreds of millions of potentially impacted individuals." In December 2024, National Public Data shut down, showing a closure notice on its website.

Algorithm selection

Algorithm selection (sometimes also called per-instance algorithm selection or offline algorithm selection) is a meta-algorithmic technique to choose an algorithm from a portfolio on an instance-by-instance basis. It is motivated by the observation that on many practical problems, different algorithms have different performance characteristics. That is, while one algorithm performs well in some scenarios, it performs poorly in others and vice versa for another algorithm. If we can identify when to use which algorithm, we can optimize for each scenario and improve overall performance. This is what algorithm selection aims to do. The only prerequisite for applying algorithm selection techniques is that there exists (or that there can be constructed) a set of complementary algorithms. == Definition == Given a portfolio P {\displaystyle {\mathcal {P}}} of algorithms A ∈ P {\displaystyle {\mathcal {A}}\in {\mathcal {P}}} , a set of instances i ∈ I {\displaystyle i\in {\mathcal {I}}} and a cost metric m : P × I → R {\displaystyle m:{\mathcal {P}}\times {\mathcal {I}}\to \mathbb {R} } , the algorithm selection problem consists of finding a mapping s : I → P {\displaystyle s:{\mathcal {I}}\to {\mathcal {P}}} from instances I {\displaystyle {\mathcal {I}}} to algorithms P {\displaystyle {\mathcal {P}}} such that the cost ∑ i ∈ I m ( s ( i ) , i ) {\displaystyle \sum _{i\in {\mathcal {I}}}m(s(i),i)} across all instances is optimized. == Examples == === Boolean satisfiability problem (and other hard combinatorial problems) === A well-known application of algorithm selection is the Boolean satisfiability problem. Here, the portfolio of algorithms is a set of (complementary) SAT solvers, the instances are Boolean formulas, the cost metric is for example average runtime or number of unsolved instances. So, the goal is to select a well-performing SAT solver for each individual instance. In the same way, algorithm selection can be applied to many other N P {\displaystyle {\mathcal {NP}}} -hard problems (such as mixed integer programming, CSP, AI planning, TSP, MAXSAT, QBF and answer set programming). Competition-winning systems in SAT are SATzilla, 3S and CSHC === Machine learning === In machine learning, algorithm selection is better known as meta-learning. The portfolio of algorithms consists of machine learning algorithms (e.g., Random Forest, SVM, DNN), the instances are data sets and the cost metric is for example the error rate. So, the goal is to predict which machine learning algorithm will have a small error on each data set. == Instance features == The algorithm selection problem is mainly solved with machine learning techniques. By representing the problem instances by numerical features f {\displaystyle f} , algorithm selection can be seen as a multi-class classification problem by learning a mapping f i ↦ A {\displaystyle f_{i}\mapsto {\mathcal {A}}} for a given instance i {\displaystyle i} . Instance features are numerical representations of instances. For example, we can count the number of variables, clauses, average clause length for Boolean formulas, or number of samples, features, class balance for ML data sets to get an impression about their characteristics. === Static vs. probing features === We distinguish between two kinds of features: Static features are in most cases some counts and statistics (e.g., clauses-to-variables ratio in SAT). These features ranges from very cheap features (e.g. number of variables) to very complex features (e.g., statistics about variable-clause graphs). Probing features (sometimes also called landmarking features) are computed by running some analysis of algorithm behavior on an instance (e.g., accuracy of a cheap decision tree algorithm on an ML data set, or running for a short time a stochastic local search solver on a Boolean formula). These feature often cost more than simple static features. === Feature costs === Depending on the used performance metric m {\displaystyle m} , feature computation can be associated with costs. For example, if we use running time as performance metric, we include the time to compute our instance features into the performance of an algorithm selection system. SAT solving is a concrete example, where such feature costs cannot be neglected, since instance features for CNF formulas can be either very cheap (e.g., to get the number of variables can be done in constant time for CNFs in the DIMACs format) or very expensive (e.g., graph features which can cost tens or hundreds of seconds). It is important to take the overhead of feature computation into account in practice in such scenarios; otherwise a misleading impression of the performance of the algorithm selection approach is created. For example, if the decision which algorithm to choose can be made with perfect accuracy, but the features are the running time of the portfolio algorithms, there is no benefit to the portfolio approach. This would not be obvious if feature costs were omitted. == Approaches == === Regression approach === One of the first successful algorithm selection approaches predicted the performance of each algorithm m ^ A : I → R {\displaystyle {\hat {m}}_{\mathcal {A}}:{\mathcal {I}}\to \mathbb {R} } and selected the algorithm with the best predicted performance a r g min A ∈ P m ^ A ( i ) {\displaystyle arg\min _{{\mathcal {A}}\in {\mathcal {P}}}{\hat {m}}_{\mathcal {A}}(i)} for an instance i {\displaystyle i} . === Clustering approach === A common assumption is that the given set of instances I {\displaystyle {\mathcal {I}}} can be clustered into homogeneous subsets and for each of these subsets, there is one well-performing algorithm for all instances in there. So, the training consists of identifying the homogeneous clusters via an unsupervised clustering approach and associating an algorithm with each cluster. A new instance is assigned to a cluster and the associated algorithm selected. A more modern approach is cost-sensitive hierarchical clustering using supervised learning to identify the homogeneous instance subsets. === Pairwise cost-sensitive classification approach === A common approach for multi-class classification is to learn pairwise models between every pair of classes (here algorithms) and choose the class that was predicted most often by the pairwise models. We can weight the instances of the pairwise prediction problem by the performance difference between the two algorithms. This is motivated by the fact that we care most about getting predictions with large differences correct, but the penalty for an incorrect prediction is small if there is almost no performance difference. Therefore, each instance i {\displaystyle i} for training a classification model A 1 {\displaystyle {\mathcal {A}}_{1}} vs A 2 {\displaystyle {\mathcal {A}}_{2}} is associated with a cost | m ( A 1 , i ) − m ( A 2 , i ) | {\displaystyle |m({\mathcal {A}}_{1},i)-m({\mathcal {A}}_{2},i)|} . == Requirements == The algorithm selection problem can be effectively applied under the following assumptions: The portfolio P {\displaystyle {\mathcal {P}}} of algorithms is complementary with respect to the instance set I {\displaystyle {\mathcal {I}}} , i.e., there is no single algorithm A ∈ P {\displaystyle {\mathcal {A}}\in {\mathcal {P}}} that dominates the performance of all other algorithms over I {\displaystyle {\mathcal {I}}} (see figures to the right for examples on complementary analysis). In some application, the computation of instance features is associated with a cost. For example, if the cost metric is running time, we have also to consider the time to compute the instance features. In such cases, the cost to compute features should not be larger than the performance gain through algorithm selection. == Application domains == Algorithm selection is not limited to single domains but can be applied to any kind of algorithm if the above requirements are satisfied. Application domains include: hard combinatorial problems: SAT, Mixed Integer Programming, CSP, AI Planning, TSP, MAXSAT, QBF and Answer Set Programming combinatorial auctions in machine learning, the problem is known as meta-learning software design black-box optimization multi-agent systems numerical optimization linear algebra, differential equations evolutionary algorithms vehicle routing problem power systems For an extensive list of literature about algorithm selection, we refer to a literature overview. == Variants of algorithm selection == === Online selection === Online algorithm selection refers to switching between different algorithms during the solving process. This is useful as a hyper-heuristic. In contrast, offline algorithm selection selects an algorithm for a given instance only once and before the solving process. === Computation of schedules === An extension of algorithm selection is the per-instance algorithm scheduling problem, in which we do not select only one solver, but we select a time budget for each algorithm