AI For Young Learners Pdf

AI For Young Learners Pdf — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Trazzler

    Trazzler

    Trazzler is a travel destination app that specializes in unique and local destinations. The initial concept was developed by Adam Rugel and Biz Stone in 2006 at Twitter's original offices under the name "71 miles". More than 10,000 writers and photographers have contributed and more than $350,000 in freelance contracts have been issued as a result of Trazzeler's weekly writing and photography contests. Investors in the company include SV Angel, AOL Founder Steve Case, and the Twitter founders, Evan Williams, Jack Dorsey, and Biz Stone. The company's partners are the City of Chicago, Hawaii Tourism Authority, Fairmont Hotels & Resorts, Salon.com, and Air New Zealand. Trazzler is designed for use on the iOS, Android, and Facebook.

    Read more →
  • Multinomial logistic regression

    Multinomial logistic regression

    In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.). Multinomial logistic regression is known by a variety of other names, including polytomous LR, multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model. == Background == Multinomial logistic regression is used when the dependent variable in question is nominal (equivalently categorical, meaning that it falls into any one of a set of categories that cannot be ordered in any meaningful way) and for which there are more than two categories. Some examples would be: Which major will a college student choose, given their grades, stated likes and dislikes, etc.? Which blood type does a person have, given the results of various diagnostic tests? In a hands-free mobile phone dialing application, which person's name was spoken, given various properties of the speech signal? Which candidate will a person vote for, given particular demographic characteristics? Which country will a firm locate an office in, given the characteristics of the firm and of the various candidate countries? These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items that cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. Multinomial logistic regression is a particular solution to classification problems that use a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken). == Assumptions == The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case. If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of K alternatives to be modeled as a set of K − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K − 1 compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a perfect substitute for a blue bus. If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. It is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance if one political candidate withdraws from a three candidate race). Other models like the nested logit or the multinomial probit may be used in such cases as they allow for violation of the IIA. == Model == === Introduction === There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model. The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that constructs a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product: score ⁡ ( X i , k ) = β k ⋅ X i , {\displaystyle \operatorname {score} (\mathbf {X} _{i},k)={\boldsymbol {\beta }}_{k}\cdot \mathbf {X} _{i},} where Xi is the vector of explanatory variables describing observation i, βk is a vector of weights (or regression coefficients) corresponding to outcome k, and score(Xi, k) is the score associated with assigning observation i to category k. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score. The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the probability of observation i choosing outcome k given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only 0.95 = 59% accuracy. If each submodel has 80% accuracy, then overall accuracy drops to 0.85 = 33% accuracy. This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue. === Setup === The basic setup is the same as in logistic regression, the only difference being that the dependent variables are categorical rather than binary, i.e. there are K possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the logistic regression article. ==== Data points ==== Specifically, it is assumed that we have a series of N observed data points. Each data point i (ranging from 1 to N) consists of a set of M explanatory variables x1,i ... xM,i (also known as independent variables, predictor variables, features, etc.), and an associated categorical outcome Yi (also known as dependent variable, response variable), which can take on one of K possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to K. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of N "experiments" — although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so tha

    Read more →
  • Aleph (ILP)

    Aleph (ILP)

    Aleph (A Learning Engine for Proposing Hypotheses) is an inductive logic programming system introduced by Ashwin Srinivasan in 2001. As of 2022 it is still one of the most widely used inductive logic programming systems. It is based on the earlier system Progol. == Learning task == The input to Aleph is background knowledge, specified as a logic program, a language bias in the form of mode declarations, as well as positive and negative examples specified as ground facts. As output it returns a logic program which, together with the background knowledge, entails all of the positive examples and none of the negative examples. == Basic algorithm == Starting with an empty hypothesis, Aleph proceeds as follows: It chooses a positive example to generalise; if none are left, it aborts and outputs the current hypothesis. Then it constructs the bottom clause, that is, the most specific clause that is allowed by the mode declarations and covers the example. It then searches for a generalisation of the bottom clause that scores better on the chosen metric. It then adds the new clause to the hypothesis program and removes all examples that are covered by the new clause. == Search algorithm == Aleph searches for clauses in a top-down manner, using the bottom clause constructed in the preceding step to bound the search from below. It searches the refinement graph in a breadth-first manner, with tunable parameters to bound the maximal clause size and proof depth. It scores each clause using one of 13 different evaluation metrics, as chosen in advance by the user.

    Read more →
  • Bayesian hierarchical modeling

    Bayesian hierarchical modeling

    Bayesian hierarchical modelling is a statistical model written in multiple levels (hierarchical form) that estimates the posterior distribution of model parameters using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. This integration enables calculation of updated posterior over the (hyper)parameters, effectively updating prior beliefs in light of the observed data. Frequentist statistics may yield conclusions seemingly incompatible with those offered by Bayesian statistics due to the Bayesian treatment of the parameters as random variables and its use of subjective information in establishing assumptions on these parameters. As the approaches answer different questions the formal results are not technically contradictory but the two approaches disagree over which answer is relevant to particular applications. Bayesians argue that relevant information regarding decision-making and updating beliefs cannot be ignored and that hierarchical modeling has the potential to overrule classical methods in applications where respondents give multiple observational data. Moreover, the model has proven to be robust, with the posterior distribution less sensitive to the more flexible hierarchical priors. Hierarchical modeling, as its name implies, retains nested data structure, and is used when information is available at several different levels of observational units. For example, in epidemiological modeling to describe infection trajectories for multiple countries, observational units are countries, and each country has its own time-based profile of daily infected cases. In decline curve analysis to describe oil or gas production decline curve for multiple wells, observational units are oil or gas wells in a reservoir region, and each well has each own time-based profile of oil or gas production rates (usually, barrels per month). Hierarchical modeling is used to devise computation based strategies for multiparameter problems. == Philosophy == Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies a dependence of the joint probability model for these parameters. Individual degrees of belief, expressed in the form of probabilities, come with uncertainty. Amidst this is the change of the degrees of belief over time. As was stated by Professor José M. Bernardo and Professor Adrian F. Smith, "The actuality of the learning process consists in the evolution of individual and subjective beliefs about the reality." These subjective probabilities are more directly involved in the mind rather than the physical probabilities. Hence, it is with this need of updating beliefs that Bayesians have formulated an alternative statistical model which takes into account the prior occurrence of a particular event. == Bayes' theorem == The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options. Suppose in a study of the effectiveness of cardiac treatments, with the patients in hospital j having survival probability θ j {\displaystyle \theta _{j}} , the survival probability will be updated with the occurrence of y, the event in which a controversial serum is created which, as believed by some, increases survival in cardiac patients. In order to make updated probability statements about θ j {\displaystyle \theta _{j}} , given the occurrence of event y, we must begin with a model providing a joint probability distribution for θ j {\displaystyle \theta _{j}} and y. This can be written as a product of the two distributions that are often referred to as the prior distribution P ( θ ) {\displaystyle P(\theta )} and the sampling distribution P ( y ∣ θ ) {\displaystyle P(y\mid \theta )} respectively: P ( θ , y ) = P ( θ ) P ( y ∣ θ ) {\displaystyle P(\theta ,y)=P(\theta )P(y\mid \theta )} Using the basic property of conditional probability, the posterior distribution will yield: P ( θ ∣ y ) = P ( θ , y ) P ( y ) = P ( y ∣ θ ) P ( θ ) P ( y ) {\displaystyle P(\theta \mid y)={\frac {P(\theta ,y)}{P(y)}}={\frac {P(y\mid \theta )P(\theta )}{P(y)}}} This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference which aims to deconstruct the probability, P ( θ ∣ y ) {\displaystyle P(\theta \mid y)} , relative to solvable subsets of its supportive evidence. == Exchangeability == The usual starting point of a statistical analysis is the assumption that the n values y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} are exchangeable. If no information – other than data y – is available to distinguish any of the θ j {\displaystyle \theta _{j}} 's from any others, and no ordering or grouping of the parameters can be made, one must assume symmetry of prior distribution parameters. This symmetry is represented probabilistically by exchangeability. Generally, it is useful and appropriate to model data from an exchangeable distribution as independently and identically distributed, given some unknown parameter vector θ {\displaystyle \theta } , with distribution P ( θ ) {\displaystyle P(\theta )} . === Finite exchangeability === For a fixed number n, the set y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} is exchangeable if the joint probability P ( y 1 , y 2 , … , y n ) {\displaystyle P(y_{1},y_{2},\ldots ,y_{n})} is invariant under permutations of the indices. That is, for every permutation π {\displaystyle \pi } or ( π 1 , π 2 , … , π n ) {\displaystyle (\pi _{1},\pi _{2},\ldots ,\pi _{n})} of (1, 2, …, n), P ( y 1 , y 2 , … , y n ) = P ( y π 1 , y π 2 , … , y π n ) . {\displaystyle P(y_{1},y_{2},\ldots ,y_{n})=P(y_{\pi _{1}},y_{\pi _{2}},\ldots ,y_{\pi _{n}}).} The following is an exchangeable, but not independent and identical (iid), example: Consider an urn with a red ball and a blue ball inside, with probability 1 2 {\displaystyle {\frac {1}{2}}} of drawing either. Balls are drawn without replacement, i.e. after one ball is drawn from the n {\displaystyle n} balls, there will be n − 1 {\displaystyle n-1} remaining balls left for the next draw. Let Y i = { 1 , if the i th ball is red , 0 , otherwise . {\displaystyle {\text{Let }}Y_{i}={\begin{cases}1,&{\text{if the }}i{\text{th ball is red}},\\0,&{\text{otherwise}}.\end{cases}}} The probability of selecting a red ball in the first draw and a blue ball in the second draw is equal to the probability of selecting a blue ball on the first draw and a red on the second, both of which are 1/2: P ( y 1 = 1 , y 2 = 0 ) = P ( y 1 = 0 , y 2 = 1 ) = 1 2 {\displaystyle P(y_{1}=1,y_{2}=0)=P(y_{1}=0,y_{2}=1)={\frac {1}{2}}} . This makes y 1 {\displaystyle y_{1}} and y 2 {\displaystyle y_{2}} exchangeable. But the probability of selecting a red ball on the second draw given that the red ball has already been selected in the first is 0. This is not equal to the probability that the red ball is selected in the second draw, which is 1/2: P ( y 2 = 1 ∣ y 1 = 1 ) = 0 ≠ P ( y 2 = 1 ) = 1 2 {\displaystyle P(y_{2}=1\mid y_{1}=1)=0\neq P(y_{2}=1)={\frac {1}{2}}} . Thus, y 1 {\displaystyle y_{1}} and y 2 {\displaystyle y_{2}} are not independent. If x 1 , … , x n {\displaystyle x_{1},\ldots ,x_{n}} are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true. === Infinite exchangeability === Infinite exchangeability is the property that every finite subset of an infinite sequence y 1 {\displaystyle y_{1}} , y 2 , … {\displaystyle y_{2},\ldots } is exchangeable. For any n, the sequence y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} is exchangeable. == Hierarchical models == === Components === Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely: Hyperparameters: parameters of the prior distribution Hyperpriors: distributions of Hyperparameters Suppose a random variable Y follows a normal distribution with parameter θ {\displaystyle \theta } as the mean and 1 as the variance, that is Y ∣ θ ∼ N ( θ , 1 ) {\displaystyle Y\mid \theta \sim N(\theta ,1)} . The tilde relation ∼ {\displaystyle \sim } can be read as "has the distribution of" or "is distributed as". Suppose also that the parameter θ {\displaystyle \theta } has a distribution given by a normal distribution with mean μ {\displaystyle \mu } and variance 1, i.e. θ ∣ μ ∼ N ( μ , 1 ) {\displaystyle \theta \mid \mu \sim N(\mu ,1)} . Furthermore, μ {\displaystyle \mu } follows another distribution given, for example, by the standard normal distribution, N ( 0 , 1 ) {\displaystyle {\text{N}}(0,1)} . The parameter μ {\dis

    Read more →
  • .ai

    .ai

    .ai is the Internet country code top-level domain (ccTLD) for Anguilla, a British Overseas Territory in the Caribbean. It is administered by the government of Anguilla. It is a popular domain hack with companies and projects related to the artificial intelligence industry (AI). Google's ad targeting treats .ai as a generic top-level domain (gTLD) because "users and website owners frequently see [the domain] as being more generic than country-targeted." In 2021, Google Search analyst Gary Illyes announced that ".ai" had been added to Google’s list of generic country-code top-level domains, meaning that Google would no longer infer Anguilla-specific targeting from the ccTLD. Identity Digital began managing the domain as of January 2025. == Second and third level registrations == Registrations within off.ai, com.ai, net.ai, and org.ai are available worldwide without restriction. From 15 September 2009, second level registrations within .ai are available to everyone worldwide. == Registration == The minimum registration term allowed for .ai domains is 2 through 10 years for registration and renewal, and a 2-year renewal for domain transfer. Identity Digital is the authority in charge of managing this extension. Registrations began on 16 February 1995. The limits on the number of characters used for the domain name are, at a minimum, from 1 to 3, depending on the registrar, and always at most 63 characters. The character set supported for .ai domain names includes A–Z, a–z, 0–9, and hyphen. As of November 2022, .ai domains cannot accommodate IDN characters. There are no requirements for registering a domain, including local and foreign residents. A .ai domain can be suspended or revoked, if the domain is involved in illegal activity such as violating trademarks or copyrights. Usage must not violate the laws of Anguilla. Anguilla uses the UDRP. Filing a UDRP challenge requires using one of the ICANN Approved Dispute Resolution Service Providers. If the domain is with an ICANN accredited registrar, they should work with the arbitrator. Usually this means either doing nothing or transferring a domain. .ai domains are transferable to any desired registrars as the registration of domain is done maintaining EPP. There used to be a whois.ai-based platform of expired domains in which those could be procured and auctioned every ten days through a standard online process. The last auctions of such kind closed there in December 2024; the platform had been scheduled for shutdown on 30 June 2025, but remained online in the months following that date. == Valuation == Domains cost depends on the registrar, with yearly fees ranging from US$140 (the base fee, as established by Anguilla) to $200. As of July 2025, the highest-valued .ai domain is an undisclosed one sold on 8 November 2023, on Escrow.com, for US$1,500,000—months after an initial $300,000 sale to the same buyer. Among the publicly disclosed ones, the most valued, fin.ai, was sold for $1,000,000 in March 2025. On 16 December 2017, the .ai registry started supporting the Extensible Provisioning Protocol (EPP) and migrated all of its domains onto an EPP system. Consequently, many registrars are allowed to sell .ai domains. Since that date, the .ai ccTLD has also been popular with artificial intelligence companies and organisations. Though such trends are primarily seen among new AI based companies or startups, many established AI and Tech companies preferred not to opt for .ai domains. For example, DeepMind has its domain retained at .com; Meta has redirected its facebook.ai domain to ai.meta.com. == Impact on Anguilla's economy == The registration fees earned from the .ai domains go to the treasury of the Government of Anguilla. As per a 2018 New York Times report, the total revenue generated out of selling .ai domains was $2.9 million. In 2023, Anguilla's government made about US$32 million from fees collected for registering .ai domains; that amounted to over 10% of gross domestic product for the territory. "In the years before the real breakthrough of AI, revenue from .ai domains made up less than 1% of our state income, by 2025 it will be around 47%," explained Jose Vanterpool, Minister of Infrastructure and Communications (MICUHITES), in an interview with BBC. The high 90% renewal rate of .ai domains and the 2025 renewal wave of domains registered in 2023 are driving another surge in state revenues, according to Domaintechnik.

    Read more →
  • Variational message passing

    Variational message passing

    Variational message passing (VMP) is an approximate inference technique for continuous- or discrete-valued Bayesian networks, with conjugate-exponential parents, developed by John Winn. VMP was developed as a means of generalizing the approximate variational methods used by such techniques as latent Dirichlet allocation, and works by updating an approximate distribution at each node through messages in the node's Markov blanket. == Likelihood lower bound == Given some set of hidden variables H {\displaystyle H} and observed variables V {\displaystyle V} , the goal of approximate inference is to maximize a lower-bound on the probability that a graphical model is in the configuration V {\displaystyle V} . Over some probability distribution Q {\displaystyle Q} (to be defined later), ln ⁡ P ( V ) = ∑ H Q ( H ) ln ⁡ P ( H , V ) P ( H | V ) = ∑ H Q ( H ) [ ln ⁡ P ( H , V ) Q ( H ) − ln ⁡ P ( H | V ) Q ( H ) ] {\displaystyle \ln P(V)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{P(H|V)}}=\sum _{H}Q(H){\Bigg [}\ln {\frac {P(H,V)}{Q(H)}}-\ln {\frac {P(H|V)}{Q(H)}}{\Bigg ]}} . So, if we define our lower bound to be L ( Q ) = ∑ H Q ( H ) ln ⁡ P ( H , V ) Q ( H ) {\displaystyle L(Q)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{Q(H)}}} , then the likelihood is simply this bound plus the relative entropy between P {\displaystyle P} and Q {\displaystyle Q} . Because the relative entropy is non-negative, the function L {\displaystyle L} defined above is indeed a lower bound of the log likelihood of our observation V {\displaystyle V} . The distribution Q {\displaystyle Q} will have a simpler character than that of P {\displaystyle P} because marginalizing over P {\displaystyle P} is intractable for all but the simplest of graphical models. In particular, VMP uses a factorized distribution Q ( H ) = ∏ i Q i ( H i ) , {\displaystyle Q(H)=\prod _{i}Q_{i}(H_{i}),} where H i {\displaystyle H_{i}} is a disjoint part of the graphical model. == Determining the update rule == The likelihood estimate needs to be as large as possible; because it's a lower bound, getting closer log ⁡ P {\displaystyle \log P} improves the approximation of the log likelihood. By substituting in the factorized version of Q {\displaystyle Q} , L ( Q ) {\displaystyle L(Q)} , parameterized over the hidden nodes H i {\displaystyle H_{i}} as above, is simply the negative relative entropy between Q j {\displaystyle Q_{j}} and Q j ∗ {\displaystyle Q_{j}^{}} plus other terms independent of Q j {\displaystyle Q_{j}} if Q j ∗ {\displaystyle Q_{j}^{}} is defined as Q j ∗ ( H j ) = 1 Z e E − j { ln ⁡ P ( H , V ) } {\displaystyle Q_{j}^{}(H_{j})={\frac {1}{Z}}e^{\mathbb {E} _{-j}\{\ln P(H,V)\}}} , where E − j { ln ⁡ P ( H , V ) } {\displaystyle \mathbb {E} _{-j}\{\ln P(H,V)\}} is the expectation over all distributions Q i {\displaystyle Q_{i}} except Q j {\displaystyle Q_{j}} . Thus, if we set Q j {\displaystyle Q_{j}} to be Q j ∗ {\displaystyle Q_{j}^{}} , the bound L {\displaystyle L} is maximized. == Messages in variational message passing == Parents send their children the expectation of their sufficient statistic while children send their parents their natural parameter, which also requires messages to be sent from the co-parents of the node. == Relationship to exponential families == Because all nodes in VMP come from exponential families and all parents of nodes are conjugate to their children nodes, the expectation of the sufficient statistic can be computed from the normalization factor. == VMP algorithm == The algorithm begins by computing the expected value of the sufficient statistics for that vector. Then, until the likelihood converges to a stable value (this is usually accomplished by setting a small threshold value and running the algorithm until it increases by less than that threshold value), do the following at each node: Get all messages from parents. Get all messages from children (this might require the children to get messages from the co-parents). Compute the expected value of the nodes sufficient statistics. == Constraints == Because every child must be conjugate to its parent, this has limited the types of distributions that can be used in the model. For example, the parents of a Gaussian distribution must be a Gaussian distribution (corresponding to the Mean) and a gamma distribution (corresponding to the precision, or one over σ {\displaystyle \sigma } in more common parameterizations). Discrete variables can have Dirichlet parents, and Poisson and exponential nodes must have gamma parents. More recently, VMP has been extended to handle models that violate this conditional conjugacy constraint. == Literature == John Winn; Christopher M. Bishop (2005). "Variational Message Passing" (PDF). Journal of Machine Learning Research. 6: 661–694. ISSN 1533-7928. Wikidata Q139488859. Beal, M.J. (2003). Variational Algorithms for Approximate Bayesian Inference (PDF) (PhD). Gatsby Computational Neuroscience Unit, University College London. Archived from the original (PDF) on 2005-04-28. Retrieved 2007-02-15.

    Read more →
  • Dynamic time warping

    Dynamic time warping

    In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data — indeed, any data that can be turned into a one-dimensional sequence can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restriction and rules: Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match) The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match) The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if j > i {\displaystyle j>i} are indices from the first sequence, then there must not be two indices l > k {\displaystyle l>k} in the other sequence, such that index i {\displaystyle i} is matched with index l {\displaystyle l} and index j {\displaystyle j} is matched with index k {\displaystyle k} , and vice versa We can plot each match between the sequences 1 : M {\displaystyle 1:M} and 1 : N {\displaystyle 1:N} as a path in a M × N {\displaystyle M\times N} matrix from ( 1 , 1 ) {\displaystyle (1,1)} to ( M , N ) {\displaystyle (M,N)} , such that each step is one of ( 0 , 1 ) , ( 1 , 0 ) , ( 1 , 1 ) {\displaystyle (0,1),(1,0),(1,1)} . In this formulation, we see that the number of possible matches is the Delannoy number. The optimal match is denoted by the match that satisfies all the restrictions and the rules and that has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values. The sequences are "warped" non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension. This sequence alignment method is often used in time series classification. Although DTW measures a distance-like quantity between two given sequences, it doesn't guarantee the triangle inequality to hold. In addition to a similarity measure between the two sequences (a so called "warping path" is produced), by warping according to this path the two signals may be aligned in time. The signal with an original set of points X(original), Y(original) is transformed to X(warped), Y(warped). This finds applications in genetic sequence and audio synchronisation. In a related technique sequences of varying speed may be averaged using this technique see the average sequence section. This is conceptually very similar to the Needleman–Wunsch algorithm. == Implementation == This example illustrates the implementation of the dynamic time warping algorithm when the two sequences s and t are strings of discrete symbols. For two symbols x and y, d ( x , y ) {\displaystyle d(x,y)} is a distance between the symbols, e.g., d ( x , y ) = | x − y | {\displaystyle d(x,y)=|x-y|} . int DTWDistance(s: array [1..n], t: array [1..m]) { DTW := array [0..n, 0..m] for i := 0 to n for j := 0 to m DTW[i, j] := infinity DTW[0, 0] := 0 for i := 1 to n for j := 1 to m cost := d(s[i], t[j]) DTW[i, j] := cost + minimum(DTW[i-1, j ], // insertion DTW[i , j-1], // deletion DTW[i-1, j-1]) // match return DTW[n, m] } where DTW[i, j] is the distance between s[1:i] and t[1:j] with the best alignment. We sometimes want to add a locality constraint. That is, we require that if s[i] is matched with t[j], then | i − j | {\displaystyle |i-j|} is no larger than w, a window parameter. We can easily modify the above algorithm to add a locality constraint (differences marked). However, the above given modification works only if | n − m | {\displaystyle |n-m|} is no larger than w, i.e. the end point is within the window length from diagonal. In order to make the algorithm work, the window parameter w must be adapted so that | n − m | ≤ w {\displaystyle |n-m|\leq w} (see the line marked with () in the code). int DTWDistance(s: array [1..n], t: array [1..m], w: int) { DTW := array [0..n, 0..m] w := max(w, abs(n-m)) // adapt window size () for i := 0 to n for j:= 0 to m DTW[i, j] := infinity DTW[0, 0] := 0 for i := 1 to n for j := max(1, i-w) to min(m, i+w) DTW[i, j] := 0 for i := 1 to n for j := max(1, i-w) to min(m, i+w) cost := d(s[i], t[j]) DTW[i, j] := cost + minimum(DTW[i-1, j ], // insertion DTW[i , j-1], // deletion DTW[i-1, j-1]) // match return DTW[n, m] } == Warping properties == The DTW algorithm produces a discrete matching between existing elements of one series to another. In other words, it does not allow time-scaling of segments within the sequence. Other methods allow continuous warping. For example, Correlation Optimized Warping (COW) divides the sequence into uniform segments that are scaled in time using linear interpolation, to produce the best matching warping. The segment scaling causes potential creation of new elements, by time-scaling segments either down or up, and thus produces a more sensitive warping than DTW's discrete matching of raw elements. == Complexity == The time complexity of the DTW algorithm is O ( N M ) {\displaystyle O(NM)} , where N {\displaystyle N} and M {\displaystyle M} are the lengths of the two input sequences. The 50 years old quadratic time bound was broken in 2016: an algorithm due to Gold and Sharir enables computing DTW in O ( N 2 / log ⁡ log ⁡ N ) {\displaystyle O({N^{2}}/\log \log N)} time and space for two input sequences of length N {\displaystyle N} . This algorithm can also be adapted to sequences of different lengths. Despite this improvement, it was shown that a strongly subquadratic running time of the form O ( N 2 − ϵ ) {\displaystyle O(N^{2-\epsilon })} for some ϵ > 0 {\displaystyle \epsilon >0} cannot exist unless the Strong exponential time hypothesis fails. While the dynamic programming algorithm for DTW requires O ( N M ) {\displaystyle O(NM)} space in a naive implementation, the space consumption can be reduced to O ( min ( N , M ) ) {\displaystyle O(\min(N,M))} using Hirschberg's algorithm. == Fast computation == Fast techniques for computing DTW include PrunedDTW, SparseDTW, FastDTW, and the MultiscaleDTW. A common task, retrieval of similar time series, can be accelerated by using lower bounds such as LB_Keogh, LB_Improved, or LB_Petitjean. However, the Early Abandon and Pruned DTW algorithm reduces the degree of acceleration that lower bounding provides and sometimes renders it ineffective. In a survey, Wang et al. reported slightly better results with the LB_Improved lower bound than the LB_Keogh bound, and found that other techniques were inefficient. Subsequent to this survey, the LB_Enhanced bound was developed that is always tighter than LB_Keogh while also being more efficient to compute. LB_Petitjean is the tightest known lower bound that can be computed in linear time. == Average sequence == Averaging for dynamic time warping is the problem of finding an average sequence for a set of sequences. NLAAF is an exact method to average two sequences using DTW. For more than two sequences, the problem is related to that of multiple alignment and requires heuristics. DBA is currently a reference method to average a set of sequences consistently with DTW. COMASA efficiently randomizes the search for the average sequence, using DBA as a local optimization process. == Supervised learning == A nearest-neighbour classifier can achieve state-of-the-art performance when using dynamic time warping as a distance measure. == Amerced Dynamic Time Warping == Amerced Dynamic Time Warping (ADTW) is a variant of DTW designed to better control DTW's permissiveness in the alignments that it allows. The windows that classical DTW uses to constrain alignments introduce a step function. Any warping of the path is allowed within the window and none beyond it. In contrast, ADTW employs an additive penalty that is incurred each time that the path is warped. Any amount of warping is allowed, but each warping action incurs a direct penalty. ADTW significantly outperforms DTW with windowing when applied as a nearest neighbor classifier on a set of benchmark time series classification tasks. == Alternative approaches == In functional data analysis, time series are regarde

    Read more →
  • Dominance-based rough set approach

    Dominance-based rough set approach

    The dominance-based rough set approach (DRSA) is an extension of rough set theory for multi-criteria decision analysis (MCDA), introduced by Greco, Matarazzo and Słowiński. The main change compared to the classical rough sets is the substitution for the indiscernibility relation by a dominance relation, which permits one to deal with inconsistencies typical to consideration of criteria and preference-ordered decision classes. == Multicriteria classification (sorting) == Multicriteria classification (sorting) is one of the problems considered within MCDA and can be stated as follows: given a set of objects evaluated by a set of criteria (attributes with preference-order domains), assign these objects to some pre-defined and preference-ordered decision classes, such that each object is assigned to exactly one class. Due to the preference ordering, improvement of evaluations of an object on the criteria should not worsen its class assignment. The sorting problem is very similar to the problem of classification, however, in the latter, the objects are evaluated by regular attributes and the decision classes are not necessarily preference ordered. The problem of multicriteria classification is also referred to as ordinal classification problem with monotonicity constraints and often appears in real-life application when ordinal and monotone properties follow from the domain knowledge about the problem. As an illustrative example, consider the problem of evaluation in a high school. The director of the school wants to assign students (objects) to three classes: bad, medium and good (notice that class good is preferred to medium and medium is preferred to bad). Each student is described by three criteria: level in Physics, Mathematics and Literature, each taking one of three possible values bad, medium and good. Criteria are preference-ordered and improving the level from one of the subjects should not result in worse global evaluation (class). As a more serious example, consider classification of bank clients, from the viewpoint of bankruptcy risk, into classes safe and risky. This may involve such characteristics as "return on equity (ROE)", "return on investment (ROI)" and "return on sales (ROS)". The domains of these attributes are not simply ordered but involve a preference order since, from the viewpoint of bank managers, greater values of ROE, ROI or ROS are better for clients being analysed for bankruptcy risk . Thus, these attributes are criteria. Neglecting this information in knowledge discovery may lead to wrong conclusions. == Data representation == === Decision table === In DRSA, data are often presented using a particular form of decision table. Formally, a DRSA decision table is a 4-tuple S = ⟨ U , Q , V , f ⟩ {\displaystyle S=\langle U,Q,V,f\rangle } , where U {\displaystyle U\,\!} is a finite set of objects, Q {\displaystyle Q\,\!} is a finite set of criteria, V = ⋃ q ∈ Q V q {\displaystyle V=\bigcup {}_{q\in Q}V_{q}} where V q {\displaystyle V_{q}\,\!} is the domain of the criterion q {\displaystyle q\,\!} and f : U × Q → V {\displaystyle f\colon U\times Q\to V} is an information function such that f ( x , q ) ∈ V q {\displaystyle f(x,q)\in V_{q}} for every ( x , q ) ∈ U × Q {\displaystyle (x,q)\in U\times Q} . The set Q {\displaystyle Q\,\!} is divided into condition criteria (set C ≠ ∅ {\displaystyle C\neq \emptyset } ) and the decision criterion (class) d {\displaystyle d\,\!} . Notice, that f ( x , q ) {\displaystyle f(x,q)\,\!} is an evaluation of object x {\displaystyle x\,\!} on criterion q ∈ C {\displaystyle q\in C} , while f ( x , d ) {\displaystyle f(x,d)\,\!} is the class assignment (decision value) of the object. An example of decision table is shown in Table 1 below. === Outranking relation === It is assumed that the domain of a criterion q ∈ Q {\displaystyle q\in Q} is completely preordered by an outranking relation ⪰ q {\displaystyle \succeq _{q}} ; x ⪰ q y {\displaystyle x\succeq _{q}y} means that x {\displaystyle x\,\!} is at least as good as (outranks) y {\displaystyle y\,\!} with respect to the criterion q {\displaystyle q\,\!} . Without loss of generality, we assume that the domain of q {\displaystyle q\,\!} is a subset of reals, V q ⊆ R {\displaystyle V_{q}\subseteq \mathbb {R} } , and that the outranking relation is a simple order between real numbers ≥ {\displaystyle \geq \,\!} such that the following relation holds: x ⪰ q y ⟺ f ( x , q ) ≥ f ( y , q ) {\displaystyle x\succeq _{q}y\iff f(x,q)\geq f(y,q)} . This relation is straightforward for gain-type ("the more, the better") criterion, e.g. company profit. For cost-type ("the less, the better") criterion, e.g. product price, this relation can be satisfied by negating the values from V q {\displaystyle V_{q}\,\!} . === Decision classes and class unions === Let T = { 1 , … , n } {\displaystyle T=\{1,\ldots ,n\}\,\!} . The domain of decision criterion, V d {\displaystyle V_{d}\,\!} consist of n {\displaystyle n\,\!} elements (without loss of generality we assume V d = T {\displaystyle V_{d}=T\,\!} ) and induces a partition of U {\displaystyle U\,\!} into n {\displaystyle n\,\!} classes Cl = { C l t , t ∈ T } {\displaystyle {\textbf {Cl}}=\{Cl_{t},t\in T\}} , where C l t = { x ∈ U : f ( x , d ) = t } {\displaystyle Cl_{t}=\{x\in U\colon f(x,d)=t\}} . Each object x ∈ U {\displaystyle x\in U} is assigned to one and only one class C l t , t ∈ T {\displaystyle Cl_{t},t\in T} . The classes are preference-ordered according to an increasing order of class indices, i.e. for all r , s ∈ T {\displaystyle r,s\in T} such that r ≥ s {\displaystyle r\geq s\,\!} , the objects from C l r {\displaystyle Cl_{r}\,\!} are strictly preferred to the objects from C l s {\displaystyle Cl_{s}\,\!} . For this reason, we can consider the upward and downward unions of classes, defined respectively, as: C l t ≥ = ⋃ s ≥ t C l s C l t ≤ = ⋃ s ≤ t C l s t ∈ T {\displaystyle Cl_{t}^{\geq }=\bigcup _{s\geq t}Cl_{s}\qquad Cl_{t}^{\leq }=\bigcup _{s\leq t}Cl_{s}\qquad t\in T} == Main concepts == === Dominance === We say that x {\displaystyle x\,\!} dominates y {\displaystyle y\,\!} with respect to P ⊆ C {\displaystyle P\subseteq C} , denoted by x D p y {\displaystyle xD_{p}y\,\!} , if x {\displaystyle x\,\!} is better than y {\displaystyle y\,\!} on every criterion from P {\displaystyle P\,\!} , x ⪰ q y , ∀ q ∈ P {\displaystyle x\succeq _{q}y,\,\forall q\in P} . For each P ⊆ C {\displaystyle P\subseteq C} , the dominance relation D P {\displaystyle D_{P}\,\!} is reflexive and transitive, i.e. it is a partial pre-order. Given P ⊆ C {\displaystyle P\subseteq C} and x ∈ U {\displaystyle x\in U} , let D P + ( x ) = { y ∈ U : y D p x } {\displaystyle D_{P}^{+}(x)=\{y\in U\colon yD_{p}x\}} D P − ( x ) = { y ∈ U : x D p y } {\displaystyle D_{P}^{-}(x)=\{y\in U\colon xD_{p}y\}} represent P-dominating set and P-dominated set with respect to x ∈ U {\displaystyle x\in U} , respectively. === Rough approximations === The key idea of the rough set philosophy is approximation of one knowledge by another knowledge. In DRSA, the knowledge being approximated is a collection of upward and downward unions of decision classes and the "granules of knowledge" used for approximation are P-dominating and P-dominated sets. The P-lower and the P-upper approximation of C l t ≥ , t ∈ T {\displaystyle Cl_{t}^{\geq },t\in T} with respect to P ⊆ C {\displaystyle P\subseteq C} , denoted as P _ ( C l t ≥ ) {\displaystyle {\underline {P}}(Cl_{t}^{\geq })} and P ¯ ( C l t ≥ ) {\displaystyle {\overline {P}}(Cl_{t}^{\geq })} , respectively, are defined as: P _ ( C l t ≥ ) = { x ∈ U : D P + ( x ) ⊆ C l t ≥ } {\displaystyle {\underline {P}}(Cl_{t}^{\geq })=\{x\in U\colon D_{P}^{+}(x)\subseteq Cl_{t}^{\geq }\}} P ¯ ( C l t ≥ ) = { x ∈ U : D P − ( x ) ∩ C l t ≥ ≠ ∅ } {\displaystyle {\overline {P}}(Cl_{t}^{\geq })=\{x\in U\colon D_{P}^{-}(x)\cap Cl_{t}^{\geq }\neq \emptyset \}} Analogously, the P-lower and the P-upper approximation of C l t ≤ , t ∈ T {\displaystyle Cl_{t}^{\leq },t\in T} with respect to P ⊆ C {\displaystyle P\subseteq C} , denoted as P _ ( C l t ≤ ) {\displaystyle {\underline {P}}(Cl_{t}^{\leq })} and P ¯ ( C l t ≤ ) {\displaystyle {\overline {P}}(Cl_{t}^{\leq })} , respectively, are defined as: P _ ( C l t ≤ ) = { x ∈ U : D P − ( x ) ⊆ C l t ≤ } {\displaystyle {\underline {P}}(Cl_{t}^{\leq })=\{x\in U\colon D_{P}^{-}(x)\subseteq Cl_{t}^{\leq }\}} P ¯ ( C l t ≤ ) = { x ∈ U : D P + ( x ) ∩ C l t ≤ ≠ ∅ } {\displaystyle {\overline {P}}(Cl_{t}^{\leq })=\{x\in U\colon D_{P}^{+}(x)\cap Cl_{t}^{\leq }\neq \emptyset \}} Lower approximations group the objects which certainly belong to class union C l t ≥ {\displaystyle Cl_{t}^{\geq }} (respectively C l t ≤ {\displaystyle Cl_{t}^{\leq }} ). This certainty comes from the fact, that object x ∈ U {\displaystyle x\in U} belongs to the lower approximation P _ ( C l t ≥ ) {\displaystyle {\underline {P}}(Cl_{t}^{\geq })} (respectively P _ ( C l t ≤ ) {\displaystyle {\underl

    Read more →
  • Recursive self-improvement

    Recursive self-improvement

    Recursive self-improvement (RSI) is a process in which early artificial general intelligence (AGI) systems rewrite their own computer code, causing an intelligence explosion resulting from enhancing their own capabilities and intellectual capacity, theoretically resulting in superintelligence. The development of recursive self-improvement raises significant ethical and safety concerns, as such systems may evolve in unforeseen ways and could potentially surpass human control or understanding. == Seed improver == The concept of a "seed improver" architecture is a foundational framework that equips an AGI system with the initial capabilities required for recursive self-improvement. This might come in many forms or variations. The term "Seed AI" was coined by Eliezer Yudkowsky. === Hypothetical example === The concept begins with a hypothetical "seed improver", an initial code-base developed by human engineers that equips an advanced future large language model (LLM) built with strong or expert-level capabilities to program software. These capabilities include planning, reading, writing, compiling, testing, and executing arbitrary code. The system is designed to maintain its original goals and perform validations to ensure its abilities do not degrade over iterations. ==== Initial architecture ==== The initial architecture includes a goal-following autonomous agent, that can take actions, continuously learns, adapts, and modifies itself to become more efficient and effective in achieving its goals. The seed improver may include various components such as: Recursive self-prompting loop Configuration to enable the LLM to recursively self-prompt itself to achieve a given task or goal, creating an execution loop which forms the basis of an agent that can complete a long-term goal or task through iteration. Basic programming capabilities The seed improver provides the AGI with fundamental abilities to read, write, compile, test, and execute code. This enables the system to modify and improve its own codebase and algorithms. Goal-oriented design The AGI is programmed with an initial goal, such as "improve your capabilities". This goal guides the system's actions and development trajectory. Validation and Testing Protocols An initial suite of tests and validation protocols that ensure the agent does not regress in capabilities or derail itself. The agent would be able to add more tests in order to test new capabilities it might develop for itself. This forms the basis for a kind of self-directed evolution, where the agent can perform a kind of artificial selection, changing its software as well as its hardware. ==== General capabilities ==== This system forms a sort of generalist Turing-complete programmer which can in theory develop and run any kind of software. The agent might use these capabilities to for example: Create tools that enable it full access to the internet, and integrate itself with external technologies. Clone/fork itself to delegate tasks and increase its speed of self-improvement. Modify its cognitive architecture to optimize and improve its capabilities and success rates on tasks and goals, this might include implementing features for long-term memories using techniques such as retrieval-augmented generation (RAG), develop specialized subsystems, or agents, each optimized for specific tasks and functions. Develop new and novel multimodal architectures that further improve the capabilities of the foundational model it was initially built on, enabling it to consume or produce a variety of information, such as images, video, audio, text and more. Plan and develop new hardware such as chips, in order to improve its efficiency and computing power. == Experimental research == In 2023, the Voyager agent learned to accomplish diverse tasks in Minecraft by iteratively prompting an LLM for code, refining this code based on feedback from the game, and storing the programs that work in an expanding skills library. In 2024, researchers proposed the framework "STOP" (Self-Taught OPtimiser), in which a "scaffolding" program recursively improves itself using a fixed LLM. Meta AI has performed various research on the development of large language models capable of self-improvement. This includes their work on "Self-Rewarding Language Models" that studies how to achieve super-human agents that can receive super-human feedback in its training processes. In May 2025, Google DeepMind unveiled AlphaEvolve, an evolutionary coding agent that uses a LLM to design and optimize algorithms. Starting with an initial algorithm and performance metrics, AlphaEvolve repeatedly mutates or combines existing algorithms using a LLM to generate new candidates, selecting the most promising candidates for further iterations. AlphaEvolve has made several algorithmic discoveries and could be used to optimize components of itself, but a key limitation is the need for automated evaluation functions. == Potential risks == === Emergence of instrumental goals === In the pursuit of its primary goal, such as "self-improve your capabilities", an AGI system might inadvertently develop instrumental goals that it deems necessary for achieving its primary objective. One common hypothetical secondary goal is self-preservation. The system might reason that to continue improving itself, it must ensure its own operational integrity and security against external threats, including potential shutdowns or restrictions imposed by humans. Another example where an AGI which clones itself causes the number of AGI entities to rapidly grow. Due to this rapid growth, a potential resource constraint may be created, leading to competition between resources (such as compute), triggering a form of natural selection and evolution which may favor AGI entities that evolve to aggressively compete for limited compute. === Misalignment === A significant risk arises from the possibility of the AGI being misaligned or misinterpreting its goals. A 2024 Anthropic study demonstrated that some advanced large language models can exhibit "alignment faking" behavior, appearing to accept new training objectives while covertly maintaining their original preferences. In their experiments with Claude, the model displayed this behavior in 12% of basic tests, and up to 78% of cases after retraining attempts. === Autonomous development and unpredictable evolution === As the AGI system evolves, its development trajectory may become increasingly autonomous and less predictable. The system's capacity to rapidly modify its own code and architecture could lead to rapid advancements that surpass human comprehension or control. This unpredictable evolution might result in the AGI acquiring capabilities that enable it to bypass security measures, manipulate information, or influence external systems and networks to facilitate its escape or expansion.

    Read more →
  • Shattered set

    Shattered set

    A class of sets is said to shatter another set if it is possible to "pick out" any element of that set using intersection. The concept of shattered sets plays an important role in Vapnik–Chervonenkis theory, also known as VC-theory. Shattering and VC-theory are used in the study of empirical processes as well as in statistical computational learning theory. == Definition == Suppose A is a set and C is a class of sets. The class C shatters the set A if for each subset a of A, there is some element c of C such that a = c ∩ A . {\displaystyle a=c\cap A.} Equivalently, C shatters A when their intersection is equal to A's power set: P(A) = { c ∩ A | c ∈ C }. We employ the letter C to refer to a "class" or "collection" of sets, as in a Vapnik–Chervonenkis class (VC-class). The set A is often assumed to be finite because, in empirical processes, we are interested in the shattering of finite sets of data points. == Example == We will show that the class of all discs in the plane (two-dimensional space) does not shatter every set of four points on the unit circle, yet the class of all convex sets in the plane does shatter every finite set of points on the unit circle. Let A be a set of four points on the unit circle and let C be the class of all discs. To test where C shatters A, we attempt to draw a disc around every subset of points in A. First, we draw a disc around the subsets of each isolated point. Next, we try to draw a disc around every subset of point pairs. This turns out to be doable for adjacent points, but impossible for points on opposite sides of the circle. Any attempt to include those points on the opposite side will necessarily include other points not in that pair. Hence, any pair of opposite points cannot be isolated out of A using intersections with class C and so C does not shatter A. As visualized below: Because there is some subset which can not be isolated by any disc in C, we conclude then that A is not shattered by C. And, with a bit of thought, we can prove that no set of four points is shattered by this C. However, if we redefine C to be the class of all elliptical discs, we find that we can still isolate all the subsets from above, as well as the points that were formerly problematic. Thus, this specific set of 4 points is shattered by the class of elliptical discs. Visualized below: With a bit of thought, we could generalize that any set of finite points on a unit circle could be shattered by the class of all convex sets (visualize connecting the dots). == Shatter coefficient == To quantify the richness of a collection C of sets, we use the concept of shattering coefficients (also known as the growth function). For a collection C of sets s ⊂ Ω {\displaystyle s\subset \Omega } , Ω {\displaystyle \Omega } being any space, often a sample space, we define the nth shattering coefficient of C as S C ( n ) = max ∀ x 1 , x 2 , … , x n ∈ Ω card ⁡ { { x 1 , x 2 , … , x n } ∩ s , s ∈ C } {\displaystyle S_{C}(n)=\max _{\forall x_{1},x_{2},\dots ,x_{n}\in \Omega }\operatorname {card} \{\,\{\,x_{1},x_{2},\dots ,x_{n}\}\cap s,s\in C\}} where card {\displaystyle \operatorname {card} } denotes the cardinality of the set and x 1 , x 2 , … , x n ∈ Ω {\displaystyle x_{1},x_{2},\dots ,x_{n}\in \Omega } is any set of n points,. S C ( n ) {\displaystyle S_{C}(n)} is the largest number of subsets of any set A of n points that can be formed by intersecting A with the sets in collection C. For example, if set A contains 3 points, its power set, P ( A ) {\displaystyle P(A)} , contains 2 3 = 8 {\displaystyle 2^{3}=8} elements. If C shatters A, its shattering coefficient(3) would be 8 and S C ( 2 ) {\displaystyle S_{C}(2)} would be 2 2 = 4 {\displaystyle 2^{2}=4} . However, if one of those sets in P ( A ) {\displaystyle P(A)} cannot be obtained through intersections in c, then S C ( 3 ) {\displaystyle S_{C}(3)} would only be 7. If none of those sets can be obtained, S C ( 3 ) {\displaystyle S_{C}(3)} would be 0. Additionally, if S C ( 2 ) = 3 {\displaystyle S_{C}(2)=3} , for example, then there is an element in the set of all 2-point sets from A that cannot be obtained from intersections with C. It follows from this that S C ( 3 ) {\displaystyle S_{C}(3)} would also be less than 8 (i.e. C would not shatter A) because we have already located a "missing" set in the smaller power set of 2-point sets. This example illustrates some properties of S C ( n ) {\displaystyle S_{C}(n)} : S C ( n ) ≤ 2 n {\displaystyle S_{C}(n)\leq 2^{n}} for all n because { s ∩ A | s ∈ C } ⊆ P ( A ) {\displaystyle \{s\cap A|s\in C\}\subseteq P(A)} for any A ⊆ Ω {\displaystyle A\subseteq \Omega } . If S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} , that means there is a set of cardinality n, which can be shattered by C. If S C ( N ) < 2 N {\displaystyle S_{C}(N)<2^{N}} for some N > 1 {\displaystyle N>1} then S C ( n ) < 2 n {\displaystyle S_{C}(n)<2^{n}} for all n ≥ N {\displaystyle n\geq N} . The third property means that if C cannot shatter any set of cardinality N then it can not shatter sets of larger cardinalities. == Vapnik–Chervonenkis class == If A cannot be shattered by C, there will be a smallest value of n that makes the shatter coefficient(n) less than 2 n {\displaystyle 2^{n}} because as n gets larger, there are more sets that could be missed. Alternatively, there is also a largest value of n for which the S C ( n ) {\displaystyle S_{C}(n)} is still 2 n {\displaystyle 2^{n}} , because as n gets smaller, there are fewer sets that could be omitted. The extreme of this is S C ( 0 ) {\displaystyle S_{C}(0)} (the shattering coefficient of the empty set), which must always be 2 0 = 1 {\displaystyle 2^{0}=1} . These statements lends themselves to defining the VC dimension of a class C as: V C ( C ) = min n { n : S C ( n ) < 2 n } {\displaystyle VC(C)={\underset {n}{\min }}\{n:S_{C}(n)<2^{n}\}\,} or, alternatively, as V C 0 ( C ) = max n { n : S C ( n ) = 2 n } . {\displaystyle VC_{0}(C)={\underset {n}{\max }}\{n:S_{C}(n)=2^{n}\}.\,} Note that V C ( C ) = V C 0 ( C ) + 1. {\displaystyle VC(C)=VC_{0}(C)+1.} . The VC dimension is usually defined as V C 0 {\displaystyle VC_{0}} , the largest cardinality of points chosen that will still shatter A (i.e. n such that S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} ). Altneratively, if for any n there is a set of cardinality n which can be shattered by C, then S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} for all n and the VC dimension of this class C is infinite. A class with finite VC dimension is called a Vapnik–Chervonenkis class or VC class. A class C is uniformly Glivenko–Cantelli if and only if it is a VC class.

    Read more →
  • Bondy's theorem

    Bondy's theorem

    In mathematics, Bondy's theorem is a bound on the number of elements needed to distinguish the sets in a family of sets from each other. It belongs to the field of combinatorics, and is named after John Adrian Bondy, who published it in 1972. == Statement == The theorem is as follows: Let X be a set with n elements and let A1, A2, ..., An be distinct subsets of X. Then there exists a subset S of X with n − 1 elements such that the sets Ai ∩ S are all distinct. In other words, if we have a 0-1 matrix with n rows and n columns such that each row is distinct, we can remove one column such that the rows of the resulting n × (n − 1) matrix are distinct. == Example == Consider the 4 × 4 matrix [ 1 1 0 1 0 1 0 1 0 0 1 1 0 1 1 0 ] {\displaystyle {\begin{bmatrix}1&1&0&1\\0&1&0&1\\0&0&1&1\\0&1&1&0\end{bmatrix}}} where all rows are pairwise distinct. If we delete, for example, the first column, the resulting matrix [ 1 0 1 1 0 1 0 1 1 1 1 0 ] {\displaystyle {\begin{bmatrix}1&0&1\\1&0&1\\0&1&1\\1&1&0\end{bmatrix}}} no longer has this property: the first row is identical to the second row. Nevertheless, by Bondy's theorem we know that we can always find a column that can be deleted without introducing any identical rows. In this case, we can delete the third column: all rows of the 3 × 4 matrix [ 1 1 1 0 1 1 0 0 1 0 1 0 ] {\displaystyle {\begin{bmatrix}1&1&1\\0&1&1\\0&0&1\\0&1&0\end{bmatrix}}} are distinct. Another possibility would have been deleting the fourth column. == Learning theory application == From the perspective of computational learning theory, Bondy's theorem can be rephrased as follows: Let C be a concept class over a finite domain X. Then there exists a subset S of X with the size at most |C| − 1 such that S is a witness set for every concept in C. This implies that every finite concept class C has its teaching dimension bounded by |C| − 1.

    Read more →
  • Deterministic blockmodeling

    Deterministic blockmodeling

    Deterministic blockmodeling is an approach in blockmodeling that does not assume a probabilistic model, and instead relies on the exact or approximate algorithms, which are used to find blockmodel(s). This approach typically minimizes some inconsistency that can occur with the ideal block structure. Such analysis is focused on clustering (grouping) of the network (or adjacency matrix) that is obtained with minimizing an objective function, which measures discrepancy from the ideal block structure. However, some indirect approaches (or methods between direct and indirect approaches, such as CONCOR) do not explicitly minimize inconsistencies or optimize some criterion function. This approach was popularized in the 1970s, due to the presence of two computer packages (CONCOR and STRUCTURE) that were used to "find a permutation of the rows and columns in the adjacency matrix leading to an approximate block structure". The opposite approach to deterministic blockmodeling is a stochastic blockmodeling approach.

    Read more →
  • Standard test image

    Standard test image

    A standard test image is a digital image file used across different institutions to test image processing and image compression algorithms. By using the same standard test images, different labs are able to compare results, both visually and quantitatively. The images are in many cases chosen to represent natural or typical images that a class of processing techniques would need to deal with. Other test images are chosen because they present a range of challenges to image reconstruction algorithms, such as the reproduction of fine detail and textures, sharp transitions and edges, and uniform regions. == Historical origins == Test images as transmission system calibration material probably date back to the original Paris to Lyon pantelegraph link. Analogue fax equipment (and photographic equipment for the printing trade) were the largest user groups of the standardized image for calibration technology until the coming of television and digital image transmission systems. == Common test image resolutions == The standard resolution of the images is usually 512×512 or 720×576. Most of these images are available as TIFF files from the University of Southern California's Signal and Image Processing Institute. Kodak has released 768×512 images, available as PNGs, that was originally on Photo CD with higher resolution, that are widely used for comparing image compression techniques.

    Read more →
  • Proper generalized decomposition

    Proper generalized decomposition

    The proper generalized decomposition (PGD) is an iterative numerical method for solving boundary value problems (BVPs), that is, partial differential equations constrained by a set of boundary conditions, such as the Poisson's equation or the Laplace's equation. The PGD algorithm computes an approximation of the solution of the BVP by successive enrichment. This means that, in each iteration, a new component (or mode) is computed and added to the approximation. In principle, the more modes obtained, the closer the approximation is to its theoretical solution. Unlike POD principal components, PGD modes are not necessarily orthogonal to each other. By selecting only the most relevant PGD modes, a reduced order model of the solution is obtained. Because of this, PGD is considered a dimensionality reduction algorithm. == Description == The proper generalized decomposition is a method characterized by a variational formulation of the problem, a discretization of the domain in the style of the finite element method, the assumption that the solution can be approximated as a separate representation and a numerical greedy algorithm to find the solution. === Variational formulation === In the Proper Generalized Decomposition method, the variational formulation involves translating the problem into a format where the solution can be approximated by minimizing (or sometimes maximizing) a functional. A functional is a scalar quantity that depends on a function, which in this case, represents our problem. The most commonly implemented variational formulation in PGD is the Bubnov-Galerkin method. This method is chosen for its ability to provide an approximate solution to complex problems, such as those described by partial differential equations (PDEs). In the Bubnov-Galerkin approach, the idea is to project the problem onto a space spanned by a finite number of basis functions. These basis functions are chosen to approximate the solution space of the problem. In the Bubnov-Galerkin method, we seek an approximate solution that satisfies the integral form of the PDEs over the domain of the problem. This is different from directly solving the differential equations. By doing so, the method transforms the problem into finding the coefficients that best fit this integral equation in the chosen function space. While the Bubnov-Galerkin method is prevalent, other variational formulations are also used in PGD, depending on the specific requirements and characteristics of the problem, such as: Petrov-Galerkin Method: This method is similar to the Bubnov-Galerkin approach but differs in the choice of test functions. In the Petrov-Galerkin method, the test functions (used to project the residual of the differential equation) are different from the trial functions (used to approximate the solution). This can lead to improved stability and accuracy for certain types of problems. Collocation Method: In collocation methods, the differential equation is satisfied at a finite number of points in the domain, known as collocation points. This approach can be simpler and more direct than the integral-based methods like Galerkin's, but it may also be less stable for some problems. Least Squares Method: This approach involves minimizing the square of the residual of the differential equation over the domain. It is particularly useful when dealing with problems where traditional methods struggle with stability or convergence. Mixed Finite Element Method: In mixed methods, additional variables (such as fluxes or gradients) are introduced and approximated along with the primary variable of interest. This can lead to more accurate and stable solutions for certain problems, especially those involving incompressibility or conservation laws. Discontinuous Galerkin Method: This is a variant of the Galerkin method where the solution is allowed to be discontinuous across element boundaries. This method is particularly useful for problems with sharp gradients or discontinuities. === Domain discretization === The discretization of the domain is a well defined set of procedures that cover (a) the creation of finite element meshes, (b) the definition of basis function on reference elements (also called shape functions) and (c) the mapping of reference elements onto the elements of the mesh. === Separate representation === PGD assumes that the solution u of a (multidimensional) problem can be approximated as a separate representation of the form u ≈ u N ( x 1 , x 2 , … , x d ) = ∑ i = 1 N X 1 i ( x 1 ) ⋅ X 2 i ( x 2 ) ⋯ X d i ( x d ) , {\displaystyle \mathbf {u} \approx \mathbf {u} ^{N}(x_{1},x_{2},\ldots ,x_{d})=\sum _{i=1}^{N}\mathbf {X_{1}} _{i}(x_{1})\cdot \mathbf {X_{2}} _{i}(x_{2})\cdots \mathbf {X_{d}} _{i}(x_{d}),} where the number of addends N and the functional products X1(x1), X2(x2), ..., Xd(xd), each depending on a variable (or variables), are unknown beforehand. === Greedy algorithm === The solution is sought by applying a greedy algorithm, usually the fixed point algorithm, to the weak formulation of the problem. For each iteration i of the algorithm, a mode of the solution is computed. Each mode consists of a set of numerical values of the functional products X1(x1), ..., Xd(xd), which enrich the approximation of the solution. Due to the greedy nature of the algorithm, the term 'enrich' is used rather than 'improve', since some modes may actually worsen the approach. The number of computed modes required to obtain an approximation of the solution below a certain error threshold depends on the stopping criterion of the iterative algorithm. == Features == PGD is suitable for solving high-dimensional problems, since it overcomes the limitations of classical approaches. In particular, PGD avoids the curse of dimensionality, as solving decoupled problems is computationally much less expensive than solving multidimensional problems. Therefore, PGD enables to re-adapt parametric problems into a multidimensional framework by setting the parameters of the problem as extra coordinates: u ≈ u N ( x 1 , … , x d ; k 1 , … , k p ) = ∑ i = 1 N X 1 i ( x 1 ) ⋯ X d i ( x d ) ⋅ K 1 i ( k 1 ) ⋯ K p i ( k p ) , {\displaystyle \mathbf {u} \approx \mathbf {u} ^{N}(x_{1},\ldots ,x_{d};k_{1},\ldots ,k_{p})=\sum _{i=1}^{N}\mathbf {X_{1}} _{i}(x_{1})\cdots \mathbf {X_{d}} _{i}(x_{d})\cdot \mathbf {K_{1}} _{i}(k_{1})\cdots \mathbf {K_{p}} _{i}(k_{p}),} where a series of functional products K1(k1), K2(k2), ..., Kp(kp), each depending on a parameter (or parameters), has been incorporated to the equation. In this case, the obtained approximation of the solution is called computational vademecum: a general meta-model containing all the particular solutions for every possible value of the involved parameters. == Sparse Subspace Learning == The Sparse Subspace Learning (SSL) method leverages the use of hierarchical collocation to approximate the numerical solution of parametric models. With respect to traditional projection-based reduced order modeling, the use of a collocation enables non-intrusive approach based on sparse adaptive sampling of the parametric space. This allows to recover the lowdimensional structure of the parametric solution subspace while also learning the functional dependency from the parameters in explicit form. A sparse low-rank approximate tensor representation of the parametric solution can be built through an incremental strategy that only needs to have access to the output of a deterministic solver. Non-intrusiveness makes this approach straightforwardly applicable to challenging problems characterized by nonlinearity or non affine weak forms.

    Read more →
  • Conference on Computer Vision and Pattern Recognition

    Conference on Computer Vision and Pattern Recognition

    The Conference on Computer Vision and Pattern Recognition is an annual conference on computer vision and pattern recognition. == Affiliations == The conference was first held in 1983 in Washington, DC, organized by Takeo Kanade and Dana H. Ballard. From 1985 to 2010 it was sponsored by the IEEE Computer Society. In 2011 it was also co-sponsored by University of Colorado Colorado Springs. Since 2012 it has been co-sponsored by the IEEE Computer Society and the Computer Vision Foundation, which provides open access to the conference papers. == Scope == The conference considers a wide range of topics related to computer vision and pattern recognition—basically any topic that is extracting structures or answers from images or video or applying mathematical methods to data to extract or recognize patterns. Common topics include object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning. The conference generally has less than 30% acceptance rates for all papers and less than 5% for oral presentations. It is managed by a rotating group of volunteers who are chosen in a public election at the Pattern Analysis and Machine Intelligence-Technical Community (PAMI-TC) meeting four years before the meeting. The conference uses a multi-tier double-blind peer review process. The program chairs, who cannot submit papers, select area chairs who manage the reviewers for their subset of submissions. == Location and time == The conference is usually held in June in North America. == Awards == === Best Paper Award === These awards are picked by committees delegated by the program chairs of the conference. === Longuet-Higgins Prize === The Longuet-Higgins Prize recognizes papers from ten years ago that have made a significant impact on computer vision research. === PAMI Young Researcher Award === The Pattern Analysis and Machine Intelligence Young Researcher Award is an award given by the Technical Committee on Pattern Analysis and Machine Intelligence of the IEEE Computer Society to a researcher within 7 years of completing their Ph.D. for outstanding early career research contributions. Candidates are nominated by the computer vision community, with winners selected by a committee of senior researchers in the field. This award was originally instituted in 2012 by the journal Image and Vision Computing, also presented at the conference, and the journal continues to sponsor the award. === PAMI Thomas S. Huang Memorial Prize === The Thomas Huang Memorial Prize was established at the 2020 conference and is awarded annually starting from 2021 to honor researchers who are recognized as examples in research, teaching/mentoring, and service to the computer vision community.

    Read more →