The AI Con: How to Fight Big Tech's Hype and Create the Future We Want is a 2025 non-fiction book by linguist Emily M. Bender and sociologist Alex Hanna. It argues that much of what is labeled "artificial intelligence" is a misleading term that obscures ordinary automation while concentrating power in a small number of technology firms. The book was published in May 2025 by Harper in the United States and Bodley Head in the United Kingdom. It was developed alongside the authors' long-running podcast Mystery AI Hype Theater 3000, which critiques exaggerated claims about AI. == Synopsis == The authors present AI as a marketing umbrella that encourages audiences to infer understanding and agency where none exist. They argue readers should treat such language skeptically and to separate specific automated tasks from broad claims of intelligence. The book describes a recurring hype cycle in which corporate narratives justify data and labor extraction, the replacement of human services with cheaper substitutes, and the diversion of attention from present harms to speculative futures. While acknowledging limited uses such as pattern recognition, the authors argue that contemporary systems are best understood as text and media generators shaped by training data and human labor, not as thinking or reasoning entities. A central theme is the social and environmental cost of scaling these systems, including increased energy and water use, the appropriation of creative work for training, and the outsourcing of ghost work to low-paid data workers worldwide. These costs are linked to workplace effects, with the authors arguing that automation rarely eliminates jobs outright and more often degrades them through surveillance, work intensification, and unpaid oversight. As alternatives to passive adoption, the authors propose concrete responses: asking precise questions about what is being automated and why, demanding transparency about data and evaluation, and practicing what they call strategic refusal when deployment conflicts with evidence or values. The book also develops a vocabulary for public debate, rejecting both boosterish and doomerish narratives as grounded in the same assumption that AI is a singular, autonomous force. The authors recommend reading strategies such as favoring trusted human sources over automated summaries and using humor to deflate inflated claims. They describe a link between language to policy and power, arguing that precise terminology can help policymakers and the public resist austerity-driven automation and demand accountability for errors and harms. == Reception == The Guardian praised the book's myth-busting approach and its analysis of how hype erodes cultural and civic life by normalizing synthetic media as a substitute for human judgment. Kirkus Reviews described it as a contrarian account that catalogs concrete risks while cutting through speculative predictions. An interview in Business Insider highlighted the authors' accessible frameworks, including their proposal to describe chatbots as conversation simulators and to evaluate systems in terms of values, labor, and evidence. Coverage in GeekWire emphasized the book's call for resistance through collective bargaining, stronger data rights, and a norm of rejecting deployments that fail basic standards of necessity and evaluation. Some reviews were more critical. A review in LLRX argued that the book's tone could be overly polemical and that it gave limited attention to potential benefits claimed for generative systems. Coverage in the Financial Times, focused on Bender's broader public scholarship, situated the book within her long-standing critique of anthropomorphic narratives about large language models and her advocacy for more democratic oversight of automated systems.
Developmental robotics
Developmental robotics (DevRob), sometimes called epigenetic robotics, is a scientific field which aims at studying the developmental mechanisms, architectures and constraints that allow lifelong and open-ended learning of new skills and new knowledge in embodied machines. As in human children, learning is expected to be cumulative and of progressively increasing complexity, and to result from self-exploration of the world in combination with social interaction. The typical methodological approach consists in starting from theories of human and animal development elaborated in fields such as developmental psychology, neuroscience, developmental and evolutionary biology, and linguistics, then to formalize and implement them in robots, sometimes exploring extensions or variants of them. The experimentation of those models in robots allows researchers to confront them with reality, and as a consequence, developmental robotics also provides feedback and novel hypotheses on theories of human and animal development. Developmental robotics is related to but differs from evolutionary robotics (ER). ER uses populations of robots that evolve over time, whereas DevRob is interested in how the organization of a single robot's control system develops through experience, over time. DevRob is also related to work done in the domains of robotics and artificial life. == Background == Can a robot learn like a child? Can it learn a variety of new skills and new knowledge unspecified at design time and in a partially unknown and changing environment? How can it discover its body and its relationships with the physical and social environment? How can its cognitive capacities continuously develop without the intervention of an engineer once it is "out of the factory"? What can it learn through natural social interactions with humans? These are the questions at the center of developmental robotics. Alan Turing, as well as a number of other pioneers of cybernetics, already formulated those questions and the general approach in 1950, but it is only since the end of the 20th century that they began to be investigated systematically. Because the concept of adaptive intelligent machines is central to developmental robotics, it has relationships with fields such as artificial intelligence, machine learning, cognitive robotics or computational neuroscience. Yet, while it may reuse some of the techniques elaborated in these fields, it differs from them from many perspectives. It differs from classical artificial intelligence because it does not assume the capability of advanced symbolic reasoning and focuses on embodied and situated sensorimotor and social skills rather than on abstract symbolic problems. It differs from cognitive robotics because it focuses on the processes that allow the formation of cognitive capabilities rather than these capabilities themselves. It differs from computational neuroscience because it focuses on functional modeling of integrated architectures of development and learning. More generally, developmental robotics is uniquely characterized by the following three features: It targets task-independent architectures and learning mechanisms, i.e. the machine/robot has to be able to learn new tasks that are unknown by the engineer; It emphasizes open-ended development and lifelong learning, i.e. the capacity of an organism to acquire continuously novel skills. This should not be understood as a capacity for learning "anything" or even “everything”, but just that the set of skills that is acquired can be infinitely extended at least in some (not all) directions; The complexity of acquired knowledge and skills shall increase (and the increase be controlled) progressively. Developmental robotics emerged at the crossroads of several research communities including embodied artificial intelligence, enactive and dynamical systems cognitive science, connectionism. Starting from the essential idea that learning and development happen as the self-organized result of the dynamical interactions among brains, bodies and their physical and social environment, and trying to understand how this self-organization can be harnessed to provide task-independent lifelong learning of skills of increasing complexity, developmental robotics strongly interacts with fields such as developmental psychology, developmental and cognitive neuroscience, developmental biology (embryology), evolutionary biology, and cognitive linguistics. As many of the theories coming from these sciences are verbal and/or descriptive, this implies a crucial formalization and computational modeling activity in developmental robotics. These computational models are then not only used as ways to explore how to build more versatile and adaptive machines but also as a way to evaluate their coherence and possibly explore alternative explanations for understanding biological development. == Research directions == === Skill domains === Due to the general approach and methodology, developmental robotics projects typically focus on having robots develop the same types of skills as human infants. A first category that is important being investigated is the acquisition of sensorimotor skills. These include the discovery of one's own body, including its structure and dynamics such as hand-eye coordination, locomotion, and interaction with objects as well as tool use, with a particular focus on the discovery and learning of affordances. A second category of skills targeted by developmental robots are social and linguistic skills: the acquisition of simple social behavioural games such as turn-taking, coordinated interaction, lexicons, syntax and grammar, and the grounding of these linguistic skills into sensorimotor skills (sometimes referred as symbol grounding). In parallel, the acquisition of associated cognitive skills are being investigated such as the emergence of the self/non-self distinction, the development of attentional capabilities, of categorization systems and higher-level representations of affordances or social constructs, of the emergence of values, empathy, or theories of mind. === Mechanisms and constraints === The sensorimotor and social spaces in which humans and robot live are so large and complex that only a small part of potentially learnable skills can actually be explored and learnt within a life-time. Thus, mechanisms and constraints are necessary to guide developmental organisms in their development and control of the growth of complexity. There are several important families of these guiding mechanisms and constraints which are studied in developmental robotics, all inspired by human development: Motivational systems, generating internal reward signals that drive exploration and learning, which can be of two main types: extrinsic motivations push robots/organisms to maintain basic specific internal properties such as food and water level, physical integrity, or light (e.g. in phototropic systems); intrinsic motivations push robot to search for novelty, challenge, compression or learning progress per se, thus generating what is sometimes called curiosity-driven learning and exploration, or alternatively active learning and exploration; Social guidance: as humans learn a lot by interacting with their peers, developmental robotics investigates mechanisms that can allow robots to participate to human-like social interaction. By perceiving and interpreting social cues, this may allow robots both to learn from humans (through diverse means such as imitation, emulation, stimulus enhancement, demonstration, etc. ...) and to trigger natural human pedagogy. Thus, social acceptance of developmental robots is also investigated; Statistical inference biases and cumulative knowledge/skill reuse: biases characterizing both representations/encodings and inference mechanisms can typically allow considerable improvement of the efficiency of learning and are thus studied. Related to this, mechanisms allowing to infer new knowledge and acquire new skills by reusing previously learnt structures is also an essential field of study; The properties of embodiment, including geometry, materials, or innate motor primitives/synergies often encoded as dynamical systems, can considerably simplify the acquisition of sensorimotor or social skills, and is sometimes referred as morphological computation. The interaction of these constraints with other constraints is an important axis of investigation; Maturational constraints: In human infants, both the body and the neural system grow progressively, rather than being full-fledged already at birth. This implies, for example, that new degrees of freedom, as well as increases of the volume and resolution of available sensorimotor signals, may appear as learning and development unfold. Transposing these mechanisms in developmental robots, and understanding how it may hinder or on the contrary ease the acquisition of novel complex skills is a central questi
VIGRA
VIGRA is the abbreviation for "Vision with Generic Algorithms". It is a free open-source computer vision library which focuses on customizable algorithms and data structures. VIGRA component can be easily adapted to specific needs of target application without compromising execution speed, by using template techniques similar to those in the C++ Standard Template Library. == Features == VIGRA is cross-platform, with working builds on Microsoft Windows, Mac OS X, Linux, and OpenBSD. Since version 1.7.1, VIGRA provides Python bindings based on numpy framework. == History == VIGRA was originally designed and implemented by scientists at University of Hamburg faculty of computer science; its core maintainers are now working at Heidelberg Collaboratory for Image Processing (HCI) University of Heidelberg. In the meantime, many developers have contributed to the project. == Application == CellCognition and ilastik uses VIGRA computer vision library. OpenOffice.org uses VIGRA as part of its headless software rendering backend; LibreOffice does so until version 5.2.
Bayesian network
A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms can perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. == Graphical model == Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Each edge represents a direct conditional dependency. Any pair of nodes that are not connected (i.e. no path connects one node to the other) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if m {\displaystyle m} parent nodes represent m {\displaystyle m} Boolean variables, then the probability function could be represented by a table of 2 m {\displaystyle 2^{m}} entries, one entry for each of the 2 m {\displaystyle 2^{m}} possible parent combinations. Similar ideas may be applied to undirected, and possibly cyclic, graphs such as Markov networks. == Example == Suppose we want to model the dependencies between three variables: the sprinkler (or more appropriately, its state - whether it is on or not), the presence or absence of rain and whether the grass is wet or not. Observe that two events can cause the grass to become wet: an active sprinkler or rain. Rain has a direct effect on the use of the sprinkler (namely that when it rains, the sprinkler usually is not active). This situation can be modeled with a Bayesian network (shown to the right). Each variable has two possible values, T (for true) and F (for false). The joint probability function is, by the chain rule of probability, Pr ( G , S , R ) = Pr ( G ∣ S , R ) Pr ( S ∣ R ) Pr ( R ) {\displaystyle \Pr(G,S,R)=\Pr(G\mid S,R)\Pr(S\mid R)\Pr(R)} where G = "Grass wet (true/false)", S = "Sprinkler turned on (true/false)", and R = "Raining (true/false)". The model can answer questions about the presence of a cause given the presence of an effect (so-called inverse probability) like "What is the probability that it is raining, given the grass is wet?" by using the conditional probability formula and summing over all nuisance variables: Pr ( R = T ∣ G = T ) = Pr ( G = T , R = T ) Pr ( G = T ) = ∑ x ∈ { T , F } Pr ( G = T , S = x , R = T ) ∑ x , y ∈ { T , F } Pr ( G = T , S = x , R = y ) {\displaystyle \Pr(R=T\mid G=T)={\frac {\Pr(G=T,R=T)}{\Pr(G=T)}}={\frac {\sum _{x\in \{T,F\}}\Pr(G=T,S=x,R=T)}{\sum _{x,y\in \{T,F\}}\Pr(G=T,S=x,R=y)}}} Using the expansion for the joint probability function Pr ( G , S , R ) {\displaystyle \Pr(G,S,R)} and the conditional probabilities from the conditional probability tables (CPTs) stated in the diagram, one can evaluate each term in the sums in the numerator and denominator. For example, Pr ( G = T , S = T , R = T ) = Pr ( G = T ∣ S = T , R = T ) Pr ( S = T ∣ R = T ) Pr ( R = T ) = 0.99 × 0.01 × 0.2 = 0.00198. {\displaystyle {\begin{aligned}\Pr(G=T,S=T,R=T)&=\Pr(G=T\mid S=T,R=T)\Pr(S=T\mid R=T)\Pr(R=T)\\&=0.99\times 0.01\times 0.2\\&=0.00198.\end{aligned}}} Then the numerical results (subscripted by the associated variable values) are Pr ( R = T ∣ G = T ) = 0.00198 T T T + 0.1584 T F T 0.00198 T T T + 0.288 T T F + 0.1584 T F T + 0.0 T F F = 891 2491 ≈ 35.77 % . {\displaystyle \Pr(R=T\mid G=T)={\frac {0.00198_{TTT}+0.1584_{TFT}}{0.00198_{TTT}+0.288_{TTF}+0.1584_{TFT}+0.0_{TFF}}}={\frac {891}{2491}}\approx 35.77\%.} To answer an interventional question, such as "What is the probability that it would rain, given that we wet the grass?" the answer is governed by the post-intervention joint distribution function Pr ( S , R ∣ do ( G = T ) ) = Pr ( S ∣ R ) Pr ( R ) {\displaystyle \Pr(S,R\mid {\text{do}}(G=T))=\Pr(S\mid R)\Pr(R)} obtained by removing the factor Pr ( G ∣ S , R ) {\displaystyle \Pr(G\mid S,R)} from the pre-intervention distribution. The do operator forces the value of G to be true. The probability of rain is unaffected by the action: Pr ( R ∣ do ( G = T ) ) = Pr ( R ) . {\displaystyle \Pr(R\mid {\text{do}}(G=T))=\Pr(R).} To predict the impact of turning the sprinkler on: Pr ( R , G ∣ do ( S = T ) ) = Pr ( R ) Pr ( G ∣ R , S = T ) {\displaystyle \Pr(R,G\mid {\text{do}}(S=T))=\Pr(R)\Pr(G\mid R,S=T)} with the term Pr ( S = T ∣ R ) {\displaystyle \Pr(S=T\mid R)} removed, showing that the action affects the grass but not the rain. These predictions may not be feasible given unobserved variables, as in most policy evaluation problems. The effect of the action do ( x ) {\displaystyle {\text{do}}(x)} can still be predicted, however, whenever the back-door criterion is satisfied. It states that, if a set Z of nodes can be observed that d-separates (or blocks) all back-door paths from X to Y then Pr ( Y , Z ∣ do ( x ) ) = Pr ( Y , Z , X = x ) Pr ( X = x ∣ Z ) . {\displaystyle \Pr(Y,Z\mid {\text{do}}(x))={\frac {\Pr(Y,Z,X=x)}{\Pr(X=x\mid Z)}}.} A back-door path is one that ends with an arrow into X. Sets that satisfy the back-door criterion are called "sufficient" or "admissible." For example, the set Z = R is admissible for predicting the effect of S = T on G, because R d-separates the (only) back-door path S ← R → G. However, if S is not observed, no other set d-separates this path and the effect of turning the sprinkler on (S = T) on the grass (G) cannot be predicted from passive observations. In that case P(G | do(S = T)) is not "identified". This reflects the fact that, lacking interventional data, the observed dependence between S and G is due to a causal connection or is spurious (apparent dependence arising from a common cause, R). (see Simpson's paradox) To determine whether a causal relation is identified from an arbitrary Bayesian network with unobserved variables, one can use the three rules of "do-calculus" and test whether all do terms can be removed from the expression of that relation, thus confirming that the desired quantity is estimable from frequency data. Using a Bayesian network can save considerable amounts of memory over exhaustive probability tables, if the dependencies in the joint distribution are sparse. For example, a naive way of storing the conditional probabilities of 10 two-valued variables as a table requires storage space for 2 10 = 1024 {\displaystyle 2^{10}=1024} values. If no variable's local distribution depends on more than three parent variables, the Bayesian network representation stores at most 10 ⋅ 2 3 = 80 {\displaystyle 10\cdot 2^{3}=80} values. One advantage of Bayesian networks is that it is intuitively easier for a human to understand (a sparse set of) direct dependencies and local distributions than complete joint distributions. == Inference and learning == Bayesian networks perform three main inference tasks: Inferring unobserved variables Parameter learning for the probability distributions of each node in the network Structure learning of the graphical network === Inferring unobserved variables === Because a Bayesian network is a complete model for its variables and their relationships, it can be used to answer probabilistic queries about them. For example, the network can be used to update knowledge of the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference. The posterior gives a universal sufficient statistic for detection applications, when choosing values for the variable subset that minimize some expected loss function, for instance the probability of decision error. A Bayesian network can thus be considered a mechanism for automatically applying Bayes' theorem to complex problems. The most common exact inference methods are: variable elimination, which eliminates (by integration or summation) the non-observed non-query variables one by one by distributing the sum over the prod
Polynomial kernel
In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models. Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these. In the context of regression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond to logical conjunctions of input features. == Definition == For degree-d polynomials, the polynomial kernel is defined as K ( x , y ) = ( x T y + c ) d {\displaystyle K(\mathbf {x} ,\mathbf {y} )=(\mathbf {x} ^{\mathsf {T}}\mathbf {y} +c)^{d}} where x and y are vectors of size n in the input space, i.e. vectors of features computed from training or test samples and c ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When c = 0, the kernel is called homogeneous. (A further generalized polykernel divides xTy by a user-specified scalar parameter a.) As a kernel, K corresponds to an inner product in a feature space based on some mapping φ: K ( x , y ) = ⟨ φ ( x ) , φ ( y ) ⟩ {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {y} )\rangle } The nature of φ can be seen from an example. Let d = 2, so we get the special case of the quadratic kernel. After using the multinomial theorem (twice—the outermost application is the binomial theorem) and regrouping, K ( x , y ) = ( ∑ i = 1 n x i y i + c ) 2 = ∑ i = 1 n ( x i 2 ) ( y i 2 ) + ∑ i = 2 n ∑ j = 1 i − 1 ( 2 x i x j ) ( 2 y i y j ) + ∑ i = 1 n ( 2 c x i ) ( 2 c y i ) + c 2 {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\left(\sum _{i=1}^{n}x_{i}y_{i}+c\right)^{2}=\sum _{i=1}^{n}\left(x_{i}^{2}\right)\left(y_{i}^{2}\right)+\sum _{i=2}^{n}\sum _{j=1}^{i-1}\left({\sqrt {2}}x_{i}x_{j}\right)\left({\sqrt {2}}y_{i}y_{j}\right)+\sum _{i=1}^{n}\left({\sqrt {2c}}x_{i}\right)\left({\sqrt {2c}}y_{i}\right)+c^{2}} From this it follows that the feature map is given by: φ ( x ) = ( x n 2 , … , x 1 2 , 2 x n x n − 1 , … , 2 x n x 1 , 2 x n − 1 x n − 2 , … , 2 x n − 1 x 1 , … , 2 x 2 x 1 , 2 c x n , … , 2 c x 1 , c ) {\displaystyle \varphi (x)=\left(x_{n}^{2},\ldots ,x_{1}^{2},{\sqrt {2}}x_{n}x_{n-1},\ldots ,{\sqrt {2}}x_{n}x_{1},{\sqrt {2}}x_{n-1}x_{n-2},\ldots ,{\sqrt {2}}x_{n-1}x_{1},\ldots ,{\sqrt {2}}x_{2}x_{1},{\sqrt {2c}}x_{n},\ldots ,{\sqrt {2c}}x_{1},c\right)} generalizing for ( x T y + c ) d {\displaystyle \left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}} , where x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} , y ∈ R n {\displaystyle \mathbf {y} \in \mathbb {R} ^{n}} and applying the multinomial theorem: ( x T y + c ) d = ∑ j 1 + j 2 + ⋯ + j n + 1 = d d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 d ! j 1 ! ⋯ j n ! j n + 1 ! y 1 j 1 ⋯ y n j n c j n + 1 = φ ( x ) T φ ( y ) {\displaystyle {\begin{alignedat}{2}\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}&=\sum _{j_{1}+j_{2}+\dots +j_{n+1}=d}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}y_{1}^{j_{1}}\cdots y_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\\&=\varphi (\mathbf {x} )^{T}\varphi (\mathbf {y} )\end{alignedat}}} The last summation has l d = ( n + d d ) {\displaystyle l_{d}={\tbinom {n+d}{d}}} elements, so that: φ ( x ) = ( a 1 , … , a l , … , a l d ) {\displaystyle \varphi (\mathbf {x} )=\left(a_{1},\dots ,a_{l},\dots ,a_{l_{d}}\right)} where l = ( j 1 , j 2 , . . . , j n , j n + 1 ) {\displaystyle l=(j_{1},j_{2},...,j_{n},j_{n+1})} and a l = d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 | j 1 + j 2 + ⋯ + j n + j n + 1 = d {\displaystyle a_{l}={\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\quad |\quad j_{1}+j_{2}+\dots +j_{n}+j_{n+1}=d} == Practical use == Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP). The most common degree is d = 2 (quadratic), since larger degrees tend to overfit on NLP problems. Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including: full expansion of the kernel prior to training/testing with a linear SVM, i.e. full computation of the mapping φ as in polynomial regression; basket mining (using a variant of the apriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion; inverted indexing of support vectors. One problem with the polynomial kernel is that it may suffer from numerical instability: when xTy + c < 1, K(x, y) = (xTy + c)d tends to zero with increasing d, whereas when xTy + c > 1, K(x, y) tends to infinity.
Topological deep learning
Topological deep learning (TDL) is a research field that extends deep learning to handle complex, non-Euclidean data structures. Traditional deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), excel in processing data on regular grids and sequences. However, scientific and real-world data often exhibit more intricate data domains encountered in scientific computations, including point clouds, meshes, time series, scalar fields graphs, or general topological spaces like simplicial complexes and CW complexes. TDL addresses this by incorporating topological concepts to process data with higher-order relationships, such as interactions among multiple entities and complex hierarchies. This approach leverages structures like simplicial complexes and hypergraphs to capture global dependencies and qualitative spatial properties, offering a more nuanced representation of data. TDL also encompasses methods from computational and algebraic topology that permit studying properties of neural networks and their training process, such as their predictive performance or generalization properties. The mathematical foundations of TDL are algebraic topology, differential topology, and geometric topology. Therefore, TDL can be generalized for data on differentiable manifolds, knots, links, tangles, curves, etc. == History and motivation == Traditional techniques from deep learning often operate under the assumption that a dataset is residing in a highly-structured space (like images, where convolutional neural networks exhibit outstanding performance over alternative methods) or a Euclidean space. The prevalence of new types of data, in particular graphs, meshes, and molecules, resulted in the development of new techniques, culminating in the field of geometric deep learning, which originally proposed a signal-processing perspective for treating such data types. While originally confined to graphs, where connectivity is defined based on nodes and edges, follow-up work extended concepts to a larger variety of data types, including simplicial complexes and CW complexes, with recent work proposing a unified perspective of message-passing on general combinatorial complexes. An independent perspective on different types of data originated from topological data analysis, which proposed a new framework for describing structural information of data, i.e., their "shape," that is inherently aware of multiple scales in data, ranging from local information to global information. While at first restricted to smaller datasets, subsequent work developed new descriptors that efficiently summarized topological information of datasets to make them available for traditional machine-learning techniques, such as support vector machines or random forests. Such descriptors ranged from new techniques for feature engineering over new ways of providing suitable coordinates for topological descriptors, or the creation of more efficient dissimilarity measures. Contemporary research in this field is largely concerned with either integrating information about the underlying data topology into existing deep-learning models or obtaining novel ways of training on topological domains. == Learning on topological spaces == One of the core concepts in topological deep learning is considering the domain upon which this data is defined and supported. In case of Euclidean data, such as images, this domain is a grid, upon which the pixel value of the image is supported. In a more general setting this domain might be a topological domain. Studying and developing deep learning models that are supported ln topological domains constitute the essence of topological deep learning. Next, we introduce the most common topological domains that are encountered in a deep learning setting. These domains include, but not limited to, graphs, simplicial complexes, cell complexes, combinatorial complexes and hypergraphs. Given a finite set S of abstract entities, a neighborhood function N {\displaystyle {\mathcal {N}}} on S is an assignment that attach to every point x {\displaystyle x} in S a subset of S or a relation. Such a function can be induced by equipping S with an auxiliary structure. Edges provide one way of defining relations among the entities of S. More specifically, edges in a graph allow one to define the notion of neighborhood using, for instance, the one hop neighborhood notion. Edges however, limited in their modeling capacity as they can only be used to model binary relations among entities of S since every edge is connected typically to two entities. In many applications, it is desirable to permit relations that incorporate more than two entities. The idea of using relations that involve more than two entities is central to topological domains. Such higher-order relations allow for a broader range of neighborhood functions to be defined on S to capture multi-way interactions among entities of S. Next we review the main properties, advantages, and disadvantages of some commonly studied topological domains in the context of deep learning, including (abstract) simplicial complexes, regular cell complexes, hypergraphs, and combinatorial complexes. ==== Comparisons among topological domains ==== Each of the enumerated topological domains has its own characteristics, advantages, and limitations: Simplicial complexes Simplest form of higher-order domains. Extensions of graph-based models. Admit hierarchical structures, making them suitable for various applications. Hodge theory can be naturally defined on simplicial complexes. Require relations to be subsets of larger relations, imposing constraints on the structure. Cell Complexes Generalize simplicial complexes. Provide more flexibility in defining higher-order relations. Each cell in a cell complex is homeomorphic to an open ball, attached together via attaching maps. Boundary cells of each cell in a cell complex are also cells in the complex. Represented combinatorially via incidence matrices. Hypergraphs Allow arbitrary set-type relations among entities. Relations are not imposed by other relations, providing more flexibility. Do not explicitly encode the dimension of cells or relations. Useful when relations in the data do not adhere to constraints imposed by other models like simplicial and cell complexes. Combinatorial Complexes : Generalize and bridge the gaps between simplicial complexes, cell complexes, and hypergraphs. Allow for hierarchical structures and set-type relations. Combine features of other complexes while providing more flexibility in modeling relations. Can be represented combinatorially, similar to cell complexes. ==== Hierarchical structure and set-type relations ==== The properties of simplicial complexes, cell complexes, and hypergraphs give rise to two main features of relations on higher-order domains, namely hierarchies of relations and set-type relations. ===== Rank function ===== A rank function on a higher-order domain X is an order-preserving function rk: X → Z, where rk(x) attaches a non-negative integer value to each relation x in X, preserving set inclusion in X. Cell and simplicial complexes are common examples of higher-order domains equipped with rank functions and therefore with hierarchies of relations. ===== Set-type relations ===== Relations in a higher-order domain are called set-type relations if the existence of a relation is not implied by another relation in the domain. Hypergraphs constitute examples of higher-order domains equipped with set-type relations. Given the modeling limitations of simplicial complexes, cell complexes, and hypergraphs, we develop the combinatorial complex, a higher-order domain that features both hierarchies of relations and set-type relations. The learning tasks in TDL can be broadly classified into three categories: Cell classification: Predict targets for each cell in a complex. Examples include triangular mesh segmentation, where the task is to predict the class of each face or edge in a given mesh. Complex classification: Predict targets for an entire complex. For example, predict the class of each input mesh. Cell prediction: Predict properties of cell-cell interactions in a complex, and in some cases, predict whether a cell exists in the complex. An example is the prediction of linkages among entities in hyperedges of a hypergraph. In practice, to perform the aforementioned tasks, deep learning models designed for specific topological spaces must be constructed and implemented. These models, known as topological neural networks, are tailored to operate effectively within these spaces. === Topological neural networks === Central to TDL are topological neural networks (TNNs), specialized architectures designed to operate on data structured in topological domains. Unlike traditional neural networks tailored for grid-like structures, TNNs are adept at handling more intricate data representations, such as graphs
Ho–Kashyap algorithm
The Ho–Kashyap algorithm is an iterative method in machine learning for finding a linear decision boundary that separates two linearly separable classes. It was developed by Yu-Chi Ho and Rangasami L. Kashyap in 1965, and usually presented as a problem in linear programming. == Setup == Given a training set consisting of samples from two classes, the Ho–Kashyap algorithm seeks to find a weight vector w {\displaystyle \mathbf {w} } and a margin vector b {\displaystyle \mathbf {b} } such that: Y w = b {\displaystyle \mathbf {Yw} =\mathbf {b} } where Y {\displaystyle \mathbf {Y} } is the augmented data matrix with samples from both classes (with appropriate sign conventions, e.g., samples from class 2 are negated), w {\displaystyle \mathbf {w} } is the weight vector to be determined, and b {\displaystyle \mathbf {b} } is a positive margin vector. The algorithm minimizes the criterion function: J ( w , b ) = | | Y w − b | | 2 {\displaystyle J(\mathbf {w} ,\mathbf {b} )=||\mathbf {Yw} -\mathbf {b} ||^{2}} subject to the constraint that b > 0 {\displaystyle \mathbf {b} >\mathbf {0} } (element-wise). Given a problem of linearly separating two classes, we consider a dataset of elements { ( x i , y i ) } i ∈ 1 : N {\displaystyle \{(\mathbf {x_{i}} ,y_{i})\}_{i\in 1:N}} where y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} . Linearly separating them by a perceptron is equivalent to finding weight and bias w , b {\displaystyle \mathbf {w} ,b} for a perceptron, such that: [ y 1 x 1 1 ⋮ ⋮ y N x N 1 ] [ w b ] > 0 {\displaystyle {\begin{bmatrix}y_{1}\mathbf {x} _{1}&1\\\vdots &\vdots \\y_{N}\mathbf {x} _{N}&1\\\end{bmatrix}}{\begin{bmatrix}\mathbf {w} \\b\end{bmatrix}}>0} == Algorithm == The idea of the Ho–Kashyap algorithm is as follows: Given any b {\displaystyle \mathbf {b} } , the corresponding w {\displaystyle \mathbf {w} } is known: It is simply w = Y + b {\displaystyle \mathbf {w} =\mathbf {Y} ^{+}\mathbf {b} } , where Y + {\displaystyle \mathbf {Y} ^{+}} denotes the Moore–Penrose pseudoinverse of Y {\displaystyle \mathbf {Y} } . Therefore, it only remains to find b {\displaystyle \mathbf {b} } by gradient descent. However, the gradient descent may sometimes decrease some of the coordinates of b {\displaystyle \mathbf {b} } , which may cause some coordinates of b {\displaystyle \mathbf {b} } to become negative, which is undesirable. Therefore, whenever some coordinates of b {\displaystyle \mathbf {b} } would have decreased, those coordinates are unchanged instead. As for the coordinates of b {\displaystyle \mathbf {b} } that would increase, those would increase without issue. Formally, the algorithm is as follows: Initialization: Set b ( 0 ) {\displaystyle \mathbf {b} (0)} to an arbitrary positive vector, typically b ( 0 ) = 1 {\displaystyle \mathbf {b} (0)=\mathbf {1} } (a vector of ones). Set the iteration counter k = 0 {\displaystyle k=0} . Set w ( 0 ) = Y + b ( 0 ) {\displaystyle \mathbf {w} (0)=\mathbf {Y} ^{+}\mathbf {b} (0)} Loop until convergence, or until iteration counter exceeds some k m a x {\displaystyle k_{max}} . Error calculation: Compute the error vector: e ( k ) = Y w ( k ) − b ( k ) {\displaystyle \mathbf {e} (k)=\mathbf {Yw} (k)-\mathbf {b} (k)} . Margin update: Update the margin vector: b ( k + 1 ) = b ( k ) + 2 η k ( e ( k ) + | e ( k ) | ) {\displaystyle \mathbf {b} (k+1)=\mathbf {b} (k)+2\eta _{k}(\mathbf {e} (k)+|\mathbf {e} (k)|)} where η k {\displaystyle \eta _{k}} is a positive learning rate parameter, and | e ( k ) | {\displaystyle |\mathbf {e} (k)|} denotes the element-wise absolute value. Weight calculation: Compute the weight vector using the pseudoinverse: w ( k + 1 ) = Y + b ( k + 1 ) {\displaystyle \mathbf {w} (k+1)=\mathbf {Y} ^{+}\mathbf {b} (k+1)} . Convergence check: If | | e ( k ) | | ≤ θ {\displaystyle ||\mathbf {e} (k)||\leq \theta } for some predetermined threshold θ {\displaystyle \theta } (close to zero), then return b ( k + 1 ) , w ( k + 1 ) {\displaystyle \mathbf {b} (k+1),\mathbf {w} (k+1)} . if e ( k ) ≤ 0 {\displaystyle \mathbf {e} (k)\leq \mathbf {0} } (all components non-positive), return "Samples not separable.". Return "Algorithm failed to converge in time.". == Properties == If the training data is linearly separable, the algorithm converges to a solution (where e ( k ) = 0 {\displaystyle \mathbf {e} (k)=\mathbf {0} } ) in a finite number of iterations. If the data is not linearly separable, the algorithm may or may not ever reach the point where e ( k ) = 0 {\displaystyle \mathbf {e} (k)=\mathbf {0} } . However, if it does happen that e ( k ) ≤ 0 {\displaystyle \mathbf {e} (k)\leq \mathbf {0} } at some iteration, this proves non-separability. The convergence rate depends on the choice of the learning rate parameter ρ {\displaystyle \rho } and the degree of linear separability of the data. == Relationship to other algorithms == Perceptron algorithm: Both seek linear separators. The perceptron updates weights incrementally based on individual misclassified samples, while Ho–Kashyap is a batch method that processes all samples to compute the pseudoinverse and updates based on an overall error vector. Linear discriminant analysis (LDA): LDA assumes underlying Gaussian distributions with equal covariances for the classes and derives the decision boundary from these statistical assumptions. Ho–Kashyap makes no explicit distributional assumptions and instead tries to solve a system of linear inequalities directly. Support vector machines (SVM): For linearly separable data, SVMs aim to find the maximum-margin hyperplane. The Ho–Kashyap algorithm finds a separating hyperplane but not necessarily the one with the maximum margin. If the data is not separable, soft-margin SVMs allow for some misclassifications by optimizing a trade-off between margin size and misclassification penalty, while Ho–Kashyap provides a least-squares solution. == Variants == Modified Ho–Kashyap algorithm changes weight calculation step w ( k + 1 ) = Y + b ( k + 1 ) {\displaystyle \mathbf {w} (k+1)=\mathbf {Y} ^{+}\mathbf {b} (k+1)} to w ( k + 1 ) = w ( k ) + η k Y + | e ( k ) | {\displaystyle \mathbf {w} (k+1)=\mathbf {w} (k)+\eta _{k}\mathbf {Y} ^{+}|\mathbf {e} (k)|} . Kernel Ho–Kashyap algorithm: Applies kernel methods (the "kernel trick") to the Ho–Kashyap framework to enable non-linear classification by implicitly mapping data to a higher-dimensional feature space.