Grokking (machine learning)

In machine learning, grokking, or delayed generalization, is a phenomenon observed in some settings where a model abruptly transitions from overfitting (performing well only on training data) to generalizing (performing well on both training and test data), after many training iterations with little or no improvement on the held-out data. This contrasts with what is typically observed in machine learning, where generalization occurs gradually alongside improved performance on training data. == Origin == Grokking was introduced by OpenAI researcher Alethea Power and colleagues in the January 2022 paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets". It is derived from the word grok coined by Robert Heinlein in his novel Stranger in a Strange Land. In ML research, "grokking" is not used as a synonym for "generalization"; rather, it names a sometimes-observed delayed‑generalization training phenomenon in which training and held‑out performance do not improve in tandem, and in which held‑out performance rises abruptly later. Authors also analyze the "grokking time", the epoch or step at which this transition occurs in those scenarios. == Interpretations == Grokking can be understood as a phase transition during the training process. In particular, recent work has shown that grokking may be due to a complexity phase transition in the model during training. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research. One potential explanation is that the weight decay (a component of the loss function that penalizes higher values of the neural network parameters, also called regularization) slightly favors the general solution that involves lower weight values, but that is also harder to find. 
According to Neel Nanda, the process of learning the general solution may be gradual, even though the transition to the general solution occurs more suddenly later. Recent theories have hypothesized that grokking occurs when neural networks transition from a "lazy training" regime where the weights do not deviate far from initialization, to a "rich" regime where weights abruptly begin to move in task-relevant directions. Follow-up empirical and theoretical work has accumulated evidence in support of this perspective, and it offers a unifying view of earlier work as the transition from lazy to rich training dynamics is known to arise from properties of adaptive optimizers, weight decay, initial parameter weight norm, and more. This perspective is complementary to a unifying "pattern learning speeds" framework that links grokking and double descent; within this view, delayed generalization can arise across training time ("epoch‑wise") or across model size ("model‑wise"), and the authors report "model‑wise grokking".
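The "grokking time" mentioned above can be read off logged accuracy curves. The following is a minimal sketch, not a method from the cited papers: the function names, the 0.99 threshold, and the synthetic curves are all assumptions chosen for illustration.

```python
def first_step_reaching(curve, threshold):
    """Return the first index at which the curve is >= threshold, or None."""
    for step, value in enumerate(curve):
        if value >= threshold:
            return step
    return None


def grokking_delay(train_acc, test_acc, threshold=0.99):
    """Return (fit_step, grok_step, delay) for per-step accuracy curves.

    fit_step  : first step where training accuracy reaches the threshold
    grok_step : first step where held-out accuracy reaches the threshold
    delay     : grok_step - fit_step (large values suggest delayed generalization)
    Returns None if either curve never reaches the threshold.
    """
    fit_step = first_step_reaching(train_acc, threshold)
    grok_step = first_step_reaching(test_acc, threshold)
    if fit_step is None or grok_step is None:
        return None
    return fit_step, grok_step, grok_step - fit_step


# Synthetic (made-up) curves: training accuracy saturates early,
# held-out accuracy stays low and then rises abruptly.
train_curve = [0.2, 0.6, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
test_curve = [0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.9, 1.0]
result = grokking_delay(train_curve, test_curve)
```

In practice the curves would come from a real training log, and the threshold is a modelling choice rather than a standard value.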

Reciprocal human machine learning

Reciprocal Human Machine Learning (RHML) is an interdisciplinary approach to designing human-AI interaction systems. RHML aims to enable continual learning between humans and machine learning models by having them learn from each other. This approach keeps the human expert "in the loop" to oversee and enhance machine learning performance while simultaneously supporting the human expert's continued learning. == Background == RHML emerged in the context of the rise of big data analytics and artificial intelligence for intelligent tasks like sense-making and decision-making. As machine learning advanced to take on more roles, researchers realized that fully autonomous systems had limitations and needed human guidance. RHML extends the concept of human-in-the-loop systems by promoting reciprocal learning. Humans learn from their interactions with machine learning models, staying up to date on evolving technology. The models also learn from human feedback and oversight. This amplification of learning on both sides is a key focus of RHML. The approach draws on theories of learning in dyads from education and psychology. It also builds on human-computer interaction and human-centered design principles. Implementing RHML requires developing specialized tools and interfaces tailored to the application. == Applications == RHML has been explored across diverse domains, including: cybersecurity (software to enable reciprocal learning between experts and AI models for social media threat detection); organizational decision-making (RHML to structure collaboration between humans and AI systems); workplace training (using RHML for workers to learn from AI technologies on the job); open science (using human and AI collaboration to promote open science); and production and logistics (turning workers and intelligent machines into teammates). RHML maintains human oversight and control over AI systems, while enabling cutting-edge machine learning performance. 
This collaborative approach highlights the importance of keeping the human expert involved in the loop. An example of RHML in application is Free Spirit (AFSFCV), an open-source architecture first published in early 2025 as a whitepaper, proposing a visually structured approach to intent-based human–AI interaction.

Framework Convention on Artificial Intelligence

The Framework Convention on Artificial Intelligence and Human Rights, Democracy and the Rule of Law (also called Framework Convention on Artificial Intelligence or AI convention) is an international treaty on artificial intelligence. It was adopted under the auspices of the Council of Europe (CoE) and signed on 5 September 2024. The treaty aims to ensure that the development and use of AI technologies align with fundamental human rights, democratic values, and the rule of law, addressing risks such as misinformation, algorithmic discrimination, and threats to public institutions. More than 50 countries, including the EU member states, have endorsed the Framework Convention on Artificial Intelligence. == Background == The development of the Framework Convention on AI emerged in response to growing concerns over the ethical, legal, and societal impacts of artificial intelligence. The Council of Europe, which has historically played a key role in setting human rights standards across Europe, initiated discussions on AI governance in 2020, leading to the drafting of a binding legal framework. The process of creating the Framework Convention began in 2019 with the ad hoc Committee on Artificial Intelligence (CAHAI) assessing the feasibility of the instrument. In 2022, the Committee on Artificial Intelligence (CAI) took over the process, drafting and negotiating the text of the Convention. The treaty is designed to complement existing international human rights instruments, including the European Convention on Human Rights and the Convention for the Protection of Individuals with regard to Automatic Processing of Personal Data. == Structure and content == The Convention establishes fundamental principles for AI governance, including transparency, accountability, non-discrimination, and human rights protection through eight chapters and 26 articles. 
Adopted in 2024, the treaty addresses AI governance through seven core principles and detailed implementation mechanisms. It mandates risk and impact assessments to mitigate potential harms and provides safeguards such as the right to challenge AI-driven decisions. It applies to public authorities and private entities acting on their behalf but excludes national security and defense activities. Implementation is overseen by a Conference of the Parties, ensuring compliance and international cooperation. Activities within the AI system lifecycle must adhere to the seven fundamental principles, ensuring compliance with human rights, democracy, and the rule of law. The treaty also establishes remedies, procedural rights and safeguards, and risk and impact management requirements to promote accountability, transparency, and responsible AI development. The treaty consists of eight chapters. Chapter I contains general provisions. Chapter II states the general obligation to protect human rights and the integrity of democratic processes and to respect the rule of law. The main principles and rights are contained in Chapter III, which consists of Articles 6 to 13. Chapter IV (Articles 14 to 15) sets up the legal remedies. Chapter V sets out the risk and impact management framework. Chapter VI covers the implementation of the treaty. Chapter VII sets the co-operation and oversight mechanisms. Chapter VIII contains various concluding clauses. Article 1 declares the objective of the treaty: to ensure that activities within the lifecycle of artificial intelligence systems are fully consistent with human rights, democracy and the rule of law. == Entry into force == The treaty will enter into force on the first day of the month following the expiration of a period of three months after the date on which five signatories, including at least three member states of the Council of Europe, have ratified it. 
== Competing approaches == While the CoE's AI Convention represents a multilateral effort to regulate AI through a human rights-based approach, alternative frameworks have also been proposed. One notable example is the Munich Draft for a Convention on AI, Data and Human Rights, an initiative led by legal scholars and policymakers in Germany. The Munich Draft advocates for stronger safeguards against AI-related risks, emphasizing stricter data protection measures, accountability for AI developers, and explicit prohibitions on high-risk AI applications, such as mass surveillance and autonomous lethal weapons. Unlike the CoE convention, which focuses on balancing innovation with regulation, the Munich Draft takes a more precautionary stance, calling for tighter controls over AI deployment in sensitive domains. Other competing international efforts include the OECD’s AI Principles, the GPAI (Global Partnership on AI), and the European Union's AI Act, each of which offers different regulatory strategies to govern AI at regional and global levels. == Signatories == Signatories include Andorra, Canada, the European Union, Georgia, Iceland, Israel, Japan, Liechtenstein, the Republic of Moldova, Montenegro, Norway, San Marino, Switzerland, Ukraine, the United Kingdom, the United States, and Uruguay. == Endorsement == The treaty was widely endorsed by leading AI policy experts, including Stuart J. Russell, Virginia Dignum, Emma Ruttkamp-Bloem, Pascal Pichonnaz, Maria Helen Murphy, Angella Ndaka, Hannes Werthner, Katja Langenbucher, Gry Hasselbalch, Ricardo Baeza-Yates, Kutoma Wakunuma, Gianclaudio Malgieri, Oreste Pollicino, Nagla Rizk, Giovanni Sartor, Lee Tiedrich, Ingrid Schneider, Eduardo Bertoni, Garry Kasparov, Merve Hickok, and Marc Rotenberg. 
The treaty was also endorsed by notable political leaders, including Theodoros Roussopoulos, President of the Parliamentary Assembly of the Council of Europe, and Christopher Holmes, Member of the House of Lords of the United Kingdom, as well as by the International Bar Association (IBA) and, personally, by Almudena Arpón de Mendívil, President of the IBA. The Center for AI and Digital Policy (CAIDP) has been campaigning to promote endorsement of the treaty by urging countries to sign and ratify it. The CAIDP further urged countries to make a clear and firm commitment to ensure the full inclusion of the private sector under the treaty’s provisions.

Semantic triple

A semantic triple, or RDF triple or simply triple, is the atomic data entity in the Resource Description Framework (RDF) data model. As its name indicates, a triple is a sequence of three entities that codifies a statement about semantic data in the form of subject–predicate–object expressions (e.g., "Bob is 35", or "Bob knows John"). == Subject, predicate and object == This format enables knowledge to be represented in a machine-readable way. In particular, every part of an RDF triple is individually addressable via unique URIs. For example, the statement "Bob knows John" might be represented in RDF as: http://example.name#BobSmith12 http://xmlns.com/foaf/spec/#term_knows http://example.name#JohnDoe34. Given this precise representation, semantic data can be unambiguously queried and reasoned about. The components of a triple, such as the statement "The sky has the color blue", consist of a subject ("the sky"), a predicate ("has the color"), and an object ("blue"). This is similar to the classical notation of an entity–attribute–value model within object-oriented design, where this example would be expressed as an entity (sky), an attribute (color) and a value (blue). From this basic structure, triples can be composed into more complex models by using triples as objects or subjects of other triples, for example Mike → said → (triples → can be → objects). Given their particular, consistent structure, collections of triples are often stored in purpose-built databases called triplestores. == Difference from relational databases == A relational database is the classical form of information storage, working with different tables, which consist of rows. The query language SQL is used to retrieve information from such a database. In contrast, RDF triple storage works with logical predicates. Neither tables nor rows are needed; the information can be stored in a text file. An RDF triple store can be converted into an SQL database and vice versa. 
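One simple way to picture the conversion between a triple collection and an SQL database is a single three-column table. The sketch below uses Python's standard `sqlite3` module; the table name `triples`, the column names, and the helper function names are assumptions for illustration, and real triplestores use more elaborate schemas and indexes.

```python
import sqlite3


def triples_to_sql(triples):
    """Load (subject, predicate, object) tuples into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()
    return conn


def sql_to_triples(conn):
    """Read the table back into a list of tuples, in insertion order."""
    rows = conn.execute(
        "SELECT subject, predicate, object FROM triples ORDER BY rowid"
    )
    return [tuple(row) for row in rows]


data = [
    ("sky", "has the color", "blue"),
    ("Bob", "knows", "John"),
]
conn = triples_to_sql(data)

# An ordinary SQL query over the triple table.
colors = conn.execute(
    "SELECT object FROM triples WHERE subject = ? AND predicate = ?",
    ("sky", "has the color"),
).fetchall()
round_trip = sql_to_triples(conn)
```

The round trip returns the same tuples that were inserted, which is the sense in which the two representations are interchangeable for this simple case.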
If the knowledge is highly unstructured and dedicated tables aren't flexible enough, semantic triples are used over classic relational storage. In contrast to a traditional SQL database, an RDF triple store isn't created with a table editor. The preferred tool is a knowledge editor, for example Protégé. Protégé looks similar to an object-oriented modeling application used for software engineering, but it's focused on natural language information. The RDF triples are aggregated into a knowledge base, which allows external parsers to run requests. Possible applications include the creation of non-player characters within video games. == Limitations == One concern about triple storage is its lack of database scalability. This problem is especially pertinent if millions of triples are stored and retrieved in a database. The seek time is larger than for classical SQL-based databases. A more complex issue is a knowledge model's inability to predict future states. Even if all the domain knowledge is available as logical predicates, the model fails in answering what-if questions. For example, suppose in the RDF format a room with a robot and table is described. The robot knows what the location of the table is, is aware of the distance to the table and knows also that a table is a type of furniture. Before the robot can plan its next action, it needs temporal reasoning capabilities. Thus, the knowledge model should answer hypothetical questions in advance before an action is taken.
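The subject–predicate–object structure and the idea of running requests against a collection of triples can be sketched with plain Python tuples. This is an illustrative toy, not an RDF library: the `match` function and the use of `None` as a wildcard are assumptions, while the URIs and example statements are taken from the text above.

```python
BOB = "http://example.name#BobSmith12"
KNOWS = "http://xmlns.com/foaf/spec/#term_knows"
JOHN = "http://example.name#JohnDoe34"

# Each triple is a (subject, predicate, object) tuple.
store = [
    (BOB, KNOWS, JOHN),
    ("sky", "has the color", "blue"),
    # A triple used as the object of another triple.
    ("Mike", "said", ("triples", "can be", "objects")),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None matches anything."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


who_bob_knows = match(store, subject=BOB, predicate=KNOWS)
sky_facts = match(store, subject="sky")
mike_said = match(store, subject="Mike")[0][2]
```

A linear scan like this also hints at the scalability concern noted above: without indexes, every query touches every triple.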

OpenVINO

OpenVINO is an open-source software toolkit developed by Intel for optimizing and deploying deep learning models. It supports several popular model formats and categories, such as large language models, computer vision, and generative AI. OpenVINO is optimized for Intel hardware, but also offers support for ARM/ARM64 processors. It is widely used in AI sound processing drivers in combination with Intel's Gaussian & Neural Accelerator (GNA). Written in C++, it also provides APIs for C and Python, as well as Node.js (in early preview). OpenVINO is cross-platform and free to use under the Apache License 2.0. == Workflow == The simplest OpenVINO usage involves obtaining a model and running it as is. For the best results, however, a more complete workflow is suggested: (1) obtain a model in one of the supported frameworks; (2) convert the model to OpenVINO IR using the OpenVINO Converter tool; (3) optimize the model, using training-time or post-training options provided by OpenVINO's NNCF; (4) execute inference with OpenVINO Runtime, specifying one of several inference modes. == OpenVINO model format == OpenVINO IR is the default format used to run inference. It is saved as a set of two files, .bin and .xml, containing weights and topology, respectively. It is obtained by converting a model from one of the supported frameworks, using the application's API or a dedicated converter. Models of the supported formats may also be used for inference directly, without prior conversion to OpenVINO IR. Such an approach is more convenient but offers fewer optimization options and lower performance, since the conversion is performed automatically before inference. Some pre-converted models can be found in the Hugging Face repository. The supported model formats are: PyTorch, TensorFlow, TensorFlow Lite, ONNX (including formats that may be serialized to ONNX), PaddlePaddle, and JAX/Flax. == OS support == OpenVINO runs on Windows, Linux and macOS.
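The convert-then-infer workflow described above might look roughly as follows in Python. This is a hedged sketch: it assumes a recent `openvino` package (2023.1 or later) is installed, the helper names `ir_file_pair`, `convert_to_ir`, and `run_inference` are hypothetical, and the NNCF optimization step is omitted. The OpenVINO import is done lazily so that the path helper can be used without the package.

```python
from pathlib import Path


def ir_file_pair(xml_path):
    """Return the (.xml topology, .bin weights) paths that make up one IR model."""
    xml = Path(xml_path)
    return xml, xml.with_suffix(".bin")


def convert_to_ir(source_model_path, xml_path):
    """Convert a model file (for example ONNX) to OpenVINO IR and save it."""
    import openvino as ov  # lazy import: only needed when converting

    ov_model = ov.convert_model(source_model_path)
    ov.save_model(ov_model, str(xml_path))  # writes the .xml and its .bin
    return ir_file_pair(xml_path)


def run_inference(xml_path, input_array, device="CPU"):
    """Read an IR model, compile it for a device, and run one inference."""
    import openvino as ov  # lazy import: only needed when running inference

    core = ov.Core()
    model = core.read_model(str(xml_path))
    compiled = core.compile_model(model, device)
    results = compiled([input_array])      # synchronous inference, single input
    return results[compiled.output(0)]     # value of the first model output


# The IR model is a pair of files sharing one base name.
xml_file, bin_file = ir_file_pair("models/example.xml")
```

The file name `models/example.xml` is a placeholder; `device` could be another device name supported by the local installation.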

Cooliris (plugin)

Cooliris (for Desktop), formerly known as PicLens, was a web browser extension developed by Cooliris, Inc., and later acquired by Yahoo. The plugin provided an interactive 3D-like experience for viewing digital images and videos from the web and from desktop applications. The software placed a small icon atop image thumbnails that appeared on a webpage. Clicking on the icon loaded the Cooliris 3D Wall, a browsing environment that gave the user the effect of flying through a three-dimensional space. The plugin was released to the public in January 2008, and The New York Times described Cooliris as the "new immersive approach to Web navigation". Cooliris went on to win the 2008 Crunchies Award for Best Design. The plugin received over 50 million downloads. As of May 2014, browser plugins are unavailable from the official website, which links only to tablet apps for iOS and Android.

Linear belief function

Linear belief functions are an extension of the Dempster–Shafer theory of belief functions to the case when variables of interest are continuous. Examples of such variables include financial asset prices, portfolio performance, and other antecedent and consequent variables. The theory was originally proposed by Arthur P. Dempster in the context of Kalman Filters and later was elaborated, refined, and applied to knowledge representation in artificial intelligence and decision making in finance and accounting by Liping Liu. == Concept == A linear belief function intends to represent our belief regarding the location of the true value as follows: We are certain that the truth is on a so-called certainty hyperplane but we do not know its exact location; along some dimensions of the certainty hyperplane, we believe the true value could be anywhere from –∞ to +∞ and the probability of being at a particular location is described by a normal distribution; along other dimensions, our knowledge is vacuous, i.e., the true value is somewhere from –∞ to +∞ but the associated probability is unknown. A belief function in general is defined by a mass function over a class of focal elements, which may have nonempty intersections. A linear belief function is a special type of belief function in the sense that its focal elements are exclusive, parallel sub-hyperplanes over the certainty hyperplane and its mass function is a normal distribution across the sub-hyperplanes. Based on the above geometrical description, Shafer and Liu propose two mathematical representations of a LBF: a wide-sense inner product and a linear functional in the variable space, and as their duals over a hyperplane in the sample space. Monney proposes still another structure called Gaussian hints. Although these representations are mathematically neat, they tend to be unsuitable for knowledge representation in expert systems. 
== Knowledge representation == A linear belief function can represent both logical and probabilistic knowledge for three types of variables: deterministic, such as an observable or controllable variable; random, whose distribution is normal; and vacuous, on which no knowledge bears. Logical knowledge is represented by linear equations, or geometrically, a certainty hyperplane. Probabilistic knowledge is represented by a normal distribution across all parallel focal elements. In general, assume X is a vector of multiple normal variables with mean μ and covariance Σ. Then, the multivariate normal distribution can be equivalently represented as a moment matrix: $$M(X)=\begin{pmatrix}\mu \\ \Sigma \end{pmatrix}.$$ If the distribution is non-degenerate, i.e., Σ has full rank and its inverse exists, the moment matrix can be fully swept: $$M(\vec{X})=\begin{pmatrix}\mu \Sigma^{-1}\\ -\Sigma^{-1}\end{pmatrix}$$ Except for the normalization constant, the above equation completely determines the normal density function for X. Therefore, $M(\vec{X})$ represents the probability distribution of X in the potential form. These two simple matrices allow us to represent three special cases of linear belief functions. First, an ordinary normal probability distribution is represented by M(X). Second, suppose one makes a direct observation on X and obtains a value μ. In this case, since there is no uncertainty, both variance and covariance vanish, i.e., Σ = 0. Thus, a direct observation can be represented as: $$M(X)=\begin{pmatrix}\mu \\ 0\end{pmatrix}$$ Third, suppose one is completely ignorant about X. This is a very thorny case in Bayesian statistics since the density function does not exist. 
By using the fully swept moment matrix, we represent the vacuous linear belief function as a zero matrix in the swept form as follows: $$M(\vec{X})=\begin{bmatrix}0\\ 0\end{bmatrix}$$ One way to understand the representation is to imagine complete ignorance as the limiting case when the variance of X approaches ∞, where one can show that $\Sigma^{-1}=0$ and hence $M(\vec{X})$ vanishes. However, the above equation is not the same as an improper prior or a normal distribution with infinite variance. In fact, it does not correspond to any unique probability distribution. For this reason, a better way is to understand the vacuous linear belief function as the neutral element for combination (see later). To represent the remaining three special cases, we need the concept of partial sweeping. Unlike a full sweeping, a partial sweeping is a transformation on a subset of variables. Suppose X and Y are two vectors of normal variables with the joint moment matrix: $$M(X,Y)=\begin{bmatrix}\mu_{1}&\mu_{2}\\ \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{bmatrix}$$ Then M(X, Y) may be partially swept. 
For example, we can define the partial sweeping on X as follows: $$M(\vec{X},Y)=\begin{bmatrix}\mu_{1}(\Sigma_{11})^{-1}&\mu_{2}-\mu_{1}(\Sigma_{11})^{-1}\Sigma_{12}\\ -(\Sigma_{11})^{-1}&(\Sigma_{11})^{-1}\Sigma_{12}\\ \Sigma_{21}(\Sigma_{11})^{-1}&\Sigma_{22}-\Sigma_{21}(\Sigma_{11})^{-1}\Sigma_{12}\end{bmatrix}$$ If X is one-dimensional, a partial sweeping replaces the variance of X by its negative inverse and multiplies the inverse with other elements. If X is multidimensional, the operation involves the inverse of the covariance matrix of X and other multiplications. A swept matrix obtained from a partial sweeping on a subset of variables can be equivalently obtained by a sequence of partial sweepings on each individual variable in the subset and the order of the sequence does not matter. Similarly, a fully swept matrix is the result of partial sweepings on all variables. We can make two observations. First, after the partial sweeping on X, the mean vector and covariance matrix of X are respectively $\mu_{1}(\Sigma_{11})^{-1}$ and $-(\Sigma_{11})^{-1}$, which are the same as that of a full sweeping of the marginal moment matrix of X. Thus, the elements corresponding to X in the above partial sweeping equation represent the marginal distribution of X in potential form. 
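The partial sweeping on a single variable, as defined above, can be sketched in plain Python. This is an illustrative sketch only: the function name `sweep`, the list-based layout (a mean row stored separately from the covariance matrix), and the numeric values are assumptions, and the pivot variance is assumed to be nonzero.

```python
def sweep(mean, cov, k):
    """Partially sweep the moment matrix (mean row over covariance) on variable k.

    The variance of variable k is replaced by its negative inverse, row and
    column k are multiplied by the inverse, and all other entries are reduced
    by the usual Schur-complement term.
    """
    n = len(mean)
    pivot = cov[k][k]
    if pivot == 0:
        raise ValueError("cannot sweep on a variable with zero variance")

    new_mean = [
        mean[k] / pivot if j == k else mean[j] - mean[k] * cov[k][j] / pivot
        for j in range(n)
    ]
    new_cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                new_cov[i][j] = -1.0 / pivot
            elif i == k or j == k:
                new_cov[i][j] = cov[i][j] / pivot
            else:
                new_cov[i][j] = cov[i][j] - cov[i][k] * cov[k][j] / pivot
    return new_mean, new_cov


# Hypothetical example: X is variable 0 and Y is variable 1.
mu = [1.0, 2.0]
sigma = [[2.0, 1.0], [1.0, 3.0]]

partial_mean, partial_cov = sweep(mu, sigma, 0)          # sweep on X only
full_xy = sweep(partial_mean, partial_cov, 1)            # then on Y
full_yx = sweep(*sweep(mu, sigma, 1), 0)                 # Y first, then X
```

With these numbers, sweeping on X gives the mean row (0.5, 1.5) and the matrix rows (-0.5, 0.5) and (0.5, 2.5), matching the formula term by term. Sweeping on both variables in either order yields the fully swept form, i.e. μΣ⁻¹ and −Σ⁻¹, up to floating-point rounding.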
Second, according to statistics, $\mu_{2}-\mu_{1}(\Sigma_{11})^{-1}\Sigma_{12}$ is the conditional mean of Y given X = 0; $\Sigma_{22}-\Sigma_{21}(\Sigma_{11})^{-1}\Sigma_{12}$ is the conditional covariance matrix of Y given X = 0; and $(\Sigma_{11})^{-1}\Sigma_{12}$ is the slope of the regression model of Y on X. Therefore, the elements corresponding to Y indices and the intersection of X and Y in $M(\vec{X},Y)$ represent the conditional distribution of Y given X = 0. These semantics render the partial sweeping operation a useful method for manipulating multivariate normal distributions. They also form the basis of the moment matrix representations for the three remaining important cases of linear belief functions, including proper belief functions, linear equations, and linear regression models. === Proper linear belief functions === For variables X and Y, assume there exists a piece of evidence justifying a normal distribution for variables Y while bearing no opinions for variables X. Also, assume that X and Y are not perfectly linearly related, i.e., their correlation is less than 1. This case involves a mix of an ordinary normal distribution for Y and a vacuous belief function for X. Thus, we represent it using a partially swept matrix as follows: $$M(\vec{X},Y)=\begin{bmatrix}0&\mu_{2}\\ 0&0\\ 0&\Sigma_{22}\end{bmatrix}$$ This representation can be understood as follows. Since we are ignorant about X, we use its swept form and set $\mu_{1}(\Sigma_{11})^{-1}=0$ and $-(\Sigma_{11})^{-1}=0$. 
Since the correlation between X and Y is less than 1, the regression coefficient of Y on X approaches 0 when the variance of X approaches ∞. Therefore, $(\Sigma_{11})^{-1}\Sigma_{12}=0$. Similarly, one can prove that $\mu_{1}(\Sigma_{11})^{-1}\Sigma_{12}=0$ and Σ 21 ( Σ 11 ) −