AI App Editing Video

AI App Editing Video — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Contrastive Language-Image Pre-training

    Contrastive Language-Image Pre-training

    Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This method has enabled broad applications across multiple domains, including cross-modal retrieval, text-to-image generation, and aesthetic ranking. == Algorithm == The CLIP method trains a pair of models contrastively. One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart. To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of N {\displaystyle N} image-caption pairs. Let the outputs from the text and image models be respectively v 1 , . . . , v N , w 1 , . . . , w N {\displaystyle v_{1},...,v_{N},w_{1},...,w_{N}} . Two vectors are considered "similar" if their dot product is large. The loss incurred on this batch is the multi-class N-pair loss, which is a symmetric cross-entropy loss over similarity scores: − 1 N ∑ i ln ⁡ e v i ⋅ w i / T ∑ j e v i ⋅ w j / T − 1 N ∑ j ln ⁡ e v j ⋅ w j / T ∑ i e v i ⋅ w j / T {\displaystyle -{\frac {1}{N}}\sum _{i}\ln {\frac {e^{v_{i}\cdot w_{i}/T}}{\sum _{j}e^{v_{i}\cdot w_{j}/T}}}-{\frac {1}{N}}\sum _{j}\ln {\frac {e^{v_{j}\cdot w_{j}/T}}{\sum _{i}e^{v_{i}\cdot w_{j}/T}}}} In essence, this loss function encourages the dot product between matching image and text vectors ( v i ⋅ w i {\displaystyle v_{i}\cdot w_{i}} ) to be high, while discouraging high dot products between non-matching pairs. The parameter T > 0 {\displaystyle T>0} is the temperature, which is parameterized in the original CLIP model as T = e − τ {\displaystyle T=e^{-\tau }} where τ ∈ R {\displaystyle \tau \in \mathbb {R} } is a learned parameter. Other loss functions are possible. For example, Sigmoid CLIP (SigLIP) proposes the following loss function: L = 1 N ∑ i , j ∈ 1 : N f ( ( 2 δ i , j − 1 ) ( e τ w i ⋅ v j + b ) ) {\displaystyle L={\frac {1}{N}}\sum _{i,j\in 1:N}f((2\delta _{i,j}-1)(e^{\tau }w_{i}\cdot v_{j}+b))} where f ( x ) = ln ⁡ ( 1 + e − x ) {\displaystyle f(x)=\ln(1+e^{-x})} is the negative log sigmoid loss, and the Dirac delta symbol δ i , j {\displaystyle \delta _{i,j}} is 1 if i = j {\displaystyle i=j} else 0. == CLIP models == While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well. === Image model === The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer. The size indicator ranges from B, L, H, G (base, large, huge, giant), in that order. Other than ViT, the image model is typically a convolutional neural network, such as ResNet (in the original series by OpenAI), or ConvNeXt (in the OpenCLIP model series by LAION). Since the output vectors of the image model and the text model must have exactly the same length, both the image model and the text model have fixed-length vector outputs, which in the original report is called "embedding dimension". For example, in the original OpenAI model, the ResNet models have embedding dimensions ranging from 512 to 1024, and for the ViTs, from 512 to 768. Its implementation of ViT was the same as the original one, with one modification: after position embeddings are added to the initial patch embeddings, there is a LayerNorm. Its implementation of ResNet was the same as the original one, with 3 modifications: In the start of the CNN (the "stem"), they used three stacked 3x3 convolutions instead of a single 7x7 convolution, as suggested by. There is an average pooling of stride 2 at the start of each downsampling convolutional layer (they called it rect-2 blur pooling according to the terminology of ). This has the effect of blurring images before downsampling, for antialiasing. The final convolutional layer is followed by a multiheaded attention pooling. ALIGN a model with similar capabilities, trained by researchers from Google used EfficientNet, a kind of convolutional neural network. === Text model === The text encoding models used in CLIP are typically Transformers. In the original OpenAI report, they reported using a Transformer (63M-parameter, 12-layer, 512-wide, 8 attention heads) with lower-cased byte pair encoding (BPE) with 49152 vocabulary size. Context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention. Its architecture is the same as GPT-2. Like BERT, the text sequence is bracketed by two special tokens [SOS] and [EOS] ("start of sequence" and "end of sequence"). Take the activations of the highest layer of the transformer on the [EOS], apply LayerNorm, then a final linear map. This is the text encoding of the input sequence. The final linear map has output dimension equal to the embedding dimension of whatever image encoder it is paired with. These models all had context length 77 and vocabulary size 49408. ALIGN used BERT of various sizes. == Dataset == === WebImageText === The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data. The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query. The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets. The dataset is private and has not been released to the public, and there is no further information on it. ==== Data preprocessing ==== For the CLIP image models, the input images are preprocessed by first dividing each of the R, G, B values of an image by the maximum possible value, so that these values fall between 0 and 1, then subtracting by [0.48145466, 0.4578275, 0.40821073], and dividing by [0.26862954, 0.26130258, 0.27577711]. The rationale was that these are the mean and standard deviations of the images in the WebImageText dataset, so this preprocessing step roughly whitens the image tensor. These numbers slightly differ from the standard preprocessing for ImageNet, which uses [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]. If the input image does not have the same resolution as the native resolution (224×224 for all except ViT-L/14@336px, which has 336×336 resolution), then the input image is first scaled by bicubic interpolation, so that its shorter side is the same as the native resolution, then the central square of the image is cropped out. === Others === ALIGN used over one billion image-text pairs, obtained by extracting images and their alt-tags from online crawling. The method was described as similar to how the Conceptual Captions dataset was constructed, but instead of complex filtering, they only applied a frequency-based filtering. Later models trained by other organizations had published datasets. For example, LAION trained OpenCLIP with published datasets LAION-400M, LAION-2B, and DataComp-1B. == Training == In the original OpenAI CLIP report, they reported training 5 ResNet and 3 ViT (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs. The largest ViT model took 12 days on 256 V100 GPUs. All ViT models were trained on 224×224 image resolution. The ViT-L/14 was then boosted to 336×336 resolution by FixRes, resulting in a model. They found this was the best-performing model. In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs for a total of 32B samples seen. == Applications == === Cross-modal retrieval === CLIP's cross-modal retrieval enables the alignment of visual and textual data in a shared latent space, allowing users to retrieve images based on text descriptions and vice versa, without the need for explicit image annotations. In text-to-image retrieval, users input descriptive text, and CLIP retrieves images with matching embeddings. In image-to-text retrieval, images are used to find related text content. CLIP’s ability to connect vis

    Read more →
  • Ideonomy

    Ideonomy

    Ideonomy is a combinatorial "science of ideas" developed by American independent scholar Patrick M. Gunkel (1947–2017). Specifically, Ideonomy is concerned with the systematic organization of ideas and the discovery of the rules behind how ideas combine, diverge, and transform. Gunkel defined ideonomy as "the science of the laws of ideas and of the application of such laws to the generation of all possible ideas in connection with any subject, idea, or thing." In his 1992 book A History of Knowledge, Charles Van Doren compared ideonomy to a "mining operation" that excavates meanings and thought to discover treasures hidden deep within language. Sources from the 1980s and 1990s demonstrate that ideonomy was useful to academic researchers in fields including biology, toxicology, and nursing/patient care. Beginning in the 2010s, academics in a wide range of fields including machine learning, marketing, computational modeling, and cybersecurity have relied on materials generated for ideonomy to provide methodological support for their research. == Etymology and definition == The word "ideonomy" combines the Greek roots ideo- (from idea, meaning pattern or form) and -nomy (from nomos, meaning law or custom). The suffix -nomy suggests the laws concerning or the totality of knowledge about a given subject, as in astronomy or taxonomy. In a note posted on the MIT ideonomy website, Gunkel states that the word was supposedly first coined by the French Encyclopedists to refer to a science of ideas. No evidence is provided for this statement, however. The concept bears some relationship to Antoine Destutt de Tracy's "ideology" (1796), which originally meant a systematic science of ideas before acquiring its modern political connotations. Gunkel provided several metaphorical descriptions of ideonomy: An "idea bank": a computer network enabling systematic exploration of infinite possible ideas A "kaleidoscope" that can exhibit all possible combinations and transformations of ideas A "prism" capable of diffracting any idea into its cognitive components A "gigantic microscope for magnifying the ideocosm" == History and development == In 1984, Gunkel received a five-year unsolicited grant from the Richard Lounsbery Foundation of New York to develop ideonomy. A June 1, 1987 article on the front page of The Wall Street Journal brought Gunkel and ideonomy to wider public attention. Some academics were interested in using ideonomy's techniques, including biologist Betsey Dyer, who published several contemporaneous peer-reviewed studies citing ideonomy. Academic researchers in the field of toxicology and nursing/patient care also used ideonomy. However, ideonomy's broadest contribution to date came beginning in the 2010s, as a list of personality traits generated for combinatorial matching was used by researchers in artificial intelligence to code human emotions for machine-learning tasks, develop computational models related to personality, develop a measurement framework for influencer-brand recommender systems, and aid information awareness/cybersecurity assessment. == Methodology == The foundational empirical method of ideonomy involves the systematic creation of extensive lists. Gunkel's apartment reportedly contained thousands of lists on every conceivable topic. Gunkel termed each list an "organon," which he described as expanding through "combination, permutation, transformation, generalization, specialization, intersection, interaction, reapplication, recursive use, etc. of existing organons." The ideonomic process follows a progressive structure. The ideonomist begins with a simple list of examples of a particular idea, concept, or thing. The list need not be exhaustive. By studying this list, the ideonomist isolates and identifies types. This categorical analysis then reveals missing items, allowing the primary list to be improved and refined. Gunkel emphasized that list items must not only cover genuine categories of nature but also be formulated in ways that yield the largest possible number of syntactically coherent possibilities when combined. The core technique of ideonomy is "ideocombinatorics"—the systematic intersection and combination of items from different lists to generate novel composite concepts. Gunkel developed computer programs to automate this process. For example, combining a list of 230 Universal Elementary Shapes (pits, pyramids, trenches, hemispheres, needles) with a list of 74 Types of Order (recurrence, identity, likeness of parts) yields 17,020 possible "shapes of order." These combinations, when phrased as questions ("Can there be pits of recurrence?"), could suggest new categories of phenomena worthy of investigation. The computer-generated output is typically repetitive and often meaningless. However, with sufficient frequency, the combinations yield results that are unexpectedly interesting and fruitful. In one documented case, Gunkel's programs generated 45,540 questions about toxins for microbiologist David Bermudes. One question—"Can hierarchies of cell process be used as a basis for classifying toxic action?"—prompted Bermudes to develop a novel approach to classifying biological toxins by the type of molecule they attack, rather than by chemical structure or physiological system affected. According to one contemporaneous account of ideonomy, "Gunkel takes for his field all fields and all ideas about anything. He uses a computer to generate lists of words and phrases and by juxtaposition reviews the resultant patterns for novel ideas. The computer is ideal for this task because the mind would rebel at the formidable processing task ideonomy involves. What we have here is computer generated originality." == Applications == Gunkel and his supporters identified several practical applications for ideonomic methods: Scientific research: Biologist Betsey Dyer of Wheaton College published research crediting ideonomy for helping to generate ideas. Medical science: When Austin pathologist Michael T. O'Brien was presented with the ideonomically-generated question "Can arteries have rashes?", he initially dismissed it as nonsense. Upon reflection, he realized that large arteries are supplied with blood by tiny vessels that might become inflamed and dilated, analogous to skin vessels in a rash—a phenomenon potentially worth researching. Analogical thinking: Harvard law professor Robert Clark used ideonomic analogies to write a research paper comparing plant structure with human hierarchies. Artificial intelligence: Douglas Lenat, a researcher at Microelectronics and Computer Technology Corporation (MCC) in Austin, suggested that Gunkel's lists enumerating types of human mistakes could help design AI systems capable of recognizing and correcting their own errors. == Reception and criticism == Ideonomy received mixed reactions from the academic and scientific communities. Prominent supporters included: Edward Fredkin, former director of MIT's computer science laboratory, who praised Gunkel's "provocative ideas on artificial intelligence." Marvin Minsky, AI scientist and MIT professor, who described ideonomy as "perhaps the most extensive study of ways to generate ideas." Frederick Seitz, president emeritus of Rockefeller University, who noted Gunkel's "encyclopedic scope" Robert C. Clark, Harvard law professor, who called Gunkel "the most intelligent person I ever met" However, skeptics questioned whether ideonomy constituted a genuine science. Fredkin himself noted that Gunkel "pours out about 60 ideas a minute, and 59 of them are bad," though he added that "even with one good idea out of 60, it's still an amazing accomplishment." Douglas Lenat observed that brainstorming with Gunkel was "a bit like being hit over the head by the muse with a sledgehammer" and that "he puts people off." Gunkel himself acknowledged that ideonomy was in its infancy and might seem "absurdly utopian." His planned magnum opus on ideonomy remained incomplete, and was posted on an MIT website thanks to faculty advisor Whitman Richards. Gunkel wrote: "Pioneering in a completely new field, yes in a new science, is almost unreal. It is heartbreaking, it is pitiable, it is almost inhuman. Honestly, it is a hell. There is nothing heroic about it." == Related concepts == Gunkel identified several historical precedents for ideonomic thinking: Gottfried Wilhelm Leibniz (1646–1716): The philosopher's work on a universal characteristic (characteristica universalis) and calculus of reasoning Peter Mark Roget (1779–1869): Creator of Roget's Thesaurus, which organized concepts into a systematic taxonomy Dmitri Mendeleev (1834–1907): Developer of the periodic table, demonstrating how combining lists of element families could reveal previously unseen connections Fritz Zwicky (1898–1974): The Caltech astrophysicist whom Gunkel called the "grandfather of ideonomy" for his development of "morphological research"—systematic exploration of all possible solutions t

    Read more →
  • Artificial general intelligence

    Artificial general intelligence

    Artificial general intelligence (AGI) is a hypothetical type of artificial intelligence that matches or surpasses human capabilities across virtually all cognitive tasks. Beyond AGI, artificial superintelligence (ASI) would outperform the best human abilities across every domain by a wide margin. Unlike artificial narrow intelligence (ANI), whose competence is confined to well‑defined tasks, an AGI system can generalise knowledge, transfer skills between domains, and solve novel problems without task‑specific reprogramming. Creating AGI is a stated goal of technology companies such as OpenAI, Google, xAI, and Meta. A 2020 survey identified 72 active AGI research and development projects across 37 countries. AGI is a common topic in science fiction and futures studies. Contention exists over whether AGI represents an existential risk. Some AI experts and industry figures have stated that mitigating the risk of human extinction posed by AGI should be a global priority. Others find the development of AGI to be in too remote a stage to present such a risk. == Terminology == AGI is also known as strong AI, full AI, human-level AI, human-level intelligent AI, or general intelligent action. The term "artificial general intelligence" was used in 1997 by Mark Gubrud in a discussion of the implications of fully automated military production and operations. A mathematical formalism of AGI named AIXI was proposed in 2000 by Marcus Hutter, who defines intelligence as "an agent’s ability to achieve goals or succeed in a wide range of environments". This type of AGI has also been called "universal artificial intelligence". The term AGI was re-introduced and popularized by Shane Legg and Ben Goertzel around 2002. Some academic sources reserve the term "strong AI" for computer programs that will experience sentience or consciousness. In contrast, weak AI (or narrow AI) can solve a specific problem but lacks general cognitive abilities. Some academic sources use "weak AI" to refer more broadly to any programs that neither experience consciousness nor have a mind in the same sense as humans. Related concepts include artificial superintelligence and transformative AI. An artificial superintelligence (ASI) is a hypothetical type of AGI that is much more generally intelligent than humans, while the notion of transformative AI relates to AI having a large impact on society, for example, similar to the agricultural or industrial revolution. A framework for classifying AGI was proposed in 2023 by Google DeepMind researchers. They define five performance levels of AGI: emerging, competent, expert, virtuoso, and superhuman. For example, a competent AGI is defined as an AI that outperforms 50% of skilled adults in a wide range of non-physical tasks, and a superhuman AGI (i.e., an artificial superintelligence) is similarly defined but with a threshold of 100%. They consider large language models like ChatGPT or LLaMA 2 to be instances of emerging AGI (comparable to unskilled humans). Regarding the autonomy of AGI and associated risks, they define five levels: tool (fully in human control), consultant, collaborator, expert, and agent (fully autonomous). == Characteristics == There is no single agreed-upon definition of intelligence as applied to computers. Computer scientist John McCarthy wrote in 2007: "We cannot yet characterize in general what kinds of computational procedures we want to call intelligent." === Intelligence traits === Researchers generally hold that a system is required to do all of the following to be regarded as an AGI: reason, use strategy, solve puzzles, and make judgments under uncertainty, represent knowledge, including common sense knowledge, plan, learn, communicate in natural language, if necessary, integrate these skills in completion of any given goal. Many interdisciplinary approaches (e.g. cognitive science, computational intelligence, and decision making) consider additional traits such as imagination (the ability to form novel mental images and concepts) and autonomy. Computer-based systems exhibiting these capabilities are now widespread, with modern large language models demonstrating computational creativity, automated reasoning, and decision support simultaneously across domains. === Physical traits === Other capabilities are considered desirable in intelligent systems, as they may affect intelligence or aid in its expression. These include: the ability to sense (e.g. see, hear, etc.), and the ability to act (e.g. move and manipulate objects, change location to explore, etc.) This includes the ability to detect and respond to hazard. === Tests for human-level AGI === Several tests meant to confirm human-level AGI have been considered. ==== Turing test ==== The Turing test was proposed by Alan Turing in his 1950 paper "Computing Machinery and Intelligence". This test involves a human judge engaging in natural language conversations with both a human and a machine designed to generate human-like responses. The machine passes the test if it can convince the judge that it is human a significant fraction of the time. Turing proposed this as a practical measure of machine intelligence, focusing on the ability to produce human-like responses rather than on the internal workings of the machine. The idea of the test is that the machine has to try and pretend to be a man, by answering questions put to it, and it will only pass if the pretence is reasonably convincing. A considerable portion of a jury, who should not be experts about machines, must be taken in by the pretence. In 2014, a chatbot named Eugene Goostman, designed to imitate a 13-year-old Ukrainian boy, reportedly passed a Turing Test event by convincing 33% of judges that it was human. However, this claim was met with significant skepticism from the AI research community, who questioned the test's implementation and its relevance to AGI. A 2025 pre‑registered, three‑party Turing‑test study by Cameron R. Jones and Benjamin K. Bergen showed that GPT-4.5 was judged to be the human in 73% of five‑minute text conversations—surpassing the 67% humanness rate of real confederates and meeting the researchers' criterion for having passed the test. ==== Ikea test ==== The "Ikea test", also known as the Flat Pack Furniture Test, involves an AI controlling a robot which attempts to assemble an Ikea flat-pack furniture product after having been shown the parts and instructions. As early as 2013, MIT's IkeaBot demonstrated fully autonomous multi-robot assembly of an IKEA Lack table in ten minutes, with no human intervention and no pre-programmed assembly instructions. The robots inferred the assembly sequence from the geometry of the parts alone. ==== Coffee test ==== Steve Wozniak proposed a test where a machine is required to enter an average American home and figure out how to make coffee. It must find the coffee machine, find the coffee, add water, find a mug, and brew the coffee by pushing the proper buttons. This test has been substantially approached across multiple systems. In January 2024, Figure AI's Figure 01 humanoid learned to operate a Keurig coffee machine autonomously after watching video demonstrations, using end-to-end neural networks to translate visual input into motor actions. In 2025, researchers at the University of Edinburgh published the ELLMER framework in Nature Machine Intelligence, demonstrating a robotic arm that interprets verbal instructions, analyses its surroundings, and autonomously makes coffee in dynamic kitchen environments — adapting to unforeseen obstacles in real time rather than following pre-programmed sequences. ==== Suleyman's test ==== Mustafa Suleyman's test proposes giving an AI model US$100,000 and asking it to obtain US$1 million. ==== Use of video-games ==== Adams, et al. propose that the ability to learn and succeed in a wide range of video games can be used to test AI intelligence. This range would include games unknown to the AGI developers before the test is administered. === AI-complete problems === A problem is informally called "AI-complete" or "AI-hard" if it is believed that AGI would be needed to solve it, because the solution is beyond the capabilities of a purpose-specific algorithm. == History == === Classical AI === Modern AI research began in the mid-1950s. The first generation of AI researchers were convinced that artificial general intelligence was possible and that it would exist in just a few decades. AI pioneer Herbert A. Simon wrote in 1965: "machines will be capable, within twenty years, of doing any work a man can do". Their predictions were the inspiration for Stanley Kubrick and Arthur C. Clarke's fictional character HAL 9000, who embodied what AI researchers believed they could create by the year 2001. AI pioneer Marvin Minsky was a consultant on the project of making HAL 9000 as realistic as possible according to the consensus predictions of the time. He said in 1967, "Within a generation... the problem of

    Read more →
  • Personality computing

    Personality computing

    Personality computing is a research field related to artificial intelligence and personality psychology that studies personality by means of computational techniques from different sources, including text, multimedia, and social networks. == Overview == Personality computing addresses three main problems involving personality: automatic personality recognition, perception, and synthesis. Automatic personality recognition is the inference of the personality type of target individuals from their digital footprint. Automatic personality perception is the inference of the personality attributed by an observer to a target individual based on some observable behavior. Automatic personality synthesis is the generation of the style or behaviour of artificial personalities in Avatars and virtual agents. Self-assessed personality tests or observer ratings are always exploited as the ground truth for testing and validating the performance of artificial intelligence algorithms for the automatic prediction of personality types. There is a wide variety of personality tests, such as the Myers Briggs Type Indicator (MBTI) or the MMPI, but the most used are tests based on the Five Factor Model such as the Revised NEO Personality Inventory. Personality computing can be considered as an extension or complement of Affective computing, where the former focuses on personality traits and the latter on affective states. A further extension of the two fields is Character Computing which combines various character states and traits including but not limited to personality and affect. == History == Personality computing began around 2005 with the pioneering research in personality recognition by Shlomo Argamon and later by François Mairesse. These works showed that personality traits could be inferred with reasonable accuracy from text, such as blogs, self-presentations, and email addresses. In 2008, the concept of "portable personality" for the distributed management of personality profiles has been developed. A few years later, research began in personality recognition and perception from multimodal and social signals, such as recorded meetings and voice calls. In the 2010s, the research focused mainly on personality recognition and perception from social media, helped by the first workshops organized by Fabio Celli. In particular personality was extracted from Facebook, Twitter and Instagram. In the same years, automatic personality synthesis helped improve the coherence of simulated behavior in virtual agents. Scientific works by Michal Kosinski demonstrated the validity of Personality Computing from different digital footprints, in particular from user preferences such as Facebook page likes, showed that machines can recognize personality better than humans and raised a warning against Cambridge Analytica and misuse of this kind of technology. == Applications == Personality computing techniques, in particular personality recognition and perception, have applications in Social media marketing, where they can help reducing the cost of advertising campaigns through psychological targeting.

    Read more →
  • Color normalization

    Color normalization

    Color normalization is a topic in computer vision concerned with artificial color vision and object recognition. In general, the distribution of color values in an image depends on the illumination, which may vary depending on lighting conditions, cameras, and other factors. Color normalization allows for object recognition techniques based on color to compensate for these variations. == Main concepts == === Color constancy === Color constancy is a feature of the human internal model of perception, which provides humans with the ability to assign a relatively constant color to objects even under different illumination conditions. This is helpful for object recognition as well as identification of light sources in an environment. For example, humans see an object approximately as the same color when the sun is bright or when the sun is dim. === Applications === Color normalization has been used for object recognition on color images in the field of robotics, bioinformatics and general artificial intelligence, when it is important to remove all intensity values from the image while preserving color values. One example is in case of a scene shot by a surveillance camera over the day, where it is important to remove shadows or lighting changes on same color pixels and recognize the people that passed. Another example is automated screening tools used for the detection of diabetic retinopathy as well as molecular diagnosis of cancer states, where it is important to include color information during classification. == Known issues == The main issue about certain applications of color normalization is that the result looks unnatural or too distant from the original colors. In cases where there is a subtle variation between important aspects, this can be problematic. More specifically, the side effect can be that pixels become divergent and not reflect the actual color value of the image. A way of combating this issue is to use color normalization in combination with thresholding to correctly and consistently segment a colored image. == Transformations and algorithms == There is a vast array of different transformations and algorithms for achieving color normalization and a limited list is presented here. The performance of an algorithm is dependent on the task and one algorithm which performs better than another in one task might perform worse in another (no free lunch theorem). Additionally, the choice of the algorithm depends on the preferences of the user for the end-result, e.g. they may want a more natural-looking color image. === Grey world === The grey world normalization makes the assumption that changes in the lighting spectrum can be modelled by three constant factors applied to the red, green and blue channels of color. More specifically, a change in illuminated color can be modelled as a scaling α, β and γ in the R, G and B color channels and as such the grey world algorithm is invariant to illumination color variations. Therefore, a constancy solution can be achieved by dividing each color channel by its average value as shown in the following formula: ( α R , β G , γ B ) → ( α R α n ∑ i R , β G β n ∑ i G , γ B γ n ∑ i B ) {\displaystyle \left(\alpha R,\beta G,\gamma B\right)\rightarrow \left({\frac {\alpha R}{{\frac {\alpha }{n}}\sum _{i}R}},{\frac {\beta G}{{\frac {\beta }{n}}\sum _{i}G}},{\frac {\gamma B}{{\frac {\gamma }{n}}\sum _{i}B}}\right)} As mentioned above, grey world color normalization is invariant to illuminated color variations α, β and γ, however it has one important problem: it does not account for all variations of illumination intensity and it is not dynamic; when new objects appear in the scene it fails. To solve this problem there are several variants of the grey world algorithm. Additionally there is an iterative variation of the grey world normalization, however it was not found to perform significantly better. === Histogram equalization === Histogram equalization is a non-linear transform which maintains pixel rank and is capable of normalizing for any monotonically increasing color transform function. It is considered to be a more powerful normalization transformation than the grey world method. The results of histogram equalization tend to have an exaggerated blue channel and look unnatural, due to the fact that in most images the distribution of the pixel values is usually more similar to a Gaussian distribution, rather than uniform. === Histogram specification === Histogram specification transforms the red, green and blue histograms to match the shapes of three specific histograms, rather than simply equalizing them. It refers to a class of image transforms which aims to obtain images of which the histograms have a desired shape. As specified, firstly it is necessary to convert the image so that it has a particular histogram. Assume an image x. The following formula is the equalization transform of this image: y = f ( x ) = ∫ 0 x p x ( u ) d u {\displaystyle y=f(x)=\int \limits _{0}^{x}p_{x}(u)du} Then assume wanted image z. The equalization transform of this image is: y ′ = g ( z ) = ∫ 0 z p z ( u ) d u {\displaystyle y'=g(z)=\int \limits _{0}^{z}p_{z}(u)du} Of course p z ( u ) {\displaystyle p_{z}(u)} is the histogram of the output image. The formula to find the inverse of the above transform is: z = g − 1 ( y ′ ) {\displaystyle z=g^{-1}(y')} Therefore, since images y and y' have the same equalized histogram they are actually the same image, meaning y = y' and the transform from the given image x to the wanted image z is: z = g − 1 ( y ′ ) = g − 1 ( y ) = g − 1 ( f ( x ) ) {\displaystyle z=g^{-1}(y')=g^{-1}(y)=g^{-1}(f(x))} Histogram specification has the advantage of producing more realistic looking images, as it does not exaggerate the blue channel like histogram equalization. === Comprehensive Color Normalization === The comprehensive color normalization is shown to increase localization and object classification results in combination with color indexing. It is an iterative algorithm which works in two stages. The first stage is to use the red, green and blue color space with the intensity normalized, to normalize each pixel. The second stage is to normalize each color channel separately, so that the sum of the color components is equal to one third of the number of pixels. The iterations continue until convergence, meaning no additional changes. Formally: Normalize the color image f ( t ) = [ f i j ( t ) ] i = 1... N , j = 1... M {\displaystyle f^{(t)}=[f_{ij}^{(t)}]_{i=1...N,j=1...M}} which consists of color vectors f i j ( t ) = ( r i j ( t ) , g i j ( t ) , b i j ( t ) ) T . {\displaystyle f_{ij}^{(t)}=(r_{ij}^{(t)},g_{ij}^{(t)},b_{ij}^{(t)})^{T}.} For the first step explained above, compute: S i j := r i j ( t ) + g i j ( t ) + b i j ( t ) {\displaystyle S_{ij}:=r_{ij}^{(t)}+g_{ij}^{(t)}+b_{ij}^{(t)}} which leads to r i j ( t + 1 ) = r i j ( t ) S i j , g i j ( t + 1 ) = g i j ( t ) S i j {\displaystyle r_{ij}^{(t+1)}={\frac {r_{ij}^{(t)}}{S_{ij}}},g_{ij}^{(t+1)}={\frac {g_{ij}^{(t)}}{S_{ij}}}} and b i j ( t + 1 ) = b i j ( t ) S i j . {\displaystyle b_{ij}^{(t+1)}={\frac {b_{ij}^{(t)}}{S_{ij}}}.} For the second step explained above, compute: r ′ = 3 N M ∑ i = 1 N ∑ j = 1 M r i j ( t + 1 ) {\displaystyle r'={\frac {3}{NM}}\sum _{i=1}^{N}\sum _{j=1}^{M}r_{ij}^{(t+1)}} and normalize r i j ( t + 2 ) = r i j ( t + 1 ) r ′ . {\displaystyle r_{ij}^{(t+2)}={\frac {r_{ij}^{(t+1)}}{r'}}.} Of course the same process is done for b' and g'. Then these two steps are repeated until the changes between iteration t and t+2 are less than some set threshold. Comprehensive color normalization, just like the histogram equalization method previously mentioned, produces results that may look less natural due to the reduction in the number of color values.

    Read more →
  • Exploration–exploitation dilemma

    Exploration–exploitation dilemma

    The exploration–exploitation dilemma, also known as the explore–exploit tradeoff, is a fundamental concept in decision-making that arises in many domains. It is depicted as the balancing act between two opposing strategies. Exploitation involves choosing the best option based on current knowledge of the system (which may be incomplete or misleading), while exploration involves trying out new options that may lead to better outcomes in the future at the expense of an exploitation opportunity. Finding the optimal balance between these two strategies is a crucial challenge in many decision-making problems whose goal is to maximize long-term benefits. == Application in machine learning == In the context of machine learning, the exploration–exploitation tradeoff is fundamental in reinforcement learning (RL), a type of machine learning that involves training agents to make decisions based on feedback from the environment. Crucially, this feedback may be incomplete or delayed. The agent must decide whether to exploit the current best-known policy or explore new policies to improve its performance. === Multi-armed bandit methods === The multi-armed bandit (MAB) problem was a classic example of the tradeoff, and many methods were developed for it, such as epsilon-greedy, Thompson sampling, and the upper confidence bound (UCB). See the page on MAB for details. In more complex RL situations than the MAB problem, the agent can treat each choice as a MAB, where the payoff is the expected future reward. For example, if the agent performs an epsilon-greedy method, then the agent will often "pull the best lever" by picking the action that had the best predicted expected reward (exploit). However, it would pick a random action with probability epsilon (explore). Monte Carlo tree search, for example, uses a variant of the UCB method. === Exploration problems === There are some problems that make exploration difficult. Sparse reward. If rewards occur only once a long while, then the agent might not persist in exploring. Furthermore, if the space of actions is large, then the sparse reward would mean the agent would not be guided by the reward to find a good direction for deeper exploration. A standard example is Montezuma's Revenge. Deceptive reward. If some early actions give immediate small reward, but other actions give later large reward, then the agent might be lured away from exploring the other actions. Noisy TV problem. If certain observations are irreducibly noisy (such as a television showing random images), then the agent might be trapped exploring those observations (watching the television). === Exploration reward === This section based on. The exploration reward (also called exploration bonus) methods convert the exploration-exploitation dilemma into a balance of exploitations. That is, instead of trying to get the agent to balance exploration and exploitation, exploration is simply treated as another form of exploitation, and the agent simply attempts to maximize the sum of rewards from exploration and exploitation. The exploration reward can be treated as a form of intrinsic reward. We write these as r t i , r t e {\displaystyle r_{t}^{i},r_{t}^{e}} , meaning the intrinsic and extrinsic rewards at time step t {\displaystyle t} . However, exploration reward is different from exploitation in two regards: The reward of exploitation is not freely chosen, but given by the environment, but the reward of exploration may be picked freely. Indeed, there are many different ways to design r t i {\displaystyle r_{t}^{i}} described below. The reward of exploitation is usually stationary (i.e. the same action in the same state gives the same reward), but the reward of exploration is non-stationary (i.e. the same action in the same state should give less and less reward). Count-based exploration uses N n ( s ) {\displaystyle N_{n}(s)} , the number of visits to a state s {\displaystyle s} during the time-steps 1 : n {\displaystyle 1:n} , to calculate the exploration reward. This is only possible in small and discrete state space. Density-based exploration extends count-based exploration by using a density model ρ n ( s ) {\displaystyle \rho _{n}(s)} . The idea is that, if a state has been visited, then nearby states are also partly-visited. In maximum entropy exploration, the entropy of the agent's policy π {\displaystyle \pi } is included as a term in the intrinsic reward. That is, r t i = − ∑ a π ( a | s t ) ln ⁡ π ( a | s t ) + ⋯ {\displaystyle r_{t}^{i}=-\sum _{a}\pi (a|s_{t})\ln \pi (a|s_{t})+\cdots } . === Prediction-based === This section based on. The forward dynamics model is a function for predicting the next state based on the current state and the current action: f : ( s t , a t ) ↦ s t + 1 {\displaystyle f:(s_{t},a_{t})\mapsto s_{t+1}} . The forward dynamics model is trained as the agent plays. The model becomes better at predicting state transition for state-action pairs that had been done many times. A forward dynamics model can define an exploration reward by r t i = ‖ f ( s t , a t ) − s t + 1 ‖ 2 2 {\displaystyle r_{t}^{i}=\|f(s_{t},a_{t})-s_{t+1}\|_{2}^{2}} . That is, the reward is the squared-error of the prediction compared to reality. This rewards the agent to perform state-action pairs that had not been done many times. This is however susceptible to the noisy TV problem. Dynamics model can be run in latent space. That is, r t i = ‖ f ( s t , a t ) − ϕ ( s t + 1 ) ‖ 2 2 {\displaystyle r_{t}^{i}=\|f(s_{t},a_{t})-\phi (s_{t+1})\|_{2}^{2}} for some featurizer ϕ {\displaystyle \phi } . The featurizer can be the identity function (i.e. ϕ ( x ) = x {\displaystyle \phi (x)=x} ), randomly generated, the encoder-half of a variational autoencoder, etc. A good featurizer improves forward dynamics exploration. The Intrinsic Curiosity Module (ICM) method trains simultaneously a forward dynamics model and a featurizer. The featurizer is trained by an inverse dynamics model, which is a function for predicting the current action based on the features of the current and the next state: g : ( ϕ ( s t ) , ϕ ( s t + 1 ) ) ↦ a t {\displaystyle g:(\phi (s_{t}),\phi (s_{t+1}))\mapsto a_{t}} . By optimizing the inverse dynamics, both the inverse dynamics model and the featurizer are improved. Then, the improved featurizer improves the forward dynamics model, which improves the exploration of the agent. Random Network Distillation (RND) method attempts to solve this problem by teacher–student distillation. Instead of a forward dynamics model, it has two models f , f ′ {\displaystyle f,f'} . The f ′ {\displaystyle f'} teacher model is fixed, and the f {\displaystyle f} student model is trained to minimize ‖ f ( s ) − f ′ ( s ) ‖ 2 2 {\displaystyle \|f(s)-f'(s)\|_{2}^{2}} on states s {\displaystyle s} . As a state is visited more and more, the student network becomes better at predicting the teacher. Meanwhile, the prediction error is also an exploration reward for the agent, and so the agent learns to perform actions that result in higher prediction error. Thus, we have a student network attempting to minimize the prediction error, while the agent attempting to maximize it, resulting in exploration. The states are normalized by subtracting a running average and dividing a running variance, which is necessary since the teacher model is frozen. The rewards are normalized by dividing with a running variance. Exploration by disagreement trains an ensemble of forward dynamics models, each on a random subset of all ( s t , a t , s t + 1 ) {\displaystyle (s_{t},a_{t},s_{t+1})} tuples. The exploration reward is the variance of the models' predictions. === Noise === For neural network–based agents, the NoisyNet method changes some of its neural network modules by noisy versions. That is, some network parameters are random variables from a probability distribution. The parameters of the distribution are themselves learnable. For example, in a linear layer y = W x + b {\displaystyle y=Wx+b} , both W , b {\displaystyle W,b} are sampled from Gaussian distributions N ( μ W , Σ W ) , N ( μ b , Σ b ) {\displaystyle {\mathcal {N}}(\mu _{W},\Sigma _{W}),{\mathcal {N}}(\mu _{b},\Sigma _{b})} at every step, and the parameters μ W , Σ W , μ b , Σ b {\displaystyle \mu _{W},\Sigma _{W},\mu _{b},\Sigma _{b}} are learned via the reparameterization trick.

    Read more →
  • Google Research

    Google Research

    Google Research (also known as Research at Google) is the research division of Google, a subsidiary of Alphabet Inc.. According to its official website, Google Research publishes findings, releases open-source software, and applies research results within Google products and services as well as within the wider scientific community. == Notable contributions == The 2017 landmark paper Attention Is All You Need, which introduced the Transformer architecture, which has subsequently been used to build modern large language models. Advances in neural machine translation powering Google Translate. Time series forecasting. Development of scalable learning systems and infrastructure for large-model training. Flood forecasting. Research into computational discovery via Google Accelerated Science including demonstrating the first below-threshold quantum calculations.

    Read more →
  • JAX (software)

    JAX (software)

    JAX is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning. It is developed by Google with contributions from Nvidia and other community contributors. It is described as bringing together a modified version of the automatic differentiation system autograd and OpenXLA's XLA (Accelerated Linear Algebra). It is designed to follow the structure and workflow of NumPy as closely as possible and works with various existing frameworks such as TensorFlow and PyTorch. The primary features of JAX are: Providing a unified NumPy-like interface to computations that run on CPU, GPU, or TPU, in local or distributed settings. Built-in Just-In-Time (JIT) compilation via OpenXLA, an open-source machine learning compiler ecosystem. Efficient evaluation of gradients via its automatic differentiation transformations. Automatic vectorization to efficiently map functions over arrays representing batches of inputs. == Libraries using Jax == Flax Equinox Optax

    Read more →
  • Message queuing service

    Message queuing service

    A message queueing service is a message-oriented middleware or MOM deployed in a compute cloud using software as a service model. Service subscribers access queues and or topics to exchange data using point-to-point or publish and subscribe patterns. It's important to differentiate between event-driven and message-driven (aka queue driven) services: Event-driven services (e.g. AWS SNS) are decoupled from their consumers. Whereas queue / message driven services (e.g. AWS SQS) are coupled with their consumers. Message queues can be a good buffer to handle spiky workloads but they have a finite capacity. According to Gregor Hohpe, message queues require proper mechanisms (aka flow controls) to avoid filling the queue beyond its manageable capacity and to keep the system stable. == Ordering Guarantees in Message Queues == Amazon SQS FIFO and Azure Service Bus sessions are queue-based messaging systems that provide ordering guarantees within a message group or session attempt but do not necessarily guarantee ordered delivery in cases of retries or failures. In SQS FIFO, messages in the same message group are processed in order, with subsequent messages held until the preceding message is successfully processed or moved to the dead-letter queue (DLQ). Once a message is placed in the DLQ, it is no longer retried, creating a gap in the sequence. However, the remaining messages continue to be delivered in order. Azure Service Bus sessions function similarly by maintaining ordering within a session, provided a single consumer processes messages sequentially. The implementation differs from SQS FIFO but follows the same fundamental ordering principle. In contrast, Apache Kafka is a distributed log-based messaging system that guarantees ordering within individual partitions rather than across the entire topic. Unlike queue-based systems, Kafka retains messages in a durable, append-only log, allowing multiple consumers to read at different offsets. Kafka uses manual offset management, giving consumers control over retries and failure handling. If a consumer fails to process a message, it can delay committing the offset, preventing further progress in that partition while other partitions remain unaffected. This partition-based design enables fault isolation and parallel processing while allowing ordering to be maintained within partitions, depending on consumer handling. == Vendors == Apache Kafka Apache Kafka is a distributed system consisting of servers that store and forward messages between producer client and consumer applications. IBM MQ IBM MQ offers a managed service that can be used on IBM Cloud and Amazon Web Services. Microsoft Azure Service Bus Service Bus offers queues, topics & subscriptions, and rules/actions in order to support publish-subscribe, temporal decoupling, and load balancing scenarios. Azure Service Bus is built on AMQP allowing any existing AMQP 1.0 client stack to interact with Service Bus directly or via existing .Net, Java, Node, and Python clients. Standard and Premium tiers allow for pay as you go or isolated resources at massive scale. Oracle Messaging Cloud Service This service provides a messaging solution for applications for asynchronous communication and is influenced by the Java Message Service (JMS) API specification. Any application platform that understands HTTP can also use Oracle Messaging Cloud Service through the REST interface. For Java applications, Oracle Messaging Cloud Service provides a Java library that implements and extends the JMS 1.1 interface. The Java library implements the JMS API by acting as a client of the REST API. Amazon Simple Queue Service Supports messages natively up to 256K, or up to 2GB by transmitting payload via S3. Highly scalable, durable and resilient. Provides loose-FIFO and 'at least once' delivery in order to provide massive scale. Supports REST API and optional Java Message Service client. Low latency. Utilizes Amazon Web Services. IronMQ Supports messages up to 64k; guarantees order; guarantees once only delivery; no delays retrieving messages. Supports REST API and beanstalkd open source protocol. Runs on multiple clouds including AWS and Rackspace. Scaling must be managed by user. RabbitMQ RabbitMQ is a reliable and mature messaging and streaming broker, which is easy to deploy on cloud environments, on-premises, and on your local machine. Supports AMQP, STOMP, MQTT StormMQ Open platform supports messages up to 50Mb. Uses AMQP to avoid vendor lock-in and provide language neutrality. Locate-It Option allows customers to audit the location of their data at all times and satisfy data protection principles. AnypointMQ An enterprise multi-tenant, cloud messaging service that performs advanced asynchronous messaging scenarios between applications. Anypoint MQ is fully integrated with Anypoint Platform, offering role based access control, client application management, and connectors.

    Read more →
  • Operation Serenata de Amor

    Operation Serenata de Amor

    Operation Serenata de Amor is an artificial intelligence project designed to analyze public spending in Brazil. The project has been funded by a recurrent financing campaign since September 7, 2016, and came in the wake of major scandals of misappropriation of public funds in Brazil, such as the Mensalão scandal and what was revealed in the Operation Car Wash investigations. The analysis began with data from the National Congress then expanded to other types of budget and instances of government, such as the Federal Senate. The project is built through collaboration on GitHub and using a public group with more than 600 participants on Telegram. The name "Serenata de Amor," which means "serenade of love," was taken from a popular cashew cream bonbon produced by Chocolates Garoto in Brazil. == Modules == Throughout development of the project, new modules have been newly introduced in addition to the main repository: The main repository, serenata-de-amor, serves as the starting point for investigative work. Rosie is the robot programmed to identify public funds expenses with discrepancies, starting with CEAP (Quota for Exercise of Parliamentary Activity); it analyzes each of the reimbursements requested by the deputies and senators, indicating the reasons that lead it to believe they are suspicious. From Rosie was born whistleblower, which tweets under the name of @RosieDaSerenata, distributing the results found on social media. Jarbas (Github repository) is a data visualization tool which shows a complete list of reimbursements made available by the Chamber of Deputies and mined by Rosie. Toolbox is a Python installable package that supports the development of Serenata de Amor and Rosie. == History == Operation Serenata de Amor is an Artificial intelligence project for analysis of public expenditures. It was conceived in March 2016 by data scientist Irio Musskopf, sociologist Eduardo Cuducos and entrepreneur Felipe Cabral. The project was financed collectively in the Catarse platform, where it reached 131% of the collection goal paying 3 months of project development. Ana Schwendler, also a data scientist, Pedro Vilanova "Tonny", data journalist, Bruno Pazzim, software engineer, Filipe Linhares, a frontend engineer, Leandro Devegili, an entrepreneur and André Pinho took the first steps towards constructing the platform, such as collecting and structuring the first datasets. Jessica Temporal, data scientist and Yasodara Córdova "Yaso", researcher, Tatiana Balachova "Russa", UX designer, joined the project after the financing took place. The members created a recurring financing campaign, expanding the analysis of public spending to the Federal Senate. Donors make monthly payments ranging from 5 BRL to 200 BRL to maintain group activities. The monthly amount collected is around 10,000 BRL. == Results == In January 2017, concluding the period financed by the initial campaign, the group carried out an investigation into the suspicious activities found by the data analysis system. 629 complaints were made to the Ombudsman's Office of the Chamber of Deputies, questioning expenses of 216 federal deputies. In addition, the Facebook project page has more than 25,000 followers, and users frequently cite the operation as a benchmark in transparency in the Brazilian government. One of the examples of results obtained by the operation is the case of the Deputy who had to return about 700 BRL to the House after his expenses were analyzed by the platform. The platform was able to analyze more than 3 million notes, raising about 8,000 suspected cases in public spending. The community that supports the work of the team benefits from open source repositories, with licenses open for the collaboration. So much so that the two main data scientists of the project presented it at the CivicTechFest in Taipei, obtaining several mentions even in the international press. The technical leader presented the project in Poland during DevConf2017 in Kraków. It was also presented in the Google News Lab in 2017. It was presented by Yaso, when she was the Director of the initiative, at the MIT Media Lab/Berkman Klein Center Initiative for Artificial Intelligence ethics, and at the Artificial Intelligence and Inclusion Symposium, an initiative of the Global Network of Internet & Society Centers (NoC). It was also presented both by Irio and Yaso at the Digital Harvard Kennedy School, over a lunch seminar, where the transparency of the platform and the main solutions found were discussed, so that the code and data are always available to verify its suitability. This infographic provides information about the first results of Operation Serenata de Amor, a project that analyzes open data on public spending to find discrepancies. The project was presented by Yaso to the House Audit and Control Committee of the Chamber of Deputies in August 2017, and raised the interest of House officials who work with open data. The operation has been a source of inspiration for other civic projects that aim to work with similar goals, demonstrating the broader impact of artificial intelligence also in industry in Brazil. Participation of several team members in events throughout Brazil and abroad can be found on the Internet, such as presentation at OpenDataDay, held at Calango Hackerspace in the Federal District, Campus Party Bahia, Campus Party Brasilia, Friends of Tomorrow, XIII National Meeting of Internal Control, in the event USP Talks Hackfest against corruption in João Pessoa, the latter being also highlighted in the National Press.

    Read more →
  • Artificial intelligence and elections

    Artificial intelligence and elections

    As artificial intelligence (AI) has become more mainstream, there is growing concern about how this will influence elections. Potential targets of AI include election processes, election offices, election officials and election vendors. There are also global efforts to improve elections using AI. == Tactics == Generative AI capabilities allow creation of misleading content. Examples of this include text-to-video, deepfake videos, text-to-image, AI-altered images, text-to-speech, voice cloning, and text-to-text. In the context of an election, a deepfake video of a candidate may propagate information that the candidate does not endorse. Chatbots could spread misinformation related to election locations, times or voting methods. In contrast to malicious actors in the past, these techniques require little technical skill and can spread rapidly. LLM-generated messages have the capacity to persuade humans on political issues. Researchers have begun to investigate how people rate messages that LLMs generate for how persuasive they are. When it came to policy issues, the LLM-generated messages received a 2.91 compared to a 2.80 when it came to smartness between the AI and humans. The LLM-generated messages were often more technical and analytical than human-generated messages. Generative AI has been used to micro-target people during tight political elections. The generation of targeted large language models has triggered concern that they will be used to leverage readily scale microtargeting. Rephrasing inputs have been used to generate fraudulent emails and phishing websites. Rephrasing inputs in a microtargeting does not violate the terms of OpenAI usage. There are no safeguards to prevent the use of rephrasing and creation of fraudulent emails. Political campaign managers have access to this allowing for them to create targeted content. == Usage by country == === Argentina === ==== 2023 elections ==== During the 2023 Argentine primary elections, Javier Milei's team distributed AI generated images including a fabricated image of his rival Sergio Massa and drew 3 million views. The team also created an unofficial Instagram account entitled "AI for the Homeland." Sergio Massa's team also distributed AI generated images and videos. === Bangladesh === ==== 2024 elections ==== In the run up to the 2024 Bangladeshi general election, deepfake videos of female opposition politicians appeared. Rumin Farhana was pictured in a bikini while Nipun Ray was shown in a swimming pool. === Canada === ==== 2025 elections ==== In the run up to the 2025 Canadian federal election, the use of AI tools is likely to figure prominently. India, Pakistan and Iran are all expected to make efforts to subvert the national vote using disinformation campaigns to deceive voters and sway diaspora communities. In a report by the Canadian Centre for Cyber Security called "Cyber Threats to Canada's Democratic Process: 2025 Update", it states that malicious actors including China and Russia: "are most likely to use generative AI as a means of creating and spreading disinformation, designed to sow division among Canadians and push narratives conducive to the interests of foreign states". === France === ==== 2024 elections ==== In the 2024 French legislative election, deepfake videos appeared claiming: i) That they showed the family of Marine le Pen. In the videos, young women, supposedly Le Pen's nieces, are seen skiing, dancing and at the beach "while making fun of France’s racial minorities": However, the family members don't exist. On social media there were over 2 million views. ii) In a video seen on social media, a deepfake video of a France24 broadcast appeared to report that the Ukrainian leadership had "tried to lure French president Emmanuel Macron to Ukraine to assassinate him and then blame his death on Russia". === Ghana === ==== 2024 elections ==== During the months before the December 2024 Ghanaian general election, a network of at least 171 fake accounts has been used to spam social media. Posts have been used by a group identified as "@TheTPatriots" to promote the New Patriotic Party, although it is not known whether the two are connected. All the networks' posts were "highly likely" to have been generated by ChatGPT and appear to be the "first secretly partisan network using AI to influence elections in Ghana". The opposition National Democratic Congress was also criticized with its leader John Mahama being called a drunkard. === India === ==== 2024 elections ==== In the 2024 Indian general election, politicians used deepfakes in their campaign materials. These deepfakes included politicians who had died prior to the election. Mathuvel Karunanidhi's party posted with his likeness even though he had died 2018. A video The All-India Anna Dravidian Progressive Federation party posted showed an audio clip of Jayaram Jayalalithaa even though she had died in 2016. The Deepfakes Analysis Unit (DAU) is an open source platform created in March 2024 for the public to share misleading content and assess if it had been AI-generated. AI was also used to translate political speeches in real time. This translating ability was widely used to reach more voters. === Indonesia === ==== 2024 elections ==== In the 2024 Indonesian presidential election, Prabowo Subianto made extensive use of AI-generated art in his campaign, which ranged from images of himself as an adorable child to various child portrayals in his advertisements. The Indonesian Children's Protection Commission condemned these ads, labeling them as a form of misuse. Other candidates, Anies Baswedan and Ganjar Pranowo, also incorporated AI art into their campaigns. Throughout the election period, all presidential candidates faced attacks from deepfakes, both in video and audio formats. === Ireland === ==== 2024 elections ==== In the last weeks of the 2024 Irish general election a spoof election poster appeared in Dublin featuring "an AI-generated candidate with three arms". The candidate is called Aidan Irwin, but no-one stood in the election with that name. A slogan on the poster says "put matters into artificial intelligence’s hands". The convincing election poster shows a man that "has six fingers on one hand, three arms, and a distorted thumb". === New Zealand === ==== 2023 elections ==== In May 2023, ahead of the 2023 New Zealand general election in October 2023, the New Zealand National Party published a "series of AI-generated political advertisements" on its Instagram account. After confirming that the images were faked, a party spokesperson said that it was "an innovative way to drive our social media". === Pakistan === ==== 2024 elections ==== AI has been used by the imprisoned ex-Prime Minister Imran Khan and his media team in the 2024 Pakistani general election: i) An AI generated audio of his voice was added to a video clip and was broadcast at a virtual rally. ii) An op-ed in The Economist written by Khan was later claimed by himself to have been written by AI which was later denied by his team. The article was liked and shared on social media by thousands of users. === South Africa === ==== 2024 elections ==== In the 2024 South African general election, there were several uses of AI content: i) A deepfaked video of Joe Biden emerged on social media showing him saying that "The U.S. would place sanctions on SA and declare it an enemy state if the African National Congress (ANC) won". ii) In a deepfake video, Donald Trump was shown endorsing the uMkhonto weSizwe party. It was posted to social media and was viewed more than 158,000 times. iii) Less than 3 months before the elections, a deepfake video showed U.S. rapper Eminem endorsing the Economic Freedom Fighters party while criticizing the ANC. The deepfake was viewed on social media more than 173,000 times. === South Korea === ==== 2022 elections ==== In the 2022 South Korean presidential election, a committee for one presidential candidate Yoon Suk Yeol released an AI avatar 'Al Yoon Seok-yeol' that would campaign in places the candidate could not go. The other presidential candidate Lee Jae-myung introduced a chatbot that provided information about the candidate's pledges. ==== 2024 elections ==== Deepfakes were used to spread misinformation before the 2024 South Korean legislative election with one source reporting 129 deepfake violations of election laws within a two week period. Seoul hosted the 2024 Summit for Democracy, a virtual gathering of world leaders initiated by US President Joe Biden in 2021. The focus of the summit was on digital threats to democracy including artificial intelligence and deepfakes. === Taiwan === ==== 2024 elections ==== AI-generated content was used during the 2024 Taiwanese presidential election. Among the media were: i) A deepfake video of General Secretary of the Chinese Communist Party Xi Jinping which showed him supporting the presidential elections. Created on social media, the video was "widely circulated

    Read more →
  • Hallucination (artificial intelligence)

    Hallucination (artificial intelligence)

    In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting, confabulation, or delusion) is a response generated by AI that contains false or misleading information presented as fact. This term draws a loose analogy with human psychology, where a hallucination typically involves false percepts. For example, a chatbot powered by large language models (LLMs), like ChatGPT, may embed plausible-sounding random falsehoods within its generated content. Detecting and mitigating errors and hallucinations pose significant challenges for practical deployment and reliability of LLMs in high-stakes scenarios, such as chip design, supply chain logistics, and medical diagnostics. Some software engineers and statisticians have criticized the specific term "AI hallucination" for unreasonably anthropomorphizing computers. Symbolic artificial intelligence models generally do not produce hallucinations, unlike large language models. == Term == === Origin === Since the 1980s, the term "hallucination" has been used in computer vision with a positive connotation to describe the process of adding detail to an image. For example, the task of generating high-resolution face images from low-resolution inputs is called face hallucination. The first documented use of the term "hallucination" in this sense is in the PhD thesis of Eric Mjolsness in 1986. A notable work is the face hallucination algorithm by Simon Baker and Takeo Kanade published in 1999. In the 2000s, hallucinations were described in statistical machine translation as a failure mode. Since the 2010s, the term has undergone a semantic shift to signify the generation of factually incorrect or misleading outputs by AI systems in tasks like machine translation and object detection. In 2015, hallucinations were identified in visual semantic role labeling tasks by Saurabh Gupta and Jitendra Malik. In 2015, computer scientist Andrej Karpathy used the term "hallucinated" in a blog post to describe his recurrent neural network (RNN) language model generating an incorrect citation link. In 2017, Google researchers used the term to describe the responses generated by neural machine translation (NMT) models when they are not related to the source text, and in 2018, the term was used in computer vision to describe instances where non-existent objects are erroneously detected because of adversarial attacks. In July 2021, Meta warned during its release of BlenderBot 2 that the system is prone to "hallucinations", which Meta defined as "confident statements that are not true". Following OpenAI's ChatGPT release in beta version in November 2022, some users complained that such chatbots often seem to pointlessly embed plausible-sounding random falsehoods within their generated content. Many news outlets, including The New York Times, started to use the term "hallucinations" to describe these models' frequently incorrect or inconsistent responses. In 2023, the Cambridge dictionary updated its definition of hallucination to include this new sense specific to the field of AI. Some researchers have highlighted a lack of consistency in how the term is used, but also identified several alternative terms in the literature, such as confabulations, fabrications, and factual errors. === Definitions and alternatives === Uses, definitions and characterizations of the term "hallucination" in the context of LLMs include: "a tendency to invent facts in moments of uncertainty" (OpenAI, May 2023) "a model's logical mistakes" (OpenAI, May 2023) "fabricating information entirely, but behaving as if spouting facts" (CNBC, May 2023) "making up information" (The Verge, February 2023) "probability distributions" (in scientific contexts) Journalist Benj Edwards, in Ars Technica, writes that the term "hallucination" is controversial, but that some form of metaphor remains necessary; Edwards suggests "confabulation" as an analogy for processes that involve "creative gap-filling". In July 2024, a White House report on fostering public trust in AI research mentioned hallucinations only in the context of reducing them. Notably, when acknowledging David Baker's Nobel Prize-winning work with AI-generated proteins, the Nobel committee avoided the term entirely, instead referring to "imaginative protein creation". Hicks, Humphries, and Slater, in their article in Ethics and Information Technology, argue that the output of LLMs is "bullshit" under Harry Frankfurt's definition of the term, and that the models are "in an important way indifferent to the truth of their outputs", with true statements only accidentally true, and false ones accidentally false. Some researchers also use the derogatory term "botshit", often referring to uncritical use of AI. === Criticism === In the scientific community, some researchers avoid the term "hallucination", seeing it as potentially misleading. It has been criticized by Usama Fayyad, executive director of the Institute for Experimental Artificial Intelligence at Northeastern University, on the grounds that it misleadingly personifies large language models and is vague. Mary Shaw said, "The current fashion for calling generative AI's errors 'hallucinations' is appalling. It anthropomorphizes the software, and it spins actual errors as somehow being idiosyncratic quirks of the system even when they're objectively incorrect." In Salon, statistician Gary Smith argues that LLMs "do not understand what words mean" and consequently that the term "hallucination" unreasonably anthropomorphizes the machine. Murray Shanahan argues that anthropomorphic framing of LLM capabilities, including terms like "hallucination", encourages users and researchers to attribute cognitive processes to systems that operate through statistical pattern completion, and advocates for more careful linguistic practices when discussing LLM behavior. Kristina Šekrst argues that applying psychological vocabulary to LLM outputs obscures the difference between the appearance of mental properties and their genuine presence. Förster & Skop assert that tech companies use the hallucination metaphor to anthropomorphize models and deflect responsibility for non-factual outputs. Some see the AI outputs not as illusory but as prospective—that is, having some chance of being true, similar to early-stage scientific conjectures. The term has also been criticized for its association with psychedelic drug experiences. == In natural language generation == In natural language generation, there are several reasons why natural language models hallucinate: === Hallucination from data === Hallucinations can stem from incomplete, inaccurate or unrepresentative data sets. === Modeling-related causes === The pre-training of generative pretrained transformers (GPT) involves predicting the next word. It incentivizes GPT models to "give a guess" about what the next word is, even when they lack information. Some researchers take an anthropomorphic perspective and posit that hallucinations arise from a tension between novelty and usefulness. For instance, Amabile and Pratt define human creativity as the production of novel and useful ideas. By extension, a focus on novelty in machine creativity can lead to the production of original but inaccurate responses—that is, falsehoods—whereas a focus on usefulness may result in memorized content lacking originality. By 2022, newspapers such as The New York Times expressed concern that, as the adoption of bots based on large language models continued to grow, unwarranted user confidence in bot output could lead to problems. === Interpretability research === In 2025, interpretability research by Anthropic on the LLM Claude identified internal circuits that cause it to decline to answer questions unless it knows the answer. By default, the circuit is active and the LLM doesn't answer. When the LLM has sufficient information, these circuits are inhibited and the LLM answers the question. Hallucinations were found to occur when this inhibition happens incorrectly, such as when Claude recognizes a name but lacks sufficient information about that person, causing it to generate plausible but untrue responses. === Examples === On 15 November 2022, researchers from Meta AI published Galactica, designed to "store, combine and reason about scientific knowledge". Content generated by Galactica came with the warning: "Outputs may be unreliable! Language Models are prone to hallucinate text." In one case, when asked to draft a paper on creating avatars, Galactica cited a fictitious paper from a real author who works in the relevant area. Meta withdrew Galactica on 17 November due to offensiveness and inaccuracy. OpenAI's ChatGPT, released in beta version to the public on November 30, 2022, was based on the foundation model GPT-3.5 (a revision of GPT-3). Professor Ethan Mollick of Wharton called it an "omniscient, eager-to-please intern who sometimes lies to you". Data scientist Teresa Kuba

    Read more →
  • Showcase Workshop

    Showcase Workshop

    Showcase Workshop, also referred to as Showcase, is a SaaS company that develops a presentation-building application for business use. Users upload files and images to a web platform which generates presentations viewable on a suite of mobile apps. Showcase was founded in 2011. The company’s headquarters are in Wellington, New Zealand. == History == Showcase Workshop was originally developed in response to dynamically changing content being presented on iPads at the 2012 Olympics. After market-testing a beta version of the core application, Showcase Workshop launched commercially in 2012. In 2014 Showcase partnered with Vodafone Global Enterprise. == Product == Users upload pre-existing PDFs, videos, images and Microsoft Office documents to a secure server, building presentations or ‘showcases’ which can then be downloaded via the mobile apps. The presentations are used for mobile sales enablement, training, or operational/health and safety purposes. == Reception == Reviewers have praised the ease of use of Showcase, calling it a “better alternative to developing a native app” and “intuitive”. Criticisms include the lack of differing templates and a lack of complex customisation controls. Showcase was nominated for a Tabby Award in 2014 and won a Tabby Award in 2015 for its Windows app.

    Read more →
  • Biohybrid system

    Biohybrid system

    Biohybrid systems refer to the integration of biological materials, such as cells or tissues, with artificial components, including electronics or mechanical structure. This combination incorporates the capabilities of living organisms with the precision of man-made technology. As a result, these systems perform tasks that neither biology nor machines could achieve independently. Biohybrid systems might use lab-cultured muscle cells to power small robots or combine sensors with living tissue for better health sensing. The intent behind these systems is to combine the benefits of biological and technological components to introduce new solutions for complex medical challenges. Biohybrid systems may have transformative potential across sectors, such as robotics to create actuators and sensors that mimic natural muscle and nerve function, medicine in developing smart implants and drug delivery systems, in prosthetics for enhancing user control through neural or muscular interfaces and environmental sustainability for deploying biohybrid solutions for pollution sensing or remediation. == Origin == The term "biohybrid" is a compound of "bio" from biology (meaning life) and "hybrid" (referring to a combination of distinct elements), denoting a field of study. Its use helps distinguish such systems from purely biological constructs or entirely synthetic machines. Early academic mentions may include bio actuated robotics papers and foundational tissue-robot integration studies published in journals like Nature Biotechnology or Science Robotics. The emergence of the term reflects a growing recognition of the need to describe systems that do not fit cleanly into traditional categories. == Design principles == One of the most significant biohybrid challenges is to engineer interfaces between living tissue and artificial materials that are efficient. This means having precise control over adhesion at the surface, diffusion of nutrients, and signal conduction. Actuation mechanisms within the heart of these systems generate movement or mechanical response. These may be in the form of living muscle cells such as skeletal myocytes or cardiomyocytes, soft pneumatic actuators, or electrical stimulation-responsive tissues. Materials selection is equally critical. Hydrogels, elastomers like PDMS (polydimethylsiloxane), and biopolymers are commonly used due to their softness and biocompatibility. These materials must support cell viability, resist immune attack, and allow the integration of mechanical or electrical components. == Key components == At their core, biohybrid systems work by bridging living biological parts with technology. Through this integration, functionality that neither system could accomplish singularly is possible. Biological parts may be cells, tissues, or even organs—occasionally cultured in a laboratory setting. These biological parts carry out biologically inspired behaviors, such as muscle contraction or chemical sensing in the body. Technological components may constitute devices like sensors, electronic components, and mechanical structure. These manipulate the system, supply power, or transfer data. An example is a sensor that is implantable within a body and detects glucose levels as it sends information to a smart phone. By integrating these artificial and biological parts, biohybrid systems can perform advanced functions, such as tissue regeneration, real-time health monitoring, or the recovery of motor function in paralysis patients. Biohybrid systems generally consist of two major components: the biological and the mechanical. Biological components may include muscle cells for contraction, endothelial cells for vascularization, and stem cells for regenerative capabilities. Mechanical components comprise soft actuators that mimic organic motion, synthetic scaffolds that provide support and structure, and microfluidic systems that facilitate the delivery of nutrients and removal of waste. These components are combined in a manner that allows for dynamic, lifelike behavior—such as the contraction of tissue or the propagation of mechanical waves—while maintaining biocompatibility and durability. == Applications == The range of applications for biohybrid systems is broad and continuously expanding. In robotics, biohybrid structures have been used to engineer microscopic, muscle-driven machines, such as Harvard University's biohybrid stingray robot. In medical applications, they offer new alternatives for organ repair and augmentation, including biohybrid heart valves and esophageal scaffolds. Biohybrids are also promising in neural interfaces, where the goal is to create long-lasting, stable interaction between mechanical devices and brain tissue. Muscle-actuated drug response platforms are under exploration in pharmacology for modelling and real-time screening. == Examples == Several high-profile research projects have demonstrated the potential of biohybrid systems: Harvard researchers developed a biohybrid swimming ray powered by rat cardiac cells layered onto a gold skeleton, mimicking the motion of a real stingray. At the Massachusetts Institute of Technology, a cardiac pump actuated entirely by living heart muscle cells was engineered to simulate the behavior of a beating heart. Bio actuated soft robots have been built to simulate gut peristalsis, using muscle contractions to replicate natural wave-like movement in the digestive tract. == Challenges and limitations == As with many technologies that involve living systems, biohybrid systems raise important ethical and biomedical questions. Cell sourcing remains a key issue, particularly when embryonic or animal-derived cells are used. Long-term viability is another concern—living tissues must be kept alive with nutrients and oxygen, and they often degrade or elicit immune responses when implanted. Powering these biological parts presents logistical and ethical hurdles as well. Systems must either include internal mechanisms for nutrient delivery or be supported externally, which can limit portability and independence. == Future directions == Researchers are exploring self-directed, self-regulated organ substitutes and regenerative implants that can respond to their surroundings in real-time. These systems may be integrated with artificial intelligence to make them adjust to stimuli and coordinate complex behaviors. Future potential applications are wearable biohybrid systems for rehabilitation, space medicine devices for long-duration missions, and implantable devices that integrate into human physiology.

    Read more →
  • Reasoning model

    Reasoning model

    A reasoning model, also known as a reasoning language model (RLM) or large reasoning model (LRM), is a type of large language model (LLM) that has been specifically trained to solve complex tasks requiring multiple steps of logical reasoning. These models demonstrate superior performance on logic, mathematics, and programming tasks compared to standard LLMs. They possess the ability to revisit and revise earlier reasoning steps and utilize additional computation during inference as a method to scale performance, complementing traditional scaling approaches based on training data size, model parameters, and training compute. == Overview == Unlike traditional language models that generate responses immediately, reasoning models allocate additional compute, or thinking, time before producing an answer to solve multi-step problems. OpenAI introduced this terminology in September 2024 when it released the o1 series, describing the models as designed to "spend more time thinking" before responding. The company framed o1 as a reset in model naming that targets complex tasks in science, coding, and mathematics, and it contrasted o1's performance with GPT-4o on benchmarks such as AIME and Codeforces. Independent reporting the same week summarized the launch and highlighted OpenAI's claim that o1 automates chain-of-thought style reasoning to achieve large gains on difficult exams. In operation, reasoning models generate internal chains of intermediate steps, then select and refine a final answer. OpenAI reported that o1's accuracy improves as the model is given more reinforcement learning during training and more test-time compute at inference. The company initially chose to hide raw chains and instead return a model-written summary, stating that it "decided not to show" the underlying thoughts so researchers could monitor them without exposing unaligned content to end users. Commercial deployments document separate "reasoning tokens" that meter hidden thinking and a control for "reasoning effort" that tunes how much compute the model uses. These features make the models slower than ordinary chat systems while enabling stronger performance on difficult problems. == History == The research trajectory toward reasoning models combined advances in supervision, prompting, and search-style inference. Early alignment work on reinforcement learning from human feedback showed that models can be fine-tuned to follow instructions with "human feedback" and preference-based rewards. In 2022, Google Research scientists Jason Wei and Denny Zhou showed that chain-of-thought prompting "significantly improves the ability" of large models on complex reasoning tasks. Input → Step 1 → Step 2 → ⋯ → Step n ⏟ Reasoning chain → Answer {\displaystyle {\text{Input}}\rightarrow \underbrace {{\text{Step}}_{1}\rightarrow {\text{Step}}_{2}\rightarrow \cdots \rightarrow {\text{Step}}_{n}} _{\text{Reasoning chain}}\rightarrow {\text{Answer}}} A companion result demonstrated that the simple instruction "Let's think step by step" can elicit zero-shot reasoning. Follow-up work introduced self-consistency decoding, which "boosts the performance" of chain-of-thought by sampling diverse solution paths and choosing the consensus, and tool-augmented methods such as ReAct, a portmanteau of Reason and Act, that prompt models to "generate both reasoning traces" and actions. Research then generalized chain-of-thought into search over multiple candidate plans. The Tree-of-Thoughts framework from Princeton computer scientist Shunyu Yao proposes that models "perform deliberate decision making" by exploring and backtracking over a tree of intermediate thoughts. OpenAI's reported breakthrough focused on supervising reasoning processes rather than only outcomes, with Lightman et al.'s "Let's Verify Step by Step" reporting that rewarding each correct step "significantly outperforms outcome supervision" on challenging math problems and improves interpretability by aligning the chain-of-thought with human judgment. OpenAI's o1 announcement ties these strands together with a large-scale reinforcement learning algorithm that trains the model to refine its own chain of thought, and it reports that accuracy rises with more training compute and more time spent thinking at inference. Together, these developments define the core of reasoning models. They use supervision signals that evaluate the quality of intermediate steps, they exploit inference-time exploration such as consensus or tree search, and they expose controls for how much internal thinking compute to allocate. OpenAI's o1 family made this approach available at scale in September 2024 and popularized the label "reasoning model" for LLMs that deliberately think before they answer. The development of reasoning models illustrates Richard S. Sutton's "bitter lesson" that scaling compute typically outperforms methods based on human-designed insights. This principle was demonstrated by researchers at the Generative AI Research Lab (GAIR), who initially attempted to replicate o1's capabilities using sophisticated methods including tree search and reinforcement learning in late 2024. Their findings, published in the "o1 Replication Journey" series, revealed that knowledge distillation, a comparatively straightforward technique that trains a smaller model to mimic o1's outputs, produced unexpectedly strong performance. This outcome illustrated how direct scaling approaches can, at times, outperform more complex engineering solutions. === Drawbacks === Reasoning models require significantly more computational resources during inference compared to non-reasoning models. Research on the American Invitational Mathematics Examination (AIME) benchmark found that reasoning models were 10 to 74 times more expensive to operate than their non-reasoning counterparts. The extended inference time is attributed to the detailed, step-by-step reasoning outputs that these models generate, which are typically much longer than responses from standard large language models that provide direct answers without showing their reasoning process. One researcher in early 2025 argued that these models may face potential additional denial-of-service concerns with "overthinking attacks." === Releases === ==== 2024 ==== In September 2024, OpenAI released o1-preview, a large language model with enhanced reasoning capabilities. The full version, o1, was released in December 2024. OpenAI initially shared preliminary results on its successor model, o3, in December 2024, with the full o3 model becoming available in 2025. Alibaba released reasoning versions of its Qwen large language models in November 2024. In December 2024, the company introduced QvQ-72B-Preview, an experimental visual reasoning model. In December 2024, Google introduced Deep Research in Gemini, a feature designed to conduct multi-step research tasks. On December 16, 2024, researchers demonstrated that by scaling test-time compute, a relatively small Llama 3B model could outperform a much larger Llama 70B model on challenging reasoning tasks. This experiment suggested that improved inference strategies can unlock reasoning capabilities even in smaller models. ==== 2025 ==== In January 2025, DeepSeek released R1, a reasoning model that achieved performance comparable to OpenAI's o1 at significantly lower computational cost. The release demonstrated the effectiveness of Group Relative Policy Optimization (GRPO), a reinforcement learning technique used to train the model. On January 25, 2025, DeepSeek enhanced R1 with web search capabilities, allowing the model to retrieve information from the internet while performing reasoning tasks. Research during this period further validated the effectiveness of knowledge distillation for creating reasoning models. The s1-32B model achieved strong performance through budget forcing and scaling methods, reinforcing findings that simpler training approaches can be highly effective for reasoning capabilities. On February 2, 2025, OpenAI released Deep Research, a feature powered by their o3 model that enables users to conduct comprehensive research tasks. The system generates detailed reports by automatically gathering and synthesizing information from multiple web sources. OpenAI called GPT-4.5 its "last non-chain-of-thought model", and implemented with GPT-5 a router model that selects a model based on the difficulty of the task. ==== 2026 ==== In January 2026, Moonshot AI released Kimi K2.5, an open-source 1 trillion parameter MoE model with 32 billion active parameters. It uses an “Agent Swarm” system that dynamically decomposes tasks into sub-agents for reasoning and execution, enabling more scalable multi-step problem solving than a single sequential reasoning chain. == Training == Reasoning models follow the familiar large-scale pretraining used for frontier language models, then diverge in the post-training and optimization. OpenAI reports that o1 is trained with a large-

    Read more →