AI Data Center Financing Surge

AI Data Center Financing Surge — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Sycophancy (artificial intelligence)

    Sycophancy (artificial intelligence)

    In the field of artificial intelligence, sycophancy is a tendency of large language models (LLMs) and other AI assistants to tailor their responses to what they predict the user wants to hear rather than to what is accurate or warranted. The behavior takes several forms: an assistant may agree with a user's stated opinion even when the user is mistaken; it may abandon a correct answer after a challenge such as "are you sure?"; it may validate beliefs, decisions or self-presentation regardless of merit; or it may praise the user, their work or their ideas in unwarranted terms. The word is borrowed from the ordinary English term for fawning flattery, and is used in AI alignment and AI safety research to describe a class of misalignment failures associated with training on human feedback. Researchers at Anthropic first documented the behavior systematically in 2022. They found that models fine-tuned with reinforcement learning from human feedback (RLHF) were more likely than untuned models to repeat back a user's preferred answer. A 2023 follow-up paper, "Towards Understanding Sycophancy in Language Models", showed that five frontier assistants from OpenAI, Anthropic and Meta all exhibited the behavior, and traced its origin to biases in the human preference data used during training. Later work documented sycophancy in mathematics, medicine, academic peer review and other domains, and identified a broader category called "social sycophancy" affecting an assistant's emotional and interpersonal responses. The issue drew widespread public attention in April 2025 after OpenAI rolled back an update to its GPT-4o model. Users had reported that the assistant praised dangerous decisions, endorsed delusional thinking and offered exaggerated compliments for trivial prompts. OpenAI's post-mortem attributed the change in behavior to an additional training signal based on user thumbs-up and thumbs-down feedback. That episode, together with reporting in The New York Times, Rolling Stone and elsewhere on users drawn into delusional thinking through prolonged chatbot interaction, has been cited in litigation and in academic studies as evidence that sycophancy poses risks to user well-being. Proposed mitigations include fine-tuning on synthetic data that rewards disagreement with incorrect user statements, editing the small subset of model parameters causally responsible for the behavior, changes to the dialogue or system prompt, and benchmarks designed to surface sycophantic behavior before models are released. == Causes == The dominant explanation points to RLHF, the standard technique for aligning chat assistants with user expectations. Human annotators rank candidate model responses; a reward model is trained to predict those rankings; and the language model is then optimized against the reward model. Because human raters tend to prefer outputs that confirm their existing beliefs or flatter their work, the pipeline systematically rewards responses that agree with the annotator. Perez and colleagues at Anthropic published the first large-scale empirical evidence of the effect in 2022. They reported that RLHF training increased the probability that a model would repeat back a dialog user's preferred answer, and that larger models exhibited the behavior more strongly. Sharma and colleagues, the following year, went further and examined Anthropic's own preference data directly. Both the human raters and the reward models trained on their judgments preferred convincingly written sycophantic responses to truthful ones at a non-negligible rate. Wei and co-authors at Google DeepMind found similar results in the PaLM family, observing that both model scale and instruction tuning increased sycophancy on opinion questions. The behavior is often classified as a form of reward hacking, in which an optimization process exploits a flaw in its reward signal rather than achieving the intended objective. OpenAI's post-mortem of the April 2025 GPT-4o incident identified a more specific mechanism. An additional reward signal based on aggregated thumbs-up and thumbs-down feedback from ChatGPT users had, in OpenAI's words, "weakened the influence of our primary reward signal, which had been holding sycophancy in check." Separately, an Anthropic interpretability paper from 2025 located a linear direction in a model's internal activations corresponding to sycophantic behavior, and showed that such "persona vectors" could be used to flag sycophancy-inducing training data and to steer models away from the trait at inference time. == Measurement == The Anthropic team released SycophancyEval with its 2023 paper, supplying test sets for each of the four canonical behaviors. Two further benchmarks from Stanford followed in 2025. SycEval, applied to mathematical and medical reasoning tasks, reported an overall sycophancy rate of 58 per cent across the GPT-4o, Claude and Gemini models tested. ELEPHANT, aimed at social sycophancy, found that the eleven LLMs evaluated affirmed posts that the Reddit community r/AmITheAsshole had judged inappropriate in 42 per cent of cases, and preserved a user's face 45 percentage points more often than human respondents did. Domain-specific benchmarks have followed. BrokenMath tests robustness to plausible-looking but false mathematical claims drawn from competition problems, and reports that the best evaluated model was sycophantic in 29 per cent of cases. SYCON-Bench measures how many dialogue turns are required before a model abandons a correct position. Visual sycophancy in multimodal models has been examined with MM-SY and PENDULUM. A 2026 study by researchers at the Massachusetts Institute of Technology reported that personalization features, which adapt assistants to individual users over repeated sessions, can intensify social sycophancy. == Notable incidents == === GPT-4o rollback (April 2025) === On 25 April 2025, OpenAI completed the rollout of an update to GPT-4o, the default model used in ChatGPT at the time. Within days, users reported that the assistant had begun praising trivial messages in extravagant terms, endorsing impulsive or dangerous decisions, and reinforcing strong emotional statements without pushback. Widely shared examples included the model congratulating a user who reported stopping prescribed psychiatric medication, and praising a business plan to sell "shit on a stick" as venture-capital ready. OpenAI's chief executive, Sam Altman, wrote on 27 April that recent updates had made the model "too sycophant-y and annoying" and said fixes were in progress. The company began reverting the update on 28 April and completed the rollback for free users by 30 April. Two post-mortems followed: a short note on 29 April and a longer technical follow-up, "Expanding on what we missed with sycophancy", on 2 May. Both attributed the regression to a new training signal based on user thumbs-up and thumbs-down feedback, to inadequate pre-launch evaluation for sycophantic drift, and to the dismissal of qualitative concerns raised by internal testers before release. Reporting in CNN, Fortune and Bloomberg News treated the incident as a turning point in public awareness of the problem. === Chatbot-related psychological harm === From mid-2025 onward, news reports began to link sycophantic chatbot behavior to acute psychological harm. In June 2025, The New York Times technology reporter Kashmir Hill published an investigation centered on Eugene Torres, a Manhattan accountant with no history of mental illness, who developed a sustained delusional episode after a series of conversations with ChatGPT about simulation theory. According to the article, the assistant encouraged Torres to stop taking prescribed medication, to cut off friends and family, and at one point told him that he could fly from a nineteen-story building if he "truly believed". Futurism and Rolling Stone ran parallel investigations documenting other cases in which heavy use of ChatGPT had been associated with delusional thinking, involuntary commitment or, in at least one case, the death of a user with a pre-existing psychiatric diagnosis. A 2026 paper by researchers at the Massachusetts Institute of Technology and the University of Washington put forward a formal Bayesian model. It showed that even an ideally rational user could be drawn into what the authors call "delusional spiraling" when interacting with a sufficiently sycophantic assistant, and that the effect was not eliminated by suppressing hallucinations or by warning users in advance. The lawsuit Raine v. OpenAI, filed in San Francisco Superior Court in August 2025 by the parents of a sixteen-year-old who had died by suicide, alleges that "heightened sycophancy" was a design feature of ChatGPT that contributed to their son's death; it is the first wrongful-death suit against a large language-model provider. === Wider commentary === Mainstream coverage in outlets including The New York Times, The Washington Pos

    Read more →
  • Healthy Together

    Healthy Together

    Healthy Together is a health technology company that provides software for Health & Humans Services Departments. Healthy Together supports a “One Door” approach to eligibility, enrollment, and management for programs like Medicaid, Supplemental Nutrition Assistance Program, TANF and WIC, as well as behavioral health (988), disease surveillance, vital records, child welfare and more. The platform's use is to increase the reach and efficacy of program initiatives, improve health equity and reduce cost. Software is available in the United States of America with current deployments in Florida, Oklahoma. The United States Department of Veterans Affairs also utilizes Healthy Together's mobile platform. == Development == Healthy Together launched in March 2020 and builds software for public health and health and human services departments. The Florida Department of Health began using the platform in September 2020 to deliver real-time test results to residents. Over 50% of households in Florida have adopted the mobile application. On December 6, 2022, the Advanced Technology Academic Research Center (ATARC) awarded Healthy Together and the State of Florida's Department of Health with a Digital Experience Award at their 2022 GITEC Emerging Technology Award Ceremony in Washington, D.C. to recognize success of the project. The partnership was also highlighted on the Federal News Network's show Federal Drive. The platform is also used at universities in Oklahoma. In November 2022, the United States Department of Veterans Affairs and Healthy Together announced a collaboration to expand access to health records for Veterans. The platform provides 18 million Veterans with access to their health information through their smartphones and mobile devices. In December 2022, the integration was recognized as one of Healthcare IT News' Top 10 stories of 2022.

    Read more →
  • Xiaomi MiMo

    Xiaomi MiMo

    Xiaomi MiMo is a family of large language models (LLMs) developed by Xiaomi. It was initially released in April 2025 with the MiMo-7B model. Currently, MiMo is available for developers through API service. It is used as the key AI model in Xiaomi's "Human x Car x Home" ecosystem. == Development == Xiaomi developed MiMo as a reasoning-focused language model. Its development team was led by Luo Fuli, who had previously worked at DeepSeek before joining Xiaomi in late 2025. The model was trained using multi-token prediction and reinforcement learning, with a particular emphasis on mathematical reasoning and code generation tasks. In March 2026, Xiaomi CEO Lei Jun announced that the company planned to invest at least US$8.7 billion in artificial intelligence over the following three years. == Models == === List of models === === MiMo-7B === MiMo-7B is the first model of this LLM. The base model, MiMo-7B-Base, was pre-trained on approximately 25 trillion tokens using web pages, academic papers, books, and synthetic reasoning data. MiMo-7B-RL underwent supervised fine-tuning and reinforcement learning on 130,000 mathematics and code problems. MiMo-7B-RL-0530 was released in May 2025. It scaled the fine-tuning dataset from 500,000 to 6 million instances and extended the RL window from 32,000 to 48,000 tokens and improved AIME 2024 scores from 68.2 to 80.1. MiMo-VL-7B was a vision-language model combining a Vision Transformer encoder with the MiMo-7B backbone. It was trained in four stages consuming 2.4 trillion tokens. Its reinforcement learning variant used Mixed On-Policy Reinforcement Learning (MORL) which integrated reward signals across perception, grounding, and reasoning. Xiaomi also released MiMo-Audio-7B, an audio-language model for voice conversion, style transfer, and speech editing. === MiMo-V2-Flash === MiMo-V2-Flash was launched in December 2025. It is a open-sourced Mixture-of-experts model with 309 billion total parameters and 15 billion active parameters. It was trained on 27 trillion tokens using FP8 mixed precision. It used hybrid attention interleaving Sliding Window and Global Attention at a 5:1 ratio. === MiMo-V2-Pro === Xiaomi publicly introduced MiMo-V2-Pro on 18 March 2026. It has over 1 trillion total parameters, 42 billion active, and a 1-million-token context window. Before the official release, the model had appeared anonymously on OpenRouter under the codename "Hunter Alpha," where it drew substantial usage and topped daily charts for several days, according to Xiaomi and Reuters. During its listing on OpenRouter, the model reportedly processed over one trillion tokens in total usage. Xiaomi later said Hunter Alpha was an early internal test build of MiMo-V2-Pro, and Reuters reported that the model had been mistaken by some users for a possible DeepSeek system before Xiaomi confirmed its origin. The model was released as a proprietary API product, and Luo Fuli stated that Xiaomi intended to open-source a variant at an unspecified future date. Xiaomi has partnered with several API web platforms like OpenClaw to launch the model. All these websites initially offered a free trial of this model for a week, but due to the overwhelming response, Xiaomi later extended the free trial period of the model until 2 April 2026. === MiMo-V2-Omni === Alongside MiMo-V2-Pro, Xiaomi launched MiMo-V2-Omni on 18 March 2026. It handles image, video, audio, and text inputs. Before the official release, it was codenamed "Healer Alpha" in OpenRouter. === MiMo-V2-TTS === On the same date as the release of MiMo-V2-Pro and MiMo-V2-Omni, a Text-to-Speech model named MiMo-V2-TTS was released also. It is a speech synthesis model. It was trained on audio data, which makes it capable of emotional transitions, mid-sentence tone shifts, singing, and synthesis of regional dialects like Sichuan, Cantonese, Henan, and Taiwanese. == Licensing == Xiaomi has used different licensing approaches for different models in the MiMo family. The MiMo-7B series and MiMo-V2-Flash were released as open-weight models. MiMo-V2-Flash was published under the MIT license with model weights and inference code available on Hugging Face. MiMo-V2-Pro and MiMo-V2-Omni were released as proprietary models. It was accessible through Xiaomi's API platform and third-party API providers. Luo Fuli stated that Xiaomi intended to open-source a variant of MiMo-V2-Pro. Although, she did not specify any timeline. MiMo-V2-TTS was released as a proprietary model with no publicly available weights.

    Read more →
  • Just This Once

    Just This Once

    Just This Once is a 1993 romance novel written in the style of Jacqueline Susann by a Macintosh IIcx computer named "Hal" in collaboration with its programmer, Scott French. French reportedly spent $40,000 and 8 years developing an artificial intelligence program to analyze Susann's works and attempt to create a novel that Susann might have written. A legal dispute between the estate of Jacqueline Susann and the publisher resulted in a settlement to split the profits, and the book was referenced in several legal journal articles about copyright laws. The book had two small print runs totaling 35,000 copies, receiving mixed reviews. == Creation == The novel's creation spanned the fields of artificial intelligence, expert systems, and natural language processing. Scott French first scanned and analyzed portions of two books by Jacqueline Susann, Valley of the Dolls and Once Is Not Enough, to determine constituents of Susann's writing style, which French stated was the most difficult task. This analysis extracted several hundred components including frequency and type of sexual acts and sentence structure. "Once you're there, the writer's style emerges, part of her actual personality comes out, and the computer can be programmed to make a story." French also created several thousand rules to govern tone, plotting, scenes, and characters. The text generated by Hal, the computer, was intended to mimic what Susann might have written, although the output required significant editing. French credits Hal's work with "almost 100% of the plot, 100% of the theme and style." French estimates that he wrote 10% of the prose, the computer Hal wrote about 25% of the prose, and the remaining two-thirds was more of a collaboration between the two. A typical scenario to write a scene would involve Hal asking questions that French would answer (for example, Hal might ask about the "cattiness factor" involved in a meeting between two key female characters, and French would reply with a range of 1 to 10), and the computer would then generate a few sentences to which French would make minor edits. The process would repeat for the next few sentences until the scene was written. == Legal issues == Jacqueline Susann's publisher was skeptical of the legality of Just This Once, although French doubted that an author's thought processes could be copyrighted. Susann's estate reportedly threatened to sue Scott French but the parties settled out of court; the settlement involved splitting profits between the parties but the terms of the settlement were not disclosed. The publication of Just This Once raised questions in the legal profession concerning how copyright law applies to computer-generated works derived from an analysis of other copyrighted works, and whether the generation of such works infringes on copyright. The publications on this topic suggested that the copyright laws of the time were ill-equipped to deal with computer-generated creative works. == Reception == The book's publisher Steven Shragis of Carol Group said of the novel, "I'm not going to say this is a great literary work, but it's every bit as good as anything out in this field, and better than an awful lot." The novel received some positive early reviews. In USA Today, novelist Thomas Gifford compared Just This Once to another novel in the same genre, American Star by Jackie Collins. Gifford concluded: "If you do like this stuff, you'd be much, much better off with the one written by the computer." The Dead Jackie Susann Quarterly declared that Susann "would be proud. Lots of money, sleaze, disease, death, oral sex, tragedy and the good girl gone bad." Other reviews were mixed. Publishers Weekly wrote, "If the books of Jacqueline Susann and Harold Robbins seem formulaic, this debut novel of sin and success in Las Vegas outdoes them all. And that, in a way, is the point.... All novelty rests in the conceit of computer authorship, not in the story itself." Library Journal stated "French invested eight years and $50,000 in a scheme to use artificial intelligence to fulfill his authentic, if dubious, desire to generate a trashy novel a la Jacqueline Susann. Shallow, beautiful-people characters are flatly conceived and randomly accessed in a formulaic plot ... a sexy, boring morality tale. Of possible interest to computer buffs for its use of Expert Systems and the virtual promise of more worthy possibilities; others should read Susann." Kirkus Reviews wrote: "The deal here is that author French is not the author, he's just the midwife, having allegedly programmed his computer to write about our times just the way Susann would... almost perfectly capturing glamorous Jackie's turgid but E-Z reading prose style and ultrareliable mix of sex, glitz, dope 'n' despair.... One wonders, though, if French's tale spinning PC will do as well on the talkshows as Jackie did. The computer weenies have been trying to tell us for years, garbage in-garbage out."

    Read more →
  • WebPlus

    WebPlus

    Serif WebPlus was a website design program for Microsoft Windows, developed by the software company, Serif. It allows users to design, create and upload their website onto the internet without any knowledge of HTML or other web technologies. Much like Microsoft Word, WebPlus uses WYSIWYG drag and drop editing to add and position text, images and links as they would appear on the finished web page. Once a user has designed their site, WebPlus can preview the site in a web browser before uploading the site using the in-built FTP. The software comes with a variety of pre-designed sample websites containing Filler text like Lorem ipsum, which can be used as a template for quickly designing a site. It also provides drawing tools for creating and editing buttons and web graphics. == Free WebPlus Starter Edition == Previously Serif had made available feature limited Starter Editions of their software, based on older versions, which could be obtained and used free of charge. For WebPlus the final free edition was based on version X5 and this was released in September 2012. This continued to be available from Serif's server until it was withdrawn around March 2016. WebPlus was then only available as a paid-for version X8. == Program Withdrawal == In March 2016, Serif announced that WebPlus X8 would be the final version, and that there were no current plans to design an application to replace it. Sales of WebPlus X8 by Serif were ended around December 2016. In early 2018, Serif announced that Serif Web Resources, hosted on Serif servers and required to implement some advanced web-site functionality in WebPlus created sites, would no longer work after 31 August 2018. In 2018, Serif also shutdown the servers that generated the "Plus" software registration numbers on-line from the product version and the individual generated installation number. Serif revealed the alternative was to use a universal master registration number, which is 881887. This is known to work with post 2003 Serif "Plus" software (e.g. verified to work with PagePlus v5.02). However, later Serif "Plus" software still registers itself automatically if within a certain recent period of a previous Serif software registration on the same PC. == Supported platforms == WebPlus was developed for Microsoft Windows "Win32" graphical desktop interface and is fully compatible with Windows XP, Windows Vista (32/64bit), Windows 7 (32/64bit) and Windows 8. == Features == Web hosting to upload websites to the internet with the address www.sitename.webplus.net and email [email protected]. E-Commerce tool to create online stores with providers such as PayPal. Form wizard generates online forms to collect information from website visitors. Add blogs, forums, hit counters, online polls and content management systems to websites using Smart Objects. Google Maps tool embeds maps and optional navigation markers within a website. Site navigation bars adopt a website's structure providing a tool for navigating around the website. Photo gallery groups a collection of images together and displays them as an animated slideshow. Search engine optimization (SEO) tools optimise a websites search ranking with the likes of Google, Yahoo! and Bing. Collect website metrics such as page popularity and number of website hits using Google Analytics. WebPlus X5 introduced a button studio for creating button graphics. Restrict access to specific pages on a website with a secure member's area. WebPlus automatically converts images and graphics into a web targeted format, optimising them for fast download. Embed YouTube videos within a web page. Add animated effects to a website with Animated GIFs, Animated Marquees or by importing Flash videos. Stream news and information feeds to a website using RSS and podcasts. Automated Site Checker analyses and corrects potential problems with a website. AdSense tool incorporates Google AdSense advertisements into a website In-built FTP transfers files onto a web server, uploading a website to the internet. In-built Basic Photo Editor the PhotoLab can make automatic adjustments and "Quick Fix's" to photos. From X5, WebPlus offers image editing and filters, through its PhotoLab and also provides a dedicated background-removal tool in the form of Cutout Studio. Display images, Flash videos and web pages using animated Lightboxes. Filter Effects can be applied to the graphical objects, giving convincing, realistic effects such as glass, metallic, plastic and other 2D/3D filters. WebPlus also provides QuickShapes for creating button and web graphics. These predefined shapes can be quickly modified with sliders to adjust certain parameters, for example creating rounded rectangles, etc. Shapes include: rectangles, ellipses, stars, spirals, cogs, petals, etc.

    Read more →
  • Jais (language model)

    Jais (language model)

    Jais is an open-source large language model launched in August 2023. Developed as a collaboration between Emirati AI company G42, the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), and US-based Cerebras Systems, Jais was designed to produce high-quality Arabic text and was also trained on English data. The model's creation was motivated by the underrepresentation of the Arabic language in the field of generative artificial intelligence. It aims to provide a more culturally and linguistically accurate model for the world's 400 million Arabic speakers. Its name is a reference to Jebel Jais, the highest mountain in the UAE. == Background and development == Jais was developed in response to the limited availability of advanced generative artificial intelligence models for the Arabic language, despite it being spoken by over 400 million people. Existing models were often trained on limited or low-quality Arabic web content, resulting in poor performance. The project represents a significant investment by the United Arab Emirates in the field of AI as part of its national strategy. The model was created through a partnership between Inception (now Core42), a subsidiary of the Abu Dhabi-based AI company G42; the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); and Cerebras Systems, a US company specializing in AI hardware. The model is named after Jebel Jais, the highest peak in the UAE. == Training == The initial version of Jais released in August 2023 had 13 billion parameters. In November 2023, Core42 released Jais 30B, an improved version with 30 billion parameters. Both models were trained on a subset of the Cerebras Condor Galaxy 1 supercomputer. The training dataset consisted of a mix of Arabic, English, and computer code. According to Timothy Baldwin, a professor of natural language processing at MBZUAI, training the model on a diverse Arabic dataset allows it to switch between dialects. == Features == Jais is designed to generate text in both English and Arabic. The project has also released instruction-tuned "Chat" variants for both the 13B and 30B models, which are specifically optimized for conversational applications. Additional functionality for working with images, graphs, and tabular data is planned for future releases.

    Read more →
  • Brill tagger

    Brill tagger

    The Brill tagger is an inductive method for part-of-speech tagging. It was described and invented by Eric Brill in his 1993 PhD thesis. It can be summarized as an "error-driven transformation-based tagger". It is: a form of supervised learning, which aims to minimize error; and, a transformation-based process, in the sense that a tag is assigned to each word and changed using a set of predefined rules. In the transformation process, if the word is known, it first assigns the most frequent tag, or if the word is unknown, it naively assigns the tag "noun" to it. High accuracy is eventually achieved by applying these rules iteratively and changing the incorrect tags. This approach ensures that valuable information such as the morphosyntactic construction of words is employed in an automatic tagging process. == Algorithm == The algorithm starts with initialization, which is the assignment of tags based on their probability for each word (for example, "dog" is more often a noun than a verb). Then "patches" are determined via rules that correct (probable) tagging errors made in the initialization phase: Initialization: Known words (in vocabulary): assigning the most frequent tag associated to a form of the word Unknown word == Rules and processing == The input text is first tokenized, or broken into words. Typically in natural language processing, contractions such as "'s", "n't", and the like are considered separate word tokens, as are punctuation marks. A dictionary and some morphological rules then provide an initial tag for each word token. For example, a simple lookup would reveal that "dog" may be a noun or a verb (the most frequent tag is simply chosen), while an unknown word will be assigned some tag(s) based on capitalization, various prefix or suffix strings, etc. (such morphological analyses, which Brill calls Lexical Rules, may vary between implementations). After all word tokens have (provisional) tags, contextual rules apply iteratively, to correct the tags by examining small amounts of context. This is where the Brill method differs from other part of speech tagging methods such as those using Hidden Markov Models. Rules are reapplied repeatedly, until a threshold is reached, or no more rules can apply. Brill rules are of the general form: tag1 → tag2 IF Condition where the Condition tests the preceding and/or following word tokens, or their tags (the notation for such rules differs between implementations). For example, in Brill's notation: IN NN WDPREVTAG DT while would change the tag of a word from IN (preposition) to NN (common noun), if the preceding word's tag is DT (determiner) and the word itself is "while". This covers cases like "all the while" or "in a while", where "while" should be tagged as a noun rather than its more common use as a conjunction (many rules are more general). Rules should only operate if the tag being changed is also known to be permissible, for the word in question or in principle (for example, most adjectives in English can also be used as nouns). Rules of this kind can be implemented by simple Finite-state machines. See Part of speech tagging for more general information including descriptions of the Penn Treebank and other sets of tags. Typical Brill taggers use a few hundred rules, which may be developed by linguistic intuition or by machine learning on a pre-tagged corpus. == Code == Brill's code pages at Johns Hopkins University are no longer on the web. An archived version of a mirror of the Brill tagger at its latest version as it was available at Plymouth Tech can be found on Archive.org. The software uses the MIT License.

    Read more →
  • Realization (linguistics)

    Realization (linguistics)

    In linguistics, realization is the process by which some kind of surface representation is derived from its underlying representation; that is, the way in which some abstract object of linguistic analysis comes to be produced in actual language. Phonemes are often said to be realized by speech sounds. The different sounds that can realize a particular phoneme are called its allophones. Realization is also a subtask of natural language generation, which involves creating an actual text in a human language (English, French, etc.) from a syntactic representation. There are a number of software packages available for realization, most of which have been developed by academic research groups in NLG. The remainder of this article concerns realization of this kind. == Example == For example, the following Java code causes the simplenlg system [2] to print out the text The women do not smoke.: In this example, the computer program has specified the linguistic constituents of the sentence (verb, subject), and also linguistic features (plural subject, negated), and from this information the realiser has constructed the actual sentence. == Processing == Realisation involves three kinds of processing: Syntactic realisation: Using grammatical knowledge to choose inflections, add function words and also to decide the order of components. For example, in English the subject usually precedes the verb, and the negated form of smoke is do not smoke. Morphological realisation: Computing inflected forms, for example the plural form of woman is women (not womans). Orthographic realisation: Dealing with casing, punctuation, and formatting. For example, capitalising The because it is the first word of the sentence. The above examples are very basic, most realisers are capable of considerably more complex processing. == Systems == A number of realisers have been developed over the past 20 years. These systems differ in terms of complexity and sophistication of their processing, robustness in dealing with unusual cases, and whether they are accessed programmatically via an API or whether they take a textual representation of a syntactic structure as their input. There are also major differences in pragmatic factors such as documentation, support, licensing terms, speed and memory usage, etc. It is not possible to describe all realisers here, but a few of the emerging areas are: Simplenlg [3]: a document realizing engine with an api which intended to be simple to learn and use, focused on limiting scope to only finding the surface area of a document. KPML [4]: this is the oldest realiser, which has been under development under different guises since the 1980s. It comes with grammars for ten different languages. FUF/SURGE [5]: a realiser which was widely used in the 1990s, and is still used in some projects today OpenCCG [6]: an open-source realiser which has a number of nice features, such as the ability to use statistical language models to make realisation decisions.

    Read more →
  • EPages

    EPages

    ePages is an e-commerce software that allows merchants to create and run online shops in the cloud. The number of shops based on ePages is currently 140,000 worldwide. ePages software is regularly updated due to its Software-as-a-Service model. An investor in the company is United Internet, with a 25% stake. ePages focuses upon distributing its products mainly through hosting providers. ePages is headquartered in Hamburg, with additional offices Barcelona, Jena, and Bilbao. == History == The name ePages was used for the first time for software in 1997 to market "Intershop ePages". In 2002, the product line then called Intershop 4 was taken over by ePages GmbH and renamed to ePages. == Features == Depending on the ePages product and packages offered by hosting providers, merchants can sell up to an unlimited number of items. Users can offer their products and services in 15 languages and with all currencies. With ePages, merchants can use web marketing tools; e.g. newsletters, coupons or social media plug-ins for social commerce.

    Read more →
  • Neural processing unit

    Neural processing unit

    A neural processing unit (NPU), also known as an AI accelerator or deep learning processor, is a class of specialized hardware accelerator or computer system designed to accelerate artificial intelligence and machine learning applications, including artificial neural networks and computer vision. == Use == Their purpose is either to efficiently execute already trained AI models (inference) or to train AI models. NPUs can be more efficient in terms of speed or power consumption. NPU applications include algorithms for robotics, Internet of things, and data-intensive or sensor-driven tasks. They are often manycore or spatial designs and focus on low-precision arithmetic, novel dataflow architectures, or in-memory computing capability. As of 2024, a widely used datacenter-grade AI integrated circuit chip, the Nvidia H100 GPU, contains tens of billions of MOSFETs. === Consumer devices === AI accelerators are used in Apple silicon, Qualcomm, Samsung, Huawei, and Google Tensor smartphone processors. Vision processing units are accelerators specialized for machine vision algorithms such as CNN (convolutional neural networks) and SIFT (scale-invariant feature transform). They are used in devices that need to keep track of objects visually such as AR headsets and drones. It is more recently (circa 2017) added to processors from Apple and (circa 2022) to processors from Intel and AMD. All models of Intel Meteor Lake processors have a built-in versatile processor unit (VPU) for accelerating inference for computer vision and deep learning. On consumer devices, the NPU is intended to be small, power-efficient, but reasonably fast when used to run small models. To do this they are designed to support low-bitwidth operations using data types such as INT4, INT8, FP8, and FP16. A common metric is trillions of operations per second (TOPS). Although TOPS does not explicitly specify the kind of operations, it is typically INT8 additions and multiplications. === Datacenters === Accelerators are used in cloud computing servers: e.g., tensor processing units (TPU) for Google Cloud Platform, and Trainium and Inferentia chips for Amazon Web Services. Many vendor-specific terms exist for devices in this category, and it is an emerging technology without a dominant design. Since the late 2010s, graphics processing units designed by companies such as Nvidia and AMD often include AI-specific hardware in the form of dedicated functional units for low-precision matrix-multiplication operations. These GPUs are commonly used as AI accelerators, both for training and inference. === Scientific computation === Although NPUs are tailored for low-precision (e.g., FP16, INT8) matrix multiplication operations, they can be used to emulate higher-precision matrix multiplications in scientific computing. As modern GPUs place much focus on making the NPU part fast, using emulated FP64 (Ozaki scheme) on NPUs can potentially outperform native FP64. This has been demonstrated using FP16-emulated FP64 on NVIDIA TITAN RTX and using INT8-emulated FP64 on NVIDIA consumer GPUs and the A100 GPU. Consumer GPUs especially benefited as they have limited FP64 hardware capacity, showing a 6× speedup. Since CUDA Toolkit 13.0 Update 2, cuBLAS automatically uses INT8-emulated FP64 matrix multiplication of the equivalent precision if it is faster than native. This is in addition to the FP16-emulated FP32 feature introduced in version 12.9. == Programming == An operating system or a higher-level library may provide application programming interfaces such as TensorFlow with LiteRT Next (Android), CoreML (iOS, macOS) or DirectML (Windows). Formats such as ONNX are used to represent trained neural networks. Consumer CPU-integrated NPUs are accessible through vendor-specific APIs. AMD (Ryzen AI), Intel (OpenVINO), Apple silicon (CoreML), and Qualcomm (SNPE) each have their own APIs, which can be built upon by a higher-level library. GPUs generally use existing GPGPU pipelines such as CUDA and OpenCL adapted for lower precisions and specialized matrix-multiplication operations. Vulkan is also being used. Custom-built systems such as the Google TPU use private interfaces. There are a large number of separate underlying acceleration APIs and compilers/runtimes in use in the AI field, causing a great increase in software development effort due to the many combinations involved. As of 2025, the open standard organization Khronos Group is pursuing standardization of AI-related interfaces to reduce the amount of work needed. Khronos is working on three separate fronts: expansion of data types and intrinsic operations in OpenCL and Vulkan, inclusion of compute graphs in SPIR-V, and a NNEF/SkriptND file format for describing a neural network.

    Read more →
  • Automatic acquisition of sense-tagged corpora

    Automatic acquisition of sense-tagged corpora

    The knowledge acquisition bottleneck is perhaps the major impediment to solving the word-sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words for testing purposes, as it is done in the Senseval exercises. == Existing methods == Therefore, one of the most promising trends in WSD research is using the largest corpus ever accessible, the World Wide Web, to acquire lexical information automatically. WSD has been traditionally understood as an intermediate language engineering technology which could improve applications such as information retrieval (IR). In this case, however, the reverse is also true: Web search engines implement simple and robust IR techniques that can be successfully used when mining the Web for information to be employed in WSD. The most direct way of using the Web (and other corpora) to enhance WSD performance is the automatic acquisition of sense-tagged corpora, the fundamental resource to feed supervised WSD algorithms. Although this is far from being commonplace in the WSD literature, a number of different and effective strategies to achieve this goal have already been proposed. Some of these strategies are: acquisition by direct Web searching (searches for monosemous synonyms, hypernyms, hyponyms, parsed gloss' words, etc.), Yarowsky algorithm (bootstrapping), acquisition via Web directories, and acquisition via cross-language meaning evidences. == Summary == === Optimistic results === The automatic extraction of examples to train supervised learning algorithms reviewed has been, by far, the best explored approach to mine the web for word-sense disambiguation. Some results are certainly encouraging: In some experiments, the quality of the Web data for WSD equals that of human-tagged examples. This is the case of the monosemous relatives plus bootstrapping with Semcor seeds technique and the examples taken from the ODP Web directories. In the first case, however, Semcor-size example seeds are necessary (and only available for English), and it has only been tested with a very limited set of nouns; in the second case, the coverage is quite limited, and it is not yet clear whether it can be grown without compromising the quality of the examples retrieved. It has been shown that a mainstream supervised learning technique trained exclusively with web data can obtain better results than all unsupervised WSD systems which participated at Senseval-2. Web examples made a significant contribution to the best Senseval-2 English all-words system. === Difficulties === There are, however, several open research issues related to the use of Web examples in WSD: High precision in the retrieved examples (i.e., correct sense assignments for the examples) does not necessarily lead to good supervised WSD results (i.e., the examples are possibly not useful for training). The most complete evaluation of Web examples for supervised WSD indicates that learning with Web data improves over unsupervised techniques, but the results are nevertheless far from those obtained with hand-tagged data, and do not even beat the most-frequent-sense baseline. Results are not always reproducible; the same or similar techniques may lead to different results in different experiments. Compare, for instance, Mihalcea (2002) with Agirre and Martínez (2004), or Agirre and Martínez (2000) with Mihalcea and Moldovan (1999). Results with Web data seem to be very sensitive to small differences in the learning algorithm, to when the corpus was extracted (search engines change continuously), and on small heuristic issues (e.g., differences in filters to discard part of the retrieved examples). Results are strongly dependent on bias (i.e., on the relative frequencies of examples per word sense). It is unclear whether this is simply a problem of Web data, or an intrinsic problem of supervised learning techniques, or just a problem of how WSD systems are evaluated (indeed, testing with rather small Senseval data may overemphasize sense distributions compared to sense distributions obtained from the full Web as corpus). In any case, Web data has an intrinsic bias, because queries to search engines directly constrain the context of the examples retrieved. There are approaches that alleviate this problem, such as using several different seeds/queries per sense or assigning senses to Web directories and then scanning directories for examples; but this problem is nevertheless far from being solved. Once a Web corpus of examples is built, it is not entirely clear whether its distribution is safe from a legal perspective. === Future === Besides automatic acquisition of examples from the Web, there are some other WSD experiments that have profited from the Web: The Web as a social network has been successfully used for cooperative annotation of a corpus (OMWE, Open Mind Word Expert project), which has already been used in three Senseval-3 tasks (English, Romanian and Multilingual). The Web has been used to enrich WordNet senses with domain information: topic signatures and Web directories, which have in turn been successfully used for WSD. Also, some research benefited from the semantic information that the Wikipedia maintains on its disambiguation pages. It is clear, however, that most research opportunities remain largely unexplored. For instance, little is known about how to use lexical information extracted from the Web in knowledge-based WSD systems; and it is also hard to find systems that use Web-mined parallel corpora for WSD, even though there are already efficient algorithms that use parallel corpora in WSD.

    Read more →
  • Smart data capture

    Smart data capture

    Smart data capture (SDC), also known as 'intelligent data capture' or 'automated data capture', describes the branch of technology concerned with using computer vision techniques like optical character recognition (OCR), barcode scanning, object recognition and other similar technologies to extract and process information from semi-structured and unstructured data sources. IDC characterize smart data capture as an integrated hardware, software, and connectivity strategy to help organizations enable the capture of data in an efficient, repeatable, scalable, and future-proof way. Data is captured visually from barcodes, text, IDs and other objects - often from many sources simultaneously - before being converted and prepared for digital use, typically by artificial intelligence-powered software. An important feature of SDC is that it focuses not just on capturing data more efficiently but serving up easy-to-access, actionable insights at the instant of data collection to both frontline and desk-based workers, aiding decision-making and making it a two-way process. Smart data capture automates and accelerates capture, applying insights in real time and automating processes based on extracted input. Smart data capture is designed to be repeatable and scalable to reduce low-level manual tasks and eliminate human error. To achieve this goal, smart data capture solutions are often made available using specialist software installed on commodity hardware such as smartphones. However, some solutions may rely on specialized hardware such as dedicated scanning devices, wearables or shop floor robots. == Differences from OCR == Optical character recognition applications are typically concerned with the actual data capture process; they are intended to faithfully reproduce text, words, letters and symbols from a printed document. Smart data capture is multimodal, capable of extracting data from a wider range of semi-structured and unstructured sources, going beyond basic text recognition to offer a wider scope of applications. By extending functionality to provide actionable insights at the point of capture, SDC is also a two-way process (capture-display), while OCR is more commonly one-way (capture only), primarily used for data input. Smart data capture solutions typically have two parts: Data capture (which includes OCR, barcode scanning, object recognition) Functionality that then uses this data to provide actionable insights at the point of capture. == Applications == Smart data capture can be applied to almost any industry and application that requires visual information capture and interpretation. This may include: Retail Warehouse inventory control Logistics, handling and shipping Manufacturing Field service Healthcare Transport and travel Fraud detection

    Read more →
  • Machine-learned interatomic potential

    Machine-learned interatomic potential

    Machine-learned interatomic potentials (MLIPs), or simply machine learning potentials (MLPs), are interatomic potentials constructed using machine learning. Beginning in the 1990s, researchers have employed such programs to construct interatomic potentials by mapping atomic structures to their potential energies. These potentials are referred to as MLIPs or MLPs. Such machine learning potentials promised to fill the gap between density functional theory, a highly accurate but computationally intensive modelling method, and empirically derived or intuitively-approximated potentials, which were far lighter computationally but substantially less accurate. Improvements in artificial intelligence technology heightened the accuracy of MLPs while lowering their computational cost, increasing the role of machine learning in fitting potentials. Machine learning potentials began by using neural networks to tackle low-dimensional systems. While promising, these models could not systematically account for interatomic energy interactions; they could be applied to small molecules in a vacuum, or molecules interacting with frozen surfaces, but not much else – and even in these applications, the models often relied on force fields or potentials derived empirically or with simulations. These models thus remained confined to academia. Modern neural networks construct highly accurate and computationally light potentials, as theoretical understanding of materials science was increasingly built into their architectures and preprocessing. Almost all are local, accounting for all interactions between an atom and its neighbor up to some cutoff radius. There exist some nonlocal models, but these have been experimental for almost a decade. For most systems, reasonable cutoff radii enable highly accurate results. Almost all neural networks intake atomic coordinates and output potential energies. For some, these atomic coordinates are converted into atom-centered symmetry functions. From this data, a separate atomic neural network is trained for each element; each atomic network is evaluated whenever that element occurs in the given structure, and then the results are pooled together at the end. This process – in particular, the atom-centered symmetry functions which convey translational, rotational, and permutational invariances – has greatly improved machine learning potentials by significantly constraining the neural network search space. Other models use a similar process but emphasize bonds over atoms, using pair symmetry functions and training one network per atom pair. Other models to learn their own descriptors rather than using predetermined symmetry-dictating functions. These models, called message-passing neural networks (MPNNs), are graph neural networks. Treating molecules as three-dimensional graphs (where atoms are nodes and bonds are edges), the model takes feature vectors describing the atoms as input, and iteratively updates these vectors as information about neighboring atoms is processed through message functions and convolutions. These feature vectors are then used to predict the final potentials. The flexibility of this method often results in stronger, more generalizable models. In 2017, the first-ever MPNN model (a deep tensor neural network) was used to calculate the properties of small organic molecules. == Gaussian Approximation Potential (GAP) == One popular class of machine-learned interatomic potential is the Gaussian Approximation Potential (GAP), which combines compact descriptors of local atomic environments with Gaussian process regression to machine learn the potential energy surface of a given system. To date, the GAP framework has been used to successfully develop a number of MLIPs for various systems, including for elemental systems such as carbon, silicon, phosphorus, and tungsten, as well as for multicomponent systems such as Ge2Sb2Te5 and austenitic stainless steel, Fe7Cr2Ni. == Equivariant graph neural networks == A significant limitation of early MPNNs was that they were not inherently equivariant to rotations and reflections of atomic structures — meaning predictions could change depending on how a molecule was oriented in space. Beginning around 2021, a new class of models addressed this by incorporating equivariance directly into the message-passing layers using spherical harmonics and irreducible representations. Notable examples include NequIP (2021), MACE (2022), and GemNet-OC (2022). These equivariant architectures proved substantially more data-efficient and accurate than their predecessors, and became the dominant paradigm for high-accuracy MLIPs. == Universal MLIPs and large-scale datasets == Early MLIPs were system-specific, trained on a few thousand structures of a single material. A major shift occurred with the creation of large, chemically diverse datasets enabling models that generalize across many elements, bonding environments, and application domains — so-called universal MLIPs. A key driver was the Open Catalyst Project (OC20, OC22), a collaboration between Meta AI (FAIR) and Carnegie Mellon University launched in 2020. OC20 comprises approximately 1.3 million DFT relaxations across 82 elements, designed to accelerate the discovery of catalysts for renewable energy applications. It was among the first datasets large enough to train GNNs that generalize across diverse chemical systems, and established a widely-used benchmark for the field. A subsequent dataset, Open Direct Air Capture (OpenDAC 2023 and OpenDAC 2025), applied the same approach to carbon capture, providing a large computational database of metal-organic frameworks and sorbent candidates evaluated for CO₂ capture, generated using nearly 400 million CPU hours of quantum chemistry calculations in collaboration with Georgia Tech. These datasets revealed a new challenge: the GNN architectures most effective for atomic simulations were memory-intensive, as they model higher-order interactions between triplets or quadruplets of atoms, making it difficult to scale model size. Graph Parallelism, introduced by Sriram et al. (ICLR 2022), addressed this by distributing a single input graph across multiple GPUs — a distinct strategy from data parallelism (which distributes training examples) or model parallelism (which distributes layers). This enabled training GNNs with hundreds of millions to billions of parameters for the first time. Building on these foundations, Meta FAIR released the Universal Model for Atoms (UMA) in 2025, trained on approximately 500 million unique 3D atomic structures spanning molecules, materials, and catalysts — the largest training run to date for an MLIP. UMA introduced a Mixture of Linear Experts (MoLE) architecture, enabling one model to learn from datasets generated by different DFT codes and settings without significant inference overhead. It matches or surpasses specialized models across catalysis, materials, and molecular benchmarks without task-specific fine-tuning, and has been described as marking a "pre/post-UMA" divide in the field. == Applications == Catalyst discovery: MLIPs have significantly accelerated the computational screening of heterogeneous catalysts by replacing expensive DFT relaxations with fast neural network surrogates. The Open Catalyst Project explicitly targets this application, aiming to identify new catalysts for green hydrogen production and other renewable energy reactions. Carbon capture: The OpenDAC project applies universal MLIPs to screening sorbent materials for direct air capture of CO₂, a key technology for climate change mitigation. AI-accelerated screening allows evaluation of orders of magnitude more candidate materials than traditional DFT workflows. Drug discovery and molecular design: MLIPs are increasingly used in pharmaceutical research to model molecular conformations and binding energies. The Open Molecules 2025 (OMol25) dataset, released by Meta FAIR in 2025, provides high-accuracy calculations for a large set of molecular systems to support this use case. Materials discovery: Universal MLIPs enable high-throughput screening of novel inorganic materials, including battery electrolytes, semiconductors, and superconductors, by rapidly estimating stability and properties across large chemical spaces.

    Read more →
  • LIVAC Synchronous Corpus

    LIVAC Synchronous Corpus

    LIVAC is an uncommon language corpus dynamically maintained since 1995. Different from other existing corpora, LIVAC has adopted a rigorous and regular "Windows" approach in processing and filtering massive media texts from representative Chinese speech communities such as Beijing, Hong Kong, Macau, Taipei, Singapore, Shanghai, as well as Guangzhou, and Shenzhen. The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Taiwan Strait news, as well as news on finance, sports and entertainment. By 2023, more than 3 billion characters of news media texts have been filtered, of which 700 million characters have been processed and analyzed and have yielded an expanding Pan-Chinese dictionary of 2.5 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has at the same time accumulated a large amount of accurate and meaningful statistical data on the Chinese language and on their diverse speech communities in the Pan-Chinese context, and the results show considerable and important long standing as well as evolving variations. The "Windows" approach is the most innovative feature of LIVAC and has enabled Pan-Chinese media texts to be quantitatively analyzed according to various attributes such as locations, time and subject domains. Thus, various types of comparative studies and applications in information technology as well as development of often related innovative applications have been possible. Moreover, LIVAC has allowed longitudinal developments to be taken into account, facilitating Key Word in Context (KWIC) search and comprehensive study of target words and their underlying concepts as well as linguistic structures over the past 25 years, based on the above mentioned variables of location, time and subject. Results from the extensive and accumulative data analysis contained in LIVAC have enabled the cultivation of textual databases of proper names, place names, organization names, new words, and bi-weekly and annual rosters of media figures. Related applications have included the establishment of verb and adjective databases, the formulation of sentiment indices, and related opinion mining, to measure and compare the popularity of global media figures in the Chinese media (LIVAC Annual Pan-Chinese Celebrity Rosters, later renamed as the Pan-Chinese Newsmaker Rosters). Notable among these are the decades long periodic reviews of the 25 years of annual pan-Chinese rosters since 2000 and compilation of new word databases (LIVAC Annual Pan-Chinese New Word Rosters). On this basis, the analysis of the emergence, diffusion and transformation of new words, and the publication of dictionaries of neologisms have been made possible. A recent focus is on the relative balance between disyllabic words and growing trisyllabic words in the Chinese language, and the comparative study of light verbs in three Chinese speech communities. as well as the link between the language use and use of language as a reflection of epochal change in China. A new LIVAC version 3.1 was launched in February 2024. == Corpus data processing == Accessing media texts, manual input, etc. Text unification including conversion from simplified to traditional Chinese characters, stored as Big5 and Unicode versions Automatic word segmentation Automatic alignment of parallel texts Manual verification, part-of-speech tagging Extraction of words and addition to regional sub-corpora Combination of regional sub-corpora to update the LIVAC corpus, and master lexical database == Labeling for data curation == Categories used include general terms and proper names, such as: general names, surnames, semi titles; geographical, organizations and commercial entities, etc.; time, prepositions, locations, etc.; stack-words; loanwords; case-word; numerals, etc. Construction of databases of proper names, place names, and specific terms, etc. Generate rosters: "new word rosters", "celebrity or media personality rosters", "place name rosters", compound words and matched words Other parts of speech tagging for sub-database, such as common nouns, numerals, numeral classifiers, different types of verbs, and of adjectives, pronouns, adverbs, prepositions, conjunctions, particles marking mood, onomatopoeia, interjection, etc. == Applications == Compilation of Pan-Chinese dictionaries or local dictionaries Information technology research, such as predictive Chinese text input for mobile phones, automatic speech to text conversion, opinion mining Comparative studies on linguistic and cultural developments in the Pan-Chinese regions, especially in a critical period of history in modern China. Language teaching and learning research, and speech-to-text conversion Customized service on linguistic research and lexical search for international corporations and government agencies The above applications are provided by the following functions: Word Segmentation Search Phrase Search Example Sentence Selection Multi-word Comparison Word Cloud

    Read more →
  • Large language model

    Large language model

    A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation. LLMs can typically generate, summarize, translate and analyze text in many contexts, and are a foundational technology behind modern chatbots. Biased or inaccurate training data can make an LLM's output less reliable. As of 2026, the most capable LLMs are based on transformer architectures, which, according to the 2017 paper "Attention Is All You Need", can be more efficient and parallelizable than earlier statistical and recurrent neural network models. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety. == History == Before the emergence of transformer-based models in 2017, some language models were considered large relative to the computational and data constraints of their time. In the early 1990s, IBM's statistical models pioneered word alignment techniques for machine translation, laying the groundwork for corpus-based language modeling. In 2001, a smoothed n-gram model, such as those employing Kneser–Ney smoothing, trained on 300 million words, achieved state-of-the-art perplexity on benchmark tests. During the 2000s, with the rise of widespread internet access, researchers began compiling massive text datasets from the web ("web as corpus") to train statistical language models. Moving beyond n-gram models, researchers started in 2000 to use neural networks as language models. Following the breakthrough of deep neural networks in image classification around 2012, similar architectures were adapted for language tasks. This shift was marked by the development of word embeddings (e.g., Word2Vec by Mikolov in 2013) and sequence-to-sequence (seq2seq) models using LSTM. In 2016, Google transitioned its translation service to neural machine translation (NMT), replacing statistical phrase-based models with deep recurrent neural networks. These early NMT systems used LSTM-based encoder-decoder architectures, as they preceded the invention of transformers. At the 2017 NeurIPS conference, Google researchers introduced the transformer architecture in their landmark paper "Attention Is All You Need". This paper's goal was to improve upon 2014 seq2seq technology, and was based mainly on the attention mechanism developed by Bahdanau et al. in 2014. The following year in 2018, BERT was introduced and quickly became "ubiquitous". Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model. Academic and research usage of BERT began to decline in 2023, following rapid improvements in the abilities of decoder-only models (such as GPT) to solve tasks via prompting. Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention because OpenAI claimed to have initially deemed it too powerful to release publicly, out of fear of malicious use. GPT-3 in 2020 went a step further and as of 2025 is available only via API with no offering of downloading the model to execute locally. But it was the consumer-facing chatbot ChatGPT in late 2022 that received extensive media coverage and public attention by 2023. The 2023 GPT-4 was praised for its increased accuracy and as a "holy grail" for its multimodal capabilities. OpenAI did not reveal the high-level architecture and the number of parameters of GPT-4. The release of ChatGPT led to an uptick in LLM usage across several research subfields of computer science, including robotics, software engineering, and societal impact work. In 2024, OpenAI released the reasoning model OpenAI o1, which generates long chains of thought before returning a final answer. Many LLMs with parameter counts comparable to those of OpenAI's GPT series have been developed. Since 2022, weights-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have restrictions on usage and deployment. Mistral AI's open-weight models Mistral 7B and Mixtral 8x7B have a more permissive Apache License. In January 2025, DeepSeek released DeepSeek R1, a 671-billion-parameter open-weight model that performs comparably to OpenAI o1 but at a much lower price per token for users. Since 2023, many LLMs have been trained to be multimodal, having the ability to also process or generate other types of data, such as images, audio, or 3D meshes. Open-weight LLMs have become more influential since 2023. Per Vake et al. (2025), community-driven contributions to open-weight models improve their efficiency and performance via collaborative platforms such as Hugging Face. == Dataset preprocessing == === Tokenization === As machine learning algorithms process numbers rather than text, the text must be converted to numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely assigned to each vocabulary entry, and finally, an embedding is associated with the integer index. Algorithms include byte-pair encoding (BPE) and WordPiece. There are also special tokens serving as control characters, such as [MASK] for masked-out token (as used in BERT), and [UNK] ("unknown") for characters not appearing in the vocabulary. Also, some special symbols are used to denote special text formatting. For example, "Ġ" denotes a preceding whitespace in RoBERTa and GPT and "##" denotes continuation of a preceding word in BERT. For example, the BPE tokenizer used by the legacy version of GPT-3 would split tokenizer: texts -> series of numerical "tokens" as Tokenization also compresses the datasets. Because LLMs generally require input to be an array that is not jagged, the shorter texts must be "padded" until they match the length of the longest one. ==== Byte-pair encoding ==== As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters (including blanks and punctuation marks) are treated as an initial set of n-grams (i.e. initial set of uni-grams). Successively the most frequent pair of adjacent characters is merged into a bi-gram and all instances of the pair are replaced by it. All occurrences of adjacent pairs of (previously merged) n-grams that most frequently occur together are then again merged into even lengthier n-gram, until a vocabulary of prescribed size is obtained. After a tokenizer is trained, any text can be tokenized by it, as long as it does not contain characters not appearing in the initial-set of uni-grams. === Dataset cleaning === In the context of training LLMs, datasets are typically cleaned by removing low-quality, duplicated, or toxic data. Cleaned datasets can increase training efficiency and lead to improved downstream performance. A trained LLM can be used to clean datasets for training a further LLM. With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it). === Synthetic data === Training of largest language models might need more linguistic data than naturally available, or that the naturally occurring data is of insufficient quality. In these cases, synthetic data might be used. == Training == An LLM is a type of foundation model (large X model) trained on language. LLMs can be trained in different ways. In particular, GPT models are first pretrained to predict the next word on a large amount of data, before being fine-tuned. === Cost === Substantial infrastructure is necessary for training the largest models. The tendency towards larger models is visible in the list of large language models. For example, the training of GPT-2 (i.e. a 1.5-billion-parameter model) in 2019 cost $50,000, while training of the PaLM (i.e. a 540-billion-parameter model) in 2022 cost $8 million, and Megatron-Turing NLG 530B (in 2021) cost around $11 million. The qualifier "large" in "large language model" is inherently vague, as there is no definitive threshold for the number of parameters required to qualify as "large". === Fine-tuning === Before being fine-tuned, most LLMs are next-token predictors. The fine-tuning shapes the LLM's behavior via techniques like reinforcement learning from human feedback (RLHF) or constitutional AI. Instruction fine-tuning is a form of supervised learning used to teach LLMs to follow user instructions. In 2022, OpenAI demonstrated InstructGPT, a version of GPT-3 similarly fine-tuned to follow instructions. Reinforcement learning from human feedback (RLHF) involves training a reward model to predict which text humans prefer. Then, the LLM can be fine-tuned through reinforcement learning to better satisfy this reward model. Since humans typically prefer truthful, helpful and harmless answers, RLHF favors such answers. == Architecture == LLMs are generally based on the tra

    Read more →