Tensor (machine learning)

In machine learning, the term tensor informally refers to two different concepts: (i) a way of organizing data and (ii) a multilinear (tensor) transformation. Data may be organized in a multidimensional array (M-way array), informally referred to as a "data tensor"; however, in the strict mathematical sense, a tensor is a multilinear mapping from a set of domain vector spaces to a range vector space. Observations, such as images, movies, volumes, sounds, and relationships among words and concepts, stored in an M-way array ("data tensor"), may be analyzed either by artificial neural networks or by tensor methods. Tensor decomposition factors data tensors into smaller tensors. Operations on data tensors can be expressed in terms of matrix multiplication and the Kronecker product. The computation of gradients, a crucial aspect of backpropagation, can be performed using software libraries such as PyTorch and TensorFlow. Computations are often performed on graphics processing units (GPUs) using CUDA, and on dedicated hardware such as Google's Tensor Processing Unit or Nvidia's Tensor core. These developments have greatly accelerated neural network computation and increased the size and complexity of models that can be trained. == History == A tensor is by definition a multilinear map. In mathematics, this may express a multilinear relationship between sets of algebraic objects. In physics, tensor fields, considered as tensors at each point in space, are useful in expressing mechanics such as stress or elasticity. In machine learning, the exact use of tensors depends on the statistical approach being used. In 2001, the fields of signal processing and statistics were making use of tensor methods. Pierre Comon surveys the early adoption of tensor methods in the fields of telecommunications, radio surveillance, chemometrics and sensor processing. 
Linear tensor rank methods (such as Parafac/CANDECOMP) analyzed M-way arrays ("data tensors") composed of higher order statistics that were employed in blind source separation problems to compute a linear model of the data. Comon noted several early limitations in determining the tensor rank and in computing an efficient tensor rank decomposition. In the early 2000s, multilinear tensor methods crossed over into computer vision, computer graphics and machine learning with papers by Vasilescu, alone or in collaboration with Terzopoulos, such as Human Motion Signatures, TensorFaces, TensorTextures and Multilinear Projection. Multilinear algebra, the algebra of higher-order tensors, is a suitable and transparent framework for analyzing the multifactor structure of an ensemble of observations and for addressing the difficult problem of disentangling the causal factors based on second order or higher order statistics associated with each causal factor. Tensor (multilinear) factor analysis disentangles and reduces the influence of different causal factors with multilinear subspace learning. When treating an image or a video as a 2- or 3-way array, i.e., "data matrix/tensor", tensor methods reduce spatial or time redundancies as demonstrated by Wang and Ahuja. Yoshua Bengio, Geoff Hinton and their collaborators briefly discuss the relationship between deep neural networks and tensor factor analysis beyond the use of M-way arrays ("data tensors") as inputs. One of the early uses of tensors for neural networks appeared in natural language processing. A single word can be expressed as a vector via Word2vec. Thus a relationship between two words can be encoded in a matrix. However, for more complex relationships such as subject-object-verb, it is necessary to build higher-dimensional networks. In 2009, the work of Sutskever introduced Bayesian Clustered Tensor Factorization to model relational concepts while reducing the parameter space. 
From 2014 to 2015, tensor methods became more common in convolutional neural networks (CNNs). Tensor methods organize neural network weights in a "data tensor", then analyze and reduce the number of weights. Lebedev et al. accelerated CNNs for character classification (the recognition of letters and digits in images) by using 4D kernel tensors. == Definition == Let $\mathbb{F}$ be a field (such as the real numbers $\mathbb{R}$ or the complex numbers $\mathbb{C}$). A tensor $\mathcal{T} \in \mathbb{F}^{I_0 \times I_1 \times \ldots \times I_C}$ is a multilinear transformation from a set of domain vector spaces to a range vector space: $\mathcal{T} : \{\mathbb{F}^{I_1} \times \mathbb{F}^{I_2} \times \ldots \times \mathbb{F}^{I_C}\} \mapsto \mathbb{F}^{I_0}$. Here, $C$ and $I_0, I_1, \ldots, I_C$ are positive integers, and $(C+1)$ is the number of modes of a tensor (also known as the number of ways of a multi-way array). The dimensionality of mode $c$ is $I_c$, for $0 \leq c \leq C$. In statistics and machine learning, an image is vectorized when viewed as a single observation, and a collection of vectorized images is organized as a "data tensor". 
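As a rough illustration of the vectorization step just described, the following sketch flattens each image into one observation vector and stacks the vectors as columns of a pixels-by-images array. The function names and the row-major flattening order are assumptions made for this example; when images are also labelled by causal factors, the single image index would be split into several modes.

```python
from typing import List

Image = List[List[float]]  # a 2-way array: rows of pixel values


def vectorize(image: Image) -> List[float]:
    """Flatten an image row by row into one observation vector."""
    return [pixel for row in image for pixel in row]


def stack_observations(images: List[Image]) -> List[List[float]]:
    """Arrange vectorized images as columns of a pixels-by-images array."""
    vectors = [vectorize(image) for image in images]
    n_pixels = len(vectors[0])
    if any(len(v) != n_pixels for v in vectors):
        raise ValueError("all images must have the same number of pixels")
    # Row p holds pixel p of every image; column i is observation i.
    return [[v[p] for v in vectors] for p in range(n_pixels)]
```

For two 2-by-2 images the result has four rows (pixels) and two columns (observations).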
For example, a set of facial images $\{\mathbf{d}_{i_p,i_e,i_l,i_v} \in \mathbb{R}^{I_X}\}$ with $I_X$ pixels that are the consequences of multiple causal factors, such as a facial geometry $i_p$ ($1 \leq i_p \leq I_P$), an expression $i_e$ ($1 \leq i_e \leq I_E$), an illumination condition $i_l$ ($1 \leq i_l \leq I_L$), and a viewing condition $i_v$ ($1 \leq i_v \leq I_V$), may be organized into a data tensor (i.e., multiway array) $\mathcal{D} \in \mathbb{R}^{I_X \times I_P \times I_E \times I_L \times I_V}$, where $I_P$ is the total number of facial geometries, $I_E$ is the total number of expressions, $I_L$ is the total number of illumination conditions, and $I_V$ is the total number of viewing conditions. Tensor factorization methods such as TensorFaces and multilinear (tensor) independent component analysis factorize the data tensor into a set of vector spaces that span the causal factor representations, where an image is the result of a tensor transformation $\mathcal{T}$ that maps a set of causal factor representations to the pixel space. Another approach to using tensors in machine learning is to embed various data types directly. For example, a grayscale image is commonly represented as a discrete 2-way array $\mathbf{D} \in \mathbb{R}^{I_{RX} \times I_{CX}}$ with dimensionality $I_{RX} \times I_{CX}$, where $I_{RX}$ is the number of rows and $I_{CX}$ is the number of columns. When an image is treated as a 2-way array or 2nd order tensor (i.e. 
as a collection of column/row observations), tensor factorization methods compute the image column space, the image row space and the normalized PCA coefficients or the ICA coefficients. Similarly, a color image with RGB channels, $\mathcal{D} \in \mathbb{R}^{N \times M \times 3}$, may be viewed as a 3rd order data tensor or 3-way array. In natural language processing, a word might be expressed as a vector $v$ via the Word2vec algorithm. Thus $v$ becomes a mode-1 tensor $v \mapsto \mathcal{A} \in \mathbb{R}^{N}$. The embedding of subject-object-verb semantics requires embedding relationships among three words. Because a word is itself a vector, subject-object-verb semantics could be expressed using mode-3 tensors $v_a \times v_b \times v_c \mapsto \mathcal{A} \in \mathbb{R}^{N \times N \times N}$. In practice the neural network designer is primarily concerned with the specification of embeddings, the connection of tensor layers, and the operations performed on them in a network. Modern machine learning frameworks manage the optimization, tensor factorization and backpropagation automatically. === As unit values === Tensors may be used as the unit values of neural networks which extend the concept of scalar, vector and matrix values to multiple dimensions. The output value of single layer unit $y_m$ is the sum-product of its input units and the connection weights filtered through the activation function $f$: $y_m = f\left(\sum_n x_n u_{m,n}\right)$, where $y_m \in \mathbb{R}$.
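The two constructions above can be sketched in plain Python. This is a minimal illustration, not a framework implementation: `outer3` takes the outer product of three word vectors as one simple way to obtain an N-by-N-by-N array (the article does not say which embedding is intended, so this is an assumption), and `layer_output` evaluates the sum-product formula for every unit $m$. Both function names and the default `tanh` activation are hypothetical choices.

```python
import math
from typing import Callable, List, Sequence


def outer3(a: Sequence[float], b: Sequence[float],
           c: Sequence[float]) -> List[List[List[float]]]:
    """Outer product of three vectors: entry [i][j][k] is a[i] * b[j] * c[k]."""
    return [[[x * y * z for z in c] for y in b] for x in a]


def layer_output(x: Sequence[float], u: Sequence[Sequence[float]],
                 f: Callable[[float], float] = math.tanh) -> List[float]:
    """Compute y_m = f(sum_n x_n * u[m][n]) for each unit m."""
    return [f(sum(x_n * u_mn for x_n, u_mn in zip(x, row))) for row in u]
```

With the identity function passed as `f`, `layer_output` reduces to a matrix-vector product, which makes the role of the activation function easy to see in isolation.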

Confused deputy problem

In information security, a confused deputy is a computer program that is tricked by another program (with fewer privileges or fewer rights) into misusing its authority on the system. It is a specific type of privilege escalation. The confused deputy problem is often cited as an example of why capability-based security is important. Capability systems can mitigate the confused deputy problem by eliminating ambient authority, allowing programs to act only on resources for which they hold explicit capabilities, whereas access-control list–based systems are more susceptible to it. However, this protection depends on correct implementation; in formally verified capability systems such as seL4, it can be shown that the kernel enforces capability constraints correctly, preventing such behavior at the system level. == Example == In the original example of a confused deputy, there was a compiler program provided on a commercial timesharing service. Users could run the compiler and optionally specify a filename where it would write debugging output, and the compiler would be able to write to that file if the user had permission to write there. The compiler also collected statistics about language feature usage. Those statistics were stored in a file called "(SYSX)STAT", in the directory "SYSX". To make this possible, the compiler program was given permission to write to files in SYSX. But there were other files in SYSX: in particular, the system's billing information was stored in a file "(SYSX)BILL". A user ran the compiler and named "(SYSX)BILL" as the desired debugging output file. This produced a confused deputy problem. The compiler made a request to the operating system to open (SYSX)BILL. Even though the user did not have access to that file, the compiler did, so the open succeeded. 
The compiler wrote the compilation output to the file (here "(SYSX)BILL") as normal, overwriting it, and the billing information was destroyed. === The confused deputy === In this example, the compiler program is the deputy because it is acting at the request of the user. The program is seen as 'confused' because it was tricked into overwriting the system's billing file. Whenever a program tries to access a file, the operating system needs to know two things: which file the program is asking for, and whether the program has permission to access the file. In the example, the file is designated by its name, “(SYSX)BILL”. The program receives the file name from the user, but does not know whether the user had permission to write the file. When the program opens the file, the system uses the program's permission, not the user's. When the file name was passed from the user to the program, the permission did not go along with it; the permission was increased by the system silently and automatically. It is not essential to the attack that the billing file be designated by a name represented as a string. The essential points are that: the designator for the file does not carry the full authority needed to access the file; the program's own permission to access the file is used implicitly. == Other examples == A cross-site request forgery (CSRF) is an example of a confused deputy attack that uses the web browser to perform sensitive actions against a web application. A common form of this attack occurs when a web application uses a cookie to authenticate all requests transmitted by a browser. Using JavaScript, an attacker can force a browser into transmitting authenticated HTTP requests. The Samy computer worm used cross-site scripting (XSS) to turn the browser's authenticated MySpace session into a confused deputy. 
Using XSS the worm forced the browser into posting an executable copy of the worm as a MySpace message which was then viewed and executed by friends of the infected user. Clickjacking is an attack where the user acts as the confused deputy. In this attack a user thinks they are harmlessly browsing a website (an attacker-controlled website) but they are in fact tricked into performing sensitive actions on another website. An FTP bounce attack can allow an attacker to connect indirectly to TCP ports to which the attacker's machine has no access, using a remote FTP server as the confused deputy. Another example relates to personal firewall software. It can restrict Internet access for specific applications. Some applications circumvent this by starting a browser with instructions to access a specific URL. The browser has authority to open a network connection, even though the application does not. Firewall software can attempt to address this by prompting the user in cases where one program starts another which then accesses the network. However, the user frequently does not have sufficient information to determine whether such an access is legitimate—false positives are common, and there is a substantial risk that even sophisticated users will become habituated to clicking "OK" to these prompts. Not every program that misuses authority is a confused deputy. Sometimes misuse of authority is simply a result of a program error. The confused deputy problem occurs when the designation of an object is passed from one program to another, and the associated permission changes unintentionally, without any explicit action by either party. It is insidious because neither party did anything explicit to change the authority. Another example is when an administrator authorizes an AI agent to act on their behalf, and that AI subsequently delegates authority to another AI agent neither vetted nor authorized by the original administrator. 
The unvetted AI can then act without permissions or oversight from the original developer. == Solutions == In some systems it is possible to ask the operating system to open a file using the permissions of another client. This solution has some drawbacks: It requires explicit attention to security by the server. A naive or careless server might not take this extra step. It becomes more difficult to identify the correct permission if the server is in turn the client of another service and wants to pass along access to the file. It requires the client to trust the server to not abuse the borrowed permissions. Note that intersecting the server's and the client's permissions does not solve the problem either, because the server may then have to be given very wide permissions (all of the time, rather than those needed for a given request) in order to act for arbitrary clients. The simplest way to solve the confused deputy problem is to bundle together the designation of an object and the permission to access that object. This is exactly what a capability is. Using capability security in the compiler example, the client would pass to the server a capability to the output file, such as a file descriptor, rather than the name of the file. Since the client lacks a capability to the billing file, it cannot designate that file for output. In the cross-site request forgery example, a URL supplied "cross"-site would include its own authority independent of that of the client of the web browser.
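The difference between passing a file name and passing a capability can be illustrated with a toy model of the compiler example. Everything here is hypothetical scaffolding: `ToyFileSystem`, `WriteCapability` and the two `compile_*` functions are invented for the sketch and do not correspond to any real operating system interface.

```python
class ToyFileSystem:
    """In-memory stand-in for a file system with per-file writer lists."""

    def __init__(self):
        self.contents = {}  # file name -> text
        self.writers = {}   # file name -> principals allowed to write

    def create(self, name, writers, text=""):
        self.contents[name] = text
        self.writers[name] = set(writers)

    def open_for_write(self, principal, name):
        # The check is made against whoever issues the open request.
        if principal not in self.writers.get(name, set()):
            raise PermissionError(f"{principal} may not write {name}")
        return WriteCapability(self, name)


class WriteCapability:
    """Designation and permission bundled together, like a file descriptor."""

    def __init__(self, fs, name):
        self._fs = fs
        self._name = name

    def write(self, text):
        self._fs.contents[self._name] = text


def compile_with_name(fs, debug_name):
    # Vulnerable deputy: it opens the user-supplied name under its own
    # identity, so its permission to write in SYSX is applied implicitly.
    fs.open_for_write("compiler", debug_name).write("debug output")


def compile_with_capability(debug_file):
    # Capability style: the deputy can write only to what it was handed.
    debug_file.write("debug output")
```

In this model a user who calls `compile_with_name(fs, "(SYSX)BILL")` overwrites the billing file, while a user who must first obtain a `WriteCapability` through `open_for_write("user", ...)` is refused before the compiler is ever involved.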

Amália (LLM)

Amália is a Portuguese large language model (LLM) announced in November 2024 by the Portuguese Prime Minister Luís Montenegro. Its final version is expected to be launched in 2026. It is being developed by the Center for Responsible AI (Centro para a AI Responsável) and by the research centers of NOVA School of Science and Technology and Instituto Superior Técnico. == History == In 2024 it was announced that the Portuguese Agency for Administrative Modernization (Agência para a Modernização Administrativa) would transpose this LLM to the Portuguese Public Administration. According to Paulo Dimas (CEO of the Center for Responsible AI), the three fundamental points of this LLM project are the linguistic variant (European Portuguese), cultural representation and data protection. In April 2025 it was announced that Amália had entered its beta phase, with an improved version expected to launch in September 2025. The beta version released in September is available only to the Public Administration, but the website launched in October reiterates that the final version will be an open model.

LMArena

Arena (formerly LMArena and Chatbot Arena) is a public, web-based platform that evaluates large language models (LLMs). Users enter prompts for two anonymous models to respond to and vote on the model that gave the better response, after which the models' identities are revealed. Users can also choose models to test themselves via the "Direct" selection. Companies that have supplied their large language models to the platform include OpenAI, Google DeepMind, and Anthropic. The website has been used for preview releases of upcoming models. Chinese company DeepSeek tested its prototype models in the Arena months before its R1 model gained attention in Western media. Other notable pre-release models include OpenAI's GPT-5 under the codename "summit" and Google DeepMind's Gemini 2.5 Flash Image (an image-generation and editing model) under the codename "Nano Banana". Research has identified specific limitations in Arena's methodology. == History == Chatbot Arena was released on April 24, 2023. In June 2024, Chatbot Arena added image support. In September 2024, Chatbot Arena moved to its own dedicated domain name, lmarena.ai (or LMArena). In April 2025, Meta released Llama 4. Llama 4 Maverick beat GPT-4o and Gemini 2.0 Flash on LMArena, but the version of Maverick on LMArena unfairly differed from the publicly available version. LMArena updated its policies in response. In April 2025, LMArena incorporated as an independent company. That May, LMArena raised $100 million in a seed funding round, valuing the company at $600 million. Participants in the seed funding round included Andreessen Horowitz, UC Investments, Lightspeed Venture Partners, Felicis Ventures, and Kleiner Perkins. On January 6, 2026, LMArena announced the closing of a $150 million Series A funding round, bringing the company's post-money valuation to approximately $1.7 billion. 
The round was led by Felicis and UC Investments (University of California), with participation from Andreessen Horowitz, The House Fund, LDVP, Kleiner Perkins, Lightspeed Venture Partners, and Laude Ventures. In January 2026, LMArena added video support. On January 28, 2026, LMArena rebranded to "Arena".
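The pairwise voting described at the start of this article produces a list of (winner, loser) outcomes. As a minimal sketch of how such votes could be summarized, the function below computes each model's share of won comparisons. This is an illustrative tally only; the article does not describe how Arena actually converts votes into rankings, and the function name and model labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def win_rates(votes: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of comparisons each model won.

    Each vote is a (winner, loser) pair from one head-to-head comparison.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}
```

A plain win rate ignores how strong each opponent was, which is one reason rating systems that model pairwise strength are often preferred for this kind of data.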

Colloquis

Colloquis, previously known as ActiveBuddy and Conversagent, was a company that created conversation-based interactive agents originally distributed via instant messaging platforms. The company had offices in New York, New York, and Sunnyvale, California. == History == Founded in 2000, the company was the brainchild of Robert Hoffer, Timothy Kay, and Peter Levitan. The idea for interactive agents (also known as Internet bots) came from the team's vision to add functionality to increasingly popular instant messaging services. The original implementation took shape as a word-based adventure game but quickly grew to include a wide range of database applications, including access to news, weather, stock information, movie times, Yellow Pages listings, and detailed sports data, as well as a variety of tools (calculators, translator, etc.). These various applications were bundled into one entity and launched as SmarterChild in 2001. SmarterChild acted as a showcase for the quick data access and possibilities for fun conversation that the company planned to turn into customized, niche-specific products. The rapid success of SmarterChild led to targeted promotional products for Radiohead, Austin Powers, The Sporting News, and others. ActiveBuddy sought to strengthen its hold on the interactive agent market for the future by filing for, and receiving, a controversial patent on its creation in 2002. The company also released the BuddyScript SDK, a free developer kit that allows programmers to design and launch their own interactive agents using ActiveBuddy's proprietary scripting language, in 2002. Ultimately, however, the decline in ad spending in 2001 and 2002 led to a shift in corporate strategy towards business-focused Automated Service Agents, building products for clients including Cingular, Comcast and Cox Communications. The company subsequently changed its name from ActiveBuddy to Conversagent in 2003, and then again to Colloquis in 2006. 
Colloquis was purchased by Microsoft in October 2006.

Concordancer

A concordancer is a computer program that automatically constructs a concordance—an alphabetised index of every occurrence of a word or phrase in a body of text, each entry displayed with its surrounding context. Concordancers are primary tools in corpus linguistics, lexicography, computer-assisted translation, and language teaching. The most common display format is the key word in context (KWIC) layout, in which each hit appears centred on a line with a fixed span of words to its left and right, enabling rapid scanning of usage patterns across many occurrences. == History == === Pre-computational concordances === The compilation of concordances predates computers by many centuries. Around 1230, the French Dominican cardinal Hugh of Saint-Cher directed a team of friars in assembling a concordance of the Latin Vulgate Bible, generally regarded as the first systematic concordance of any text. To help readers locate passages, Hugh divided each biblical chapter into lettered sections. Later milestones include a Hebrew Old Testament concordance compiled by Rabbi Mordecai Nathan (1448), Alexander Cruden's Complete Concordance to the Holy Scriptures (1737), and the manuscript Asaf ha-Mazkir, an unfinished concordance to the Babylonian Talmud compiled by Moses Rigotz around the turn of the 19th century. === First computer concordance === The first concordance produced with computing assistance was the Index Thomisticus, a comprehensive lexical index of the writings of and around Thomas Aquinas, totalling approximately 10.6 million Latin words. The Italian Jesuit priest Roberto Busa conceived the project in 1946 and secured the sponsorship of IBM in 1949 after a meeting with chairman Thomas J. Watson. Keypunch operators in Gallarate, Italy, encoded the texts onto punched cards from around 1950. IBM executive Paul Tasman developed the processing methods. 
The full 56-volume printed edition was completed around 1980, followed by a CD-ROM edition in 1989 and a web-accessible version in 2005. === The KWIC format === The key word in context (KWIC) display was formalised as a computational technique by Hans Peter Luhn, a researcher at IBM, in a 1960 paper in American Documentation. In KWIC output, each instance of the search term (the node word) is centred on a line with a fixed window of words to each side; sorting the resulting lines alphabetically by the immediately adjacent word reveals collocational and phraseological patterns at a glance. === COCOA === One of the first dedicated concordancing programs was COCOA (COunt and COncordance Generation on Atlas), created in 1965 by D. B. Russell at University College London and the Atlas Computer Laboratory in Harwell, Oxfordshire. Written in approximately 4,000 cards of FORTRAN, it processed text annotated with flat, non-hierarchical markup tags and could produce word counts and concordances in multiple languages. Within its first six months COCOA had been applied to texts in at least six languages. A second version designed for multiple mainframe platforms was distributed to British computing centres in the mid-1970s. Growing dissatisfaction with its interface and the eventual withdrawal of Atlas Laboratory support prompted British funding bodies to commission a successor program. === Oxford Concordance Program === The Oxford Concordance Program (OCP) was designed and written in FORTRAN by Susan Hockey and Ian Marriott at Oxford University Computing Services (OUCS) between 1979 and 1980 and first released in 1981. Hockey and Marriott acknowledged that OCP owed much to COCOA and the CLOC system at the University of Birmingham. 
OCP accepted COCOA-format markup to encode metadata such as author, act, scene, and line number, and was described by its authors as "a machine-independent text analysis program for producing word lists, indices and concordances in a variety of languages and alphabets." By the mid-1980s it had been licensed to approximately 240 institutions in 23 countries. A personal computer version, Micro-OCP, was developed for the IBM PC and sold by Oxford University Press from the late 1980s. Version 2 was rewritten in 1985–86 and documented in the same 1987 article by Hockey and co-author John Martin. === Personal computer era === The availability of affordable personal computers in the 1980s and 1990s enabled standalone concordancing applications that analysts could run locally without specialist computing facilities. MicroConcord, developed by Mike Scott and Tim Johns and published by Oxford University Press in 1993 for MS-DOS, was among the first concordancers designed specifically for classroom language teaching. WordSmith Tools, also developed by Mike Scott, was first released in 1996 and became one of the most widely used corpus analysis suites in academic linguistics research. Other tools from this era include TACT (University of Toronto, 1989), a suite of MS-DOS freeware programs for literary text analysis, and MonoConc, a Windows concordancer created by Michael Barlow. === Web-based concordancers === From the late 1990s onwards, web-based concordancers hosted on remote servers gave researchers browser access to large preloaded corpora without requiring local storage or processing. The Sketch Engine, developed by Adam Kilgarriff and Pavel Rychlý (Masaryk University), was launched commercially in July 2003 by Lexical Computing Limited and introduced word sketches—automatically generated one-page profiles of a word's typical grammatical relations and collocations. 
AntConc, created by Laurence Anthony at Waseda University, Tokyo, was first released in 2002 as freeware for Windows, macOS, and Linux. == Features == Modern concordancers typically offer a range of analytical functions beyond basic KWIC display. These commonly include: KWIC display with the node word centred and context words in aligned columns, sortable by the word one, two, or three positions to the left or right of the node (L1–L3 and R1–R3); concordance plots, visualising the distribution of hits as marks along a scaled bar representing each text in the corpus; frequency and word lists, both alphabetical and ranked by frequency; collocation statistics, identifying words that co-occur with the search term more often than chance, quantified by measures such as mutual information, the t-score, or log-likelihood; keyword analysis, comparing word frequencies between a study corpus and a reference corpus to identify statistically distinctive items; n-gram analysis, finding frequently recurring word sequences of a specified length; part-of-speech tagging integration, allowing searches filtered to particular grammatical categories; and Unicode support for multilingual text. Bilingual and parallel concordancers additionally display aligned text in two or more languages side by side, enabling comparison of translation equivalents across language pairs. == Notable concordancers == === WordSmith Tools === Created by Mike Scott and first released in 1996, WordSmith Tools is a Windows corpus analysis suite that evolved from MicroConcord. Its three core modules are Concord (KWIC concordances), WordList (frequency and alphabetical word lists), and Keywords (statistical keyword identification relative to a reference corpus). Oxford University Press used WordSmith Tools for dictionary preparation work. Version 4.0 is freely available; later versions are sold by Lexical Analysis Software Limited. 
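The KWIC display and R1 sorting listed under Features can be sketched in a few lines. This is a simplified illustration and not the behaviour of any particular tool named in this article: tokenisation is a lowercase regular-expression word split, and the function names, default window size and column width are assumptions.

```python
import re
from typing import List, Tuple

Hit = Tuple[List[str], str, List[str]]  # (left context, node word, right context)


def kwic(text: str, node: str, window: int = 3) -> List[Hit]:
    """Collect every occurrence of `node` with `window` words on each side."""
    tokens = re.findall(r"\w+", text.lower())
    target = node.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits


def sort_by_r1(hits: List[Hit]) -> List[Hit]:
    """Sort hits alphabetically by the first word to the right of the node."""
    return sorted(hits, key=lambda hit: hit[2][0] if hit[2] else "")


def format_lines(hits: List[Hit], width: int = 30) -> List[str]:
    """Right-align the left context so the node words line up in a column."""
    lines = []
    for left, node, right in hits:
        left_text = " ".join(left)
        right_text = " ".join(right)
        lines.append(f"{left_text:>{width}}  {node}  {right_text}")
    return lines
```

Sorting by the R1 position groups lines that share the same following word, which is what makes recurring phrases visible when scanning the output.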
=== AntConc === AntConc is a freeware, multiplatform concordancing toolkit created by Laurence Anthony, Professor of Applied Linguistics at Waseda University, Tokyo. First released in 2002 and formally described in a 2005 academic paper, it runs on Windows, macOS, and Linux. Its tools include a KWIC concordancer, a concordance plot for visualising distribution across texts, a collocates tool, a keyword list, and an n-gram analysis module. Because it is free and requires only plain text files, AntConc is widely used in linguistics courses and independent research worldwide. === Sketch Engine === The Sketch Engine is a corpus management and query system co-created by Adam Kilgarriff and Pavel Rychlý and launched in 2003 by Lexical Computing Limited. It provides browser-based access to over 800 corpora in more than 100 languages. Beyond concordance searching, it offers word sketches, collocation analysis, distributional thesaurus construction, keyword and terminology extraction, and diachronic analysis. It is used by major publishers including Macmillan and Oxford University Press for lexicographic research. A subset tool, SKELL (Sketch Engine for Language Learning), is freely accessible to individual learners. === Wmatrix === Wmatrix is a web-based corpus processing environment developed by Paul Rayson at the University Centre for Computer Corpus Research on Language (UCREL), Lancaster University. Alongside concordances and frequency lists, Wmatrix integrates CLAWS part-of-speech tagging and the USAS semantic tagger, enabling keyword analysis simultane

VLLM

vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab, the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs. According to a project maintainer, the "v" in vLLM originally referred to "virtual", inspired by virtual memory. == History == vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley. Its core ideas were described in the 2023 paper Efficient Memory Management for Large Language Model Serving with PagedAttention, which presented the system as a high-throughput and memory-efficient serving engine for large language models. In 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project. PyTorch's project page states that the University of California, Berkeley contributed vLLM to the Linux Foundation in July 2024. In January 2026, TechCrunch reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding. == Architecture == According to its 2023 paper, vLLM was designed to improve the efficiency of large language model serving by reducing memory waste in the key–value cache used during transformer inference. The paper introduced PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems, and described vLLM as using block-level memory management and request scheduling to increase throughput while maintaining similar latency. The project documentation and repository describe support for continuous batching, chunked prefill, speculative decoding, prefix caching, quantization, and multiple forms of distributed inference and serving. 
PyTorch has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel processors.
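The block-level memory management described under Architecture can be illustrated with a toy allocator in which each request's key–value cache grows one fixed-size block at a time, and a per-request block table maps logical positions to physical blocks, much as a page table does in an operating system. This is a conceptual sketch only, not vLLM's code or API; the class name, its methods and the block size are hypothetical.

```python
from typing import Dict, List, Tuple


class ToyBlockAllocator:
    """Toy paged allocator: request id -> list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, request_id: str) -> Tuple[int, int]:
        """Reserve a slot for one more token; return (block id, offset)."""
        table = self.tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # A new block is taken only when every allocated block is full,
        # so at most one partially filled block exists per request.
        if length == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, request_id: str) -> None:
        """Return all blocks of a finished request to the free pool."""
        self.free.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because blocks need not be contiguous, memory freed by one finished request can be reused immediately by any other request, which is the waste-reduction idea the paragraph above attributes to PagedAttention.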