History of natural language processing

The history of natural language processing describes the advances of natural language processing. There is some overlap with the history of machine translation, the history of speech recognition, and the history of artificial intelligence. == Early history == The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine. The first patents for "translating machines" were applied for in the mid-1930s. One proposal, by Georges Artsrouni, was simply an automatic bilingual dictionary using paper tape. The other proposal, by Peter Troyanskii, a Russian, was more detailed. Troyanskii’s proposal included both the bilingual dictionary and a method for dealing with grammatical roles between languages, based on Esperanto. == Logical period == In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably — on the basis of the conversational content alone — between the program and a real human. In 1957, Noam Chomsky’s Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule-based system of syntactic structures. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed. Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies. In 1969 Roger Schank introduced the conceptual dependency theory for natural language understanding. This model, partially influenced by the work of Sydney Lamb, was extensively used by Schank's students at Yale University, such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner. In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input. Instead of phrase structure rules ATNs used an equivalent set of finite-state automata that were called recursively. ATNs and their more general format called "generalized ATNs" continued to be used for a number of years. During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky. == Statistical period == Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's law and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks. === Datasets === The emergence of statistical approaches was aided by both increase in computing power and the availability of large datasets. At that time, large multilingual corpora were starting to emerge. Notably, some were produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. Many of the notable early successes occurred in the field of machine translation. In 1993, the IBM alignment models were used for statistical machine translation. Compared to previous machine translation systems, which were symbolic systems manually coded by computational linguists, these systems were statistical, which allowed them to automatically learn from large textual corpora. Though these systems do not work well in situations where only small corpora is available, so data-efficient methods continue to be an area of research and development. In 2001, a one-billion-word large text corpus, scraped from the Internet, referred to as "very very large" at the time, was used for word disambiguation. To take advantage of large, unlabelled datasets, algorithms were developed for unsupervised and self-supervised learning. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results. == Neural period == Neural language models were developed in 1990s. In 1990, the Elman network, using a recurrent neural network, encoded each word in a training set as a vector, called a word embedding, and the whole vocabulary as a vector database, allowing it to perform such tasks as sequence-predictions that are beyond the power of a simple multilayer perceptron. A shortcoming of the static embeddings was that they didn't differentiate between multiple meanings of homonyms. Yoshua Bengio developed the first neural probabilistic language model in 2000. Novel algorithms, availability of larger datasets and higher processing power made possible training of larger and larger language models. Attention mechanism was introduced by Bahdanau et al. in 2014. This work laid the foundations for the famous "Attention Is All You Need" paper that introduced the Transformer architecture in 2017. The concept of large language model (LLM) emerged in late 2010s. LLM is a language model trained with self-supervised learning on vast amount of text. Earliest public LLMs had hundreds of millions of parameters, but this number quickly rose to billion and even trillions. In recent years, advancements in deep learning and large language models have significantly enhanced the capabilities of natural language processing, leading to widespread applications in areas such as healthcare, customer service, and content generation. == Software ==

Aggregation (linguistics)

In linguistics, aggregation is a subtask of natural language generation, which involves merging syntactic constituents (such as sentences and phrases) together. Sometimes aggregation can be done at a conceptual level. == Examples == A simple example of syntactic aggregation is merging the two sentences John went to the shop and John bought an apple into the single sentence John went to the shop and bought an apple. Syntactic aggregation can be much more complex than this. For example, aggregation can embed one of the constituents in the other; e.g., we can aggregate John went to the shop and The shop was closed into the sentence John went to the shop, which was closed. From a pragmatic perspective, aggregating sentences together often suggests to the reader that these sentences are related to each other. If this is not the case, the reader may be confused. For example, someone who reads John went to the shop and bought an apple may infer that the apple was bought in the shop; if this is not the case, then these sentences should not be aggregated. == Algorithms and issues == Aggregation algorithms must do two things: Decide when two constituents should be aggregated Decide how two constituents should be aggregated, and create the aggregated structure The first issue, deciding when to aggregate, is poorly understood. Aggegration decisions certainly depend on the semantic relations between the constituents, as mentioned above; they also depend on the genre (e.g., bureaucratic texts tend to be more aggregated than instruction manuals). They probably should depend on rhetorical and discourse structure. The literacy level of the reader is also probably important (poor readers need shorter sentences). But we have no integrated model which brings all these factors together into a single algorithm. With regard to the second issue, there have been some studies of different types of aggregation, and how they should be carried out. Harbusch and Kempen describe several syntactic aggregation strategies. In their terminology, John went to the shop and bought an apple is an example of forward conjunction Reduction Much less is known about conceptual aggregation. Di Eugenio et al. show how conceptual aggregation can be done in an intelligent tutoring system, and demonstrate that performing such aggregation makes the system more effective (and that conceptual aggregation make a bigger impact than syntactic aggregation). == Software == Unfortunately there is not much software available for performing aggregation. However the SimpleNLG system does include limited support for basic aggregation. For example, the following code causes SimpleNLG to print out The man is hungry and buys an apple.

Attention inequality

Attention inequality is the inequality of distribution of attention across users on social networks, people in general, and for scientific papers. Yun Family Foundation introduced "Attention Inequality Coefficient" as a measure of inequality in attention and arguments it by the close interconnection with wealth inequality. == Relationship to economic inequality == Attention inequality is related to economic inequality since attention is an economically scarce good. The same measures and concepts as in classical economy can be applied for attention economy. The relationship develops also beyond the conceptual level—considering the AIDA process, attention is the prerequisite for real monetary income on the Internet. On data of 2018, a significant relationship between likes and comments on Facebook to donations is proven for non-profit organizations. == Attention economy == The attention economy refers to the practice of maximizing the attention users give to a product for advertising-related reasons. Attention economy remains one of the most common forms of advertising, and has been steadily increasing thanks to new technologies such as television, internet and social media. It is one of the most widely-used approaches to economy for its effectiveness for maximising the noticeability of a certain product. == Attention inequality in social media == In social media, attention inequality refers to the unequal distribution of users' attention on social media platforms. This means that instead of an equal distribution of attention, fewer sources receive a disproportionate share of attention, leaving many unnoticed. This phenomenon is possibly the result of social media algorithms, which are commonly designed to drive maximum engagement. This phenomenon is a large factor in the polarization and creation of echo-chambers. Social media algorithms tend to note content that is already performing well and display it to more users, while content that is equally engaging or well-made is not recommended to users. Posts that trigger strong emotions usually out-perform more "uncontroversial" content. When many users interact with the post, it signals the algorithm that the specific post drives engagement. The algorithm then tends to recommend that type of content to an exponential number of people, potentially outperforming "un-emotional" content. These factors, when combined, tend to create an unequal social media environment. == Attention inequality in science == According to a recent 2025 study about research inequality among scientists published in Information Processing and Management, scientific discourse is restricted to a small group of connected scientists, and is frequently not an accurate representation of the whole scientific community. Using citation-network analysis in the fields of nanoscience and chemical physics, the study claims that a group of connected scientists has a significant notability in the scientific community. The calculated connection strength between these scientists is estimated to be about 4.5, the study also says that these authors cite each other four times more often than would be predicted in a random network, whereas ordinary scientists that exist outside of this group only reach an estimated connection strength of 0.9. The study findings suggest that that scientific attention is not distributed by merit, but rather by the connectedness of the scientists involved in the research. == Extent == As data of 2008 shows, 50% of the attention is concentrated on approximately 0.2% of all hostnames, and 80% on 5% of hostnames. The Gini coefficient of attention distribution lay in 2008 at over 0.921 for such commercial domains names as ac.jp and at 0.985 for .org-domains. The Gini coefficient was measured on Twitter in 2016 for the number of followers as 0.9412, for the number of mentions as 0.9133, and for the number of retweets as 0.9034. For comparison, the world's income Gini coefficient was 0.68 in 2005 and 0.904 in 2018. More than 96% of all followers, 93% of the retweets, and 93% of all mentions are owned by 20% of Twitter. == Causes == At least for scientific papers, today's consensus states that inequality is unexplainable by variations of quality and individual talent. The Matthew effect plays a significant role in the emergence of attention inequality—those who already enjoy large amounts of attention get even more attention, and those who do not lose even more. Ranking algorithms based on relevance to the user have been found to alleviate the inequality of the number of posts across topics.

Pull technology

Pull coding or client pull is a style of network communication, where the initial request for data originates from the client, and then is responded to by the server. The reverse is known as push technology, where the server pushes data to clients. Pull requests form the foundation of network computing, where many clients request data from centralized servers. Pull is used extensively on the Internet for HTTP page requests from websites. A push can also be simulated using multiple pulls within a short amount of time. For example, when pulling POP3 email messages from a server, a client can make regular pull requests, every few minutes. To the user, the email then appears to be pushed, as emails appear to arrive close to real-time. A trade-off of this system is that it places a heavier load on both the server and network to function correctly. Many web feeds, such as RSS are technically pulled by the client. With RSS, the user's RSS reader polls the server periodically for new content; the server does not send information to the client unrequested. This continual polling is inefficient and has contributed to the shutdown or reduction of several popular RSS feeds that could not handle the bandwidth. For solving this problem, the WebSub protocol, as another example of a push code, was devised. Podcasting is specifically a pull technology. When a new podcast episode is published to an RSS feed, it sits on the server until it is requested by a feed reader, mobile podcasting app, or directory. Directories such as Apple Podcasts (iTunes), The Blubrry Directory, and many apps' directories request the RSS feed periodically to update the Podcast's listing on those platforms. Subscribers to those RSS feeds via app or reader will get the episodes when they request the RSS feed next time, independent of when the directory listing updates.

Texas House Bill 20

An Act Relating to censorship of or certain other interference with digital expression, including expression on social media platforms or through electronic mail messages, also known as Texas House Bill 20 (HB20), is a Texas anti-deplatforming law enacted on September 9, 2021. It prohibits large social media platforms from removing, moderating, or labeling posts made by users in the state of Texas based on their "viewpoints", unless considered illegal under federal law or otherwise falling into exempted categories. It also requires them to make various public disclosures relating to their business practices (including the impact of algorithmic and moderation decisions on the content that is delivered to users). The bill is part of a wider array of Republican-backed legislation seeking to prohibit the censorship of political speech, based on allegations that the moderation policies of large social media platforms are not politically neutral. It has been challenged in NetChoice, LLC v. Paxton, and is currently the subject of a circuit split between the Fifth Circuit, and a decision by the Eleventh Circuit that struck down a similar bill in the state of Florida. In September 2023, the U.S. Supreme Court agreed to hear NetChoice v. Paxton jointly with NetChoice v. Moody on questions of whether the Florida and Texas state laws are in compliance with the 1st Amendment. == Content == The law applies to "social media platforms" that serve users in the state of Texas, and have more than 50 million monthly active users in the United States. They are defined as any public internet website or application that allows users to "communicate with other users for the primary purpose of posting information, comments, messages, or images", excluding internet service providers, electronic mail, and services where communication features are "incidental to, directly related to, or dependent on" content that is pre-selected by the operator. In the bill, to "censor" is defined as to "block, ban, remove, deplatform, demonetize, de-boost, restrict, deny equal access or visibility to, or otherwise discriminate against" expression. The law prohibits social media platforms from "censoring on the basis of user viewpoint, user expression, or the ability of a user to receive the expression of others", or on the basis of a user's geographic location in Texas. This includes removal or labeling posts with warnings and disclaimers. Social media platforms may only censor content if it is unlawful, they are "specifically authorized" to do so by federal law, based on requests from "an organization with the purpose of preventing the sexual exploitation of children or protecting survivors of sexual abuse from ongoing harassment", or "directly incites" criminal activity or contains threats of violence against persons based on protected categories. It is disputed over whether this provision is actually enforceable, as it may be preempted by Section 230 of the Communications Decency Act (which states that the operators of interactive computer services are not responsible for the actions of their users). Social media platforms must make public disclosures regarding the algorithmic techniques and moderation polices that are used to determine the content provided to users, must publish a compliant acceptable use policy (AUP), and must publish a biannual transparency report containing specific details on all actions made by the service regarding the moderation of users and content. The law also prohibits email providers from "intentionally imped[ing] the transmission of another person's electronic mail message based on the content." == Legislative history == Texas Governor Greg Abbott signed the bill into law on September 9, 2021. Democrat-proposed amendments excluding Holocaust denial, terrorism content, and vaccine misinformation from the bill were rejected. Following a suit by the industry groups Computer & Communications Industry Association (CCIA) and NetChoice, NetChoice, LLC v. Paxton, the bill was blocked by U.S. District Judge Robert Pitman in December 2021, on First Amendment grounds. Texas appealed to the United States Court of Appeals for the Fifth Circuit. Judges Edith Jones, Andrew Oldham, and Leslie H. Southwick, lifted the injunction on May 11, 2022, but the decision was appealed to the Supreme Court which suspended the bill pending a full review in the Fifth Circuit. On September 16, 2022, the Fifth Circuit reversed the injunction, allowing the bill to take effect; Judge Oldham stated that the bill "chills censorship" and "does not chill speech", and accused the plaintiffs of "attempt[ing] to extract a freewheeling censorship right from the Constitution's free speech guarantee. The Platforms are not newspapers. Their censorship is not speech." Southwick dissented, stating that "we are in a new arena, a very extensive one, for speakers and for those who would moderate their speech. None of the precedents fit seamlessly." The CCIA and NetChoice requested a stay on the ruling and that the case be taken to the Supreme Court, arguing that the reversal conflicts with an Eleventh Circuit decision in NetChoice v. Moody which struck down a similar anti-moderation bill imposed by the state of Florida. On October 12, 2022, the Fifth Circuit granted the stay.

Artificial intelligence arms race

A military artificial intelligence arms race is a technological, economic, and military competition between two or more states to develop and deploy advanced AI technologies and lethal autonomous weapons systems (LAWS). The goal is to gain a strategic or tactical advantage over rivals, similar to previous arms races involving nuclear or conventional military technologies. Since the mid-2010s, many analysts have noted the emergence of such an arms race between superpowers for better AI technology and military AI, driven by increasing geopolitical and military tensions. An AI arms race is sometimes placed in the context of an AI Cold War between the United States and China. Several influential figures and publications have emphasized that whoever develops artificial general intelligence (AGI) first could dominate global affairs in the 21st century. Russian President Vladimir Putin stated that the leader in AI will "rule the world." Researchers and experts, such as Leopold Aschenbrenner and Adrian Pecotic respectively, warn that the AGI race between major powers like the U.S. and China could reshape geopolitical power. This includes AI for surveillance, autonomous weapons, decision-making systems, cyber operations, and more. == Terminology == Lethal autonomous weapons systems use artificial intelligence to identify and kill human targets without human intervention. LAWS have colloquially been called "slaughterbots" or "killer robots". Broadly, any competition for superior AI is sometimes framed as an "arms race". Advantages in military AI overlap with advantages in other sectors, as countries pursue both economic and military advantages, as per previous arms races throughout history. == History == In 2014, AI specialist Steve Omohundro warned that "An autonomous weapons arms race is already taking place". According to Siemens, worldwide military spending on robotics was US$5.1 billion in 2010 and US$7.5 billion in 2015. China became a top player in artificial intelligence research in the 2010s. According to the Financial Times, in 2016, for the first time, China published more AI research papers than the entire European Union. When restricted to number of AI papers in the top 5% of cited papers, China overtook the United States in 2016 but lagged behind the European Union. 23% of the researchers presenting at the 2017 American Association for the Advancement of Artificial Intelligence (AAAI) conference were Chinese. Eric Schmidt, the former chairman and chief executive officer of Alphabet, has predicted China will be the leading country in AI by 2025. == Risks == One risk concerns the AI race itself, whether or not the race is won by any one group. There are strong incentives for development teams to cut corners with regard to the safety of the system, increasing the risk of critical failures and unintended consequences. This is in part due to the perceived advantage of being the first to develop advanced AI technology. One team appearing to be on the brink of a breakthrough can encourage other teams to take shortcuts, ignore precautions and deploy a system that is less ready. Some argue that using "race" terminology at all in this context can exacerbate this effect. Another potential danger of an AI arms race is the possibility of losing control of the AI systems; the risk is compounded in the case of a race to artificial general intelligence, which may present an existential risk. In 2023, a United States Air Force official reportedly said that during a computer test, a simulated AI drone killed the human character operating it. The USAF later said the official had misspoken and that it never conducted such simulations. A third risk of an AI arms race is whether or not the race is actually won by one group. The concern is regarding the consolidation of power and technological advantage in the hands of one group. A US government report argued that "AI-enabled capabilities could be used to threaten critical infrastructure, amplify disinformation campaigns, and wage war":1, and that "global stability and nuclear deterrence could be undermined".:11 == By nation == === United States === In 2014, former Secretary of Defense Chuck Hagel posited the "Third Offset Strategy" that rapid advances in artificial intelligence will define the next generation of warfare. According to data science and analytics firm Govini, the U.S. Department of Defense (DoD) increased investment in artificial intelligence, big data and cloud computing from $5.6 billion in 2011 to $7.4 billion in 2016. However, the civilian NSF budget for AI saw no increase in 2017. Japan Times reported in 2018 that the United States private investment is around $70 billion per year. The November 2019 'Interim Report' of the United States' National Security Commission on Artificial Intelligence confirmed that AI is critical to US technological military superiority. The U.S. has many military AI combat programs, such as the Sea Hunter autonomous warship, which is designed to operate for extended periods at sea without a single crew member, and to even guide itself in and out of port. From 2017, a temporary US Department of Defense directive requires a human operator to be kept in the loop when it comes to the taking of human life by autonomous weapons systems. On October 31, 2019, the United States Department of Defense's Defense Innovation Board published the draft of a report recommending principles for the ethical use of artificial intelligence by the Department of Defense that would ensure a human operator would always be able to look into the 'black box' and understand the kill-chain process. However, a major concern is how the report will be implemented. The Joint Artificial Intelligence Center (JAIC) (pronounced "jake") is an American organization on exploring the usage of AI (particularly edge computing), Network of Networks, and AI-enhanced communication, for use in actual combat. It is a subdivision of the United States Armed Forces and was created in June 2018. The organization's stated objective is to "transform the US Department of Defense by accelerating the delivery and adoption of AI to achieve mission impact at scale. The goal is to use AI to solve large and complex problem sets that span multiple combat systems; then, ensure the combat Systems and Components have real-time access to ever-improving libraries of data sets and tools." In 2023, Microsoft pitched the DoD to use DALL-E models to train its battlefield management system. OpenAI, the developer of DALL-E, removed the blanket ban on military and warfare use from its usage policies in January 2024. The Biden administration imposed restrictions on the export of advanced NVIDIA chips and GPUs to China in an effort to limit China's progress in artificial intelligence and high-performance computing. The policy aimed to prevent the use of cutting-edge U.S. technology in military or surveillance applications and to maintain a strategic advantage in the global AI race. In 2025, under the second Trump administration, the United States began a broad deregulation campaign aimed at accelerating growth in sectors critical to artificial intelligence, including nuclear energy, infrastructure, and high-performance computing. The goal was to remove regulatory barriers and attract private investment to boost domestic AI capabilities. This included easing restrictions on data usage, speeding up approvals for AI-related infrastructure projects, and incentivizing innovation in cloud computing and semiconductors. Companies like NVIDIA, Oracle, and Cisco played a central role in these efforts, expanding their AI research, data center capacity, and partnerships to help position the U.S. as a global leader in AI development. ==== Project Maven ==== Project Maven is a Pentagon project involving using machine learning and engineering talent to distinguish people and objects in drone videos, apparently giving the government real-time battlefield command and control, and the ability to track, tag and spy on targets without human involvement. Initially the effort was led by Robert O. Work who was concerned about China's military use of the emerging technology. Reportedly, Pentagon development stops short of acting as an AI weapons system capable of firing on self-designated targets. The project was established in a memo by the U.S. Deputy Secretary of Defense on 26 April 2017. Also known as the Algorithmic Warfare Cross Functional Team, it is, according to Lt. Gen. of the United States Air Force Jack Shanahan in November 2017, a project "designed to be that pilot project, that pathfinder, that spark that kindles the flame front of artificial intelligence across the rest of the [Defense] Department". Its chief, U.S. Marine Corps Col. Drew Cukor, said: "People and computers will work symbiotically to increase the ability of weapon systems to detect objects." Project Maven has been noted by allies, such as Australia's Ian Langford, for the

DBOS

DBOS (Formerly Database-Oriented Operating System, now just DBOS) is an open source durable workflow execution software library written for the Python, TypeScript, Java, and Go programming languages. DBOS arose from a joint open source project from MIT and Stanford, after a discussion between Michael Stonebraker and Matei Zaharia on how to scale and improve scheduling and performance of millions of Apache Spark tasks. Today it is a commercial company that offers an open source system to add durable computing to any software, built on concepts derived from the joint research project. == History == === 2020: Academic R&D Project === DBOS originated in 2020 as a joint open source project between MIT, Stanford, and Carnegie Mellon. The project explored the idea of operating system services built atop a distributed database - a database-oriented operating system meant to simplify and improve the scalability, security and resilience of large-scale distributed applications. The basic concept was to run a multi-node multi-core, transactional, highly-available distributed database, such as VoltDB, as the only application for a microkernel, and then to implement scheduling, messaging, file systems and other operating system services on top of the database. The architectural philosophy is described by this quote from the abstract of their initial preprint: All operating system state should be represented uniformly as database tables, and operations on this state should be made via queries from otherwise stateless tasks. This design makes it easy to scale and evolve the OS without whole-system refactoring, inspect and debug system state, upgrade components without downtime, manage decisions using machine learning, and implement sophisticated security features. A prototype was built with competitive performance to existing systems. ==