Ancient text corpora

Ancient text corpora are the entire collection of texts from the period of ancient history, defined in this article as the period from the beginning of writing up to 300 AD. These corpora are important for the study of literature, history, linguistics, and other fields, and are a fundamental component of the world's cultural heritage. Chinese, Latin, and Greek are examples of ancient languages with significant text corpora, although much of these corpora is known to us via transmission (frequently via medieval manuscript copies) rather than in its original form. These texts – both transmitted and original – provide valuable insights into the history and culture of different regions of the world, and have been studied for centuries by scholars and researchers. Other ancient texts – particularly stone inscriptions and papyrus scrolls – have been published following archaeological research, notably the cuneiform corpus of c. 10 million words and the c. 5 million words in ancient Egyptian.

Through advances in technology and digitization, ancient text corpora are more accessible than ever before. Tools such as the Perseus Digital Library and the Digital Corpus of Sanskrit have made it easier for researchers to access and analyze these texts.

== Quantifying the corpora ==

Two types of ancient texts are known to modern scholars – those that have only survived in younger manuscripts, but whose great age is undisputed (this applies to the bulk of the Chinese, Brahmi, Greek, Latin, Hebrew and Avestan tradition), and those known from original inscriptions, papyri and other manuscripts. Counting the words in each corpus presents significant methodological challenges – in principle, every single occurrence of a word in the text is counted separately, but in the case of parallel transmission of literary texts, only a single transmission is taken into account.
Just as the Book of the Dead and the coffin texts are only included once in the number given for Egyptian, the Greek and Latin literary works should only be counted according to one manuscript. If, on the other hand, tombs, royal inscriptions or economic documents of certain ancient languages often show a more or less identical form, this is not evaluated as a purely "parallel tradition". Attached prepositions are counted as separate words, except in the case of the definite article in Hebrew, Aramaic and Greek since it has no equivalent in most languages, so its frequency would significantly affect the comparability of numbers.

=== Languages with known size estimates ===

=== South Asian ===

* Sanskrit (Vedic Sanskrit and Classical Sanskrit)
* Indus script (3,800 items, c. 20,000 characters)
* Brahmi script
* Old Tamil
* Early Indian epigraphy and Indian epic poetry
* Kharosthi
* Pali literature
* List of historic Indian texts

=== Mesoamerican ===

* Olmec hieroglyphs
* Maya script

=== East Asian ===

* Old Chinese
* Chinese classics
* The pre-Qin corpus: a collection of ancient Chinese texts written before the Qin dynasty (221 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought.
* The pre-Han corpus: a collection of ancient Chinese texts written before the Han dynasty (202 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought.
* See the Chinese Text Project
* Chinese bronze inscriptions, Oracle bone script, Seal script, Clerical script

=== Central Iranian languages ===

Prior to 300 AD, the Central Iranian languages are attested mainly in Sassanid stone inscriptions in two closely related idioms, Middle Persian and Parthian (written in Pahlavi scripts and Inscriptional Parthian). The corpus of Middle Persian (mostly 3rd, but also 4th/5th centuries) comprises about 5000 words and the corpus of Parthian (3rd century) about 3000 words.
To what extent some of the Manichaean Middle Persian literary texts may date back to the 3rd century is difficult to estimate; Mani is said to have personally written the Shabuhragan, totaling about 5000 words. In any case, if we combine Middle Persian and Parthian, we come to over 10,000 words.

=== Proto-Sinaitic ===

Proto-Sinaitic script has no more than about 400 letters (the number of words is unknown since the script has not been fully interpreted). The approximately contemporaneous Proto-Canaanite inscriptions are probably of a similar extent (ibid.).

=== Anatolian ===

* Luwian cuneiform, approx. 3000 words
* the Palaic language, a few hundred words
* Hieroglyphic Luwian
* the Lycian alphabet (the best attested Anatolian successor language written in alphabetic script), with about 5000 words
* the Lydian alphabet: 109 inscriptions comprising about 1500 words
* the Phrygian alphabet: the tomb inscriptions from the 2nd and 3rd centuries AD (approx. 1000 words) and the so-called "old Phrygian" inscriptions (fewer than 300 words)
* the Carian alphabets, whose texts, mainly from Egypt, contain around 600 words

=== Old Italic ===

* the Umbrian language, attested essentially by the sacrificial instructions of the Iguvinian Tables, with 5000 words
* the Oscan language (ibid.), with 2000 words
* the Messapic language, with probably a good 1000 words (the estimate is difficult because most texts in this hardly understandable language do not use word separators)
* the Venetic language, a few hundred words
* the Faliscan language, a few hundred words
* Cisalpine Celtic inscriptions, which amount to approximately 2000 words, to which are added a number of glosses by classical authors

=== Iberia ===

* Iberian scripts, more rarely written in Greek or Latin script, approx. 2500 words
* Celtiberian script, which refers to Celtic language testimonies in Iberian, but also in Latin script from Spain (approx. 1000 words)
* Southwest Paleohispanic script, 78 inscriptions, a few hundred words
* Lusitanian language, three monuments in Latin script, approx. 60 words

=== Germanic Northern Europe ===

Runic inscriptions dated before the 4th century amount to about 30 pieces, which contain no more than 50 words in total.

=== Africa ===

* Geʽez script: comparatively few inscriptions with a total of around 1,000 words before 300 AD. Following Christianization in the 4th century, more extensive texts are known.
* Libyco-Berber alphabet: over 1,000 inscriptions from the Maghreb, which are dated to Roman times. Most texts do not use a word separator; Peust estimates that the total number of words could be around 5,000.
* Meroitic script (Ancient Nubian): about 900 texts are known, which Peust estimates may contain approximately 10,000 words, albeit with uncertainty from the fact that the word separator is not used consistently in the Meroitic script.

=== Aegean ===

* The Cretan Linear A inscriptions, which have not yet been deciphered, are available in about 2500 texts, which contain a total of around 20,000 characters. The total number of words can hardly be determined; Peust tentatively put it in the same order of magnitude as in Meroitic. In addition to the Linear A texts, there are also inscriptions in Cretan hieroglyphs of a few hundred characters, and texts written in the Greek alphabet, but not in Greek, with a few dozen words.
* Cypriot syllabary in the first millennium BC, in which mostly Greek texts were recorded. The relevant texts comprise around 100 to 200 words.

=== Micro corpora ===

There are a significant number of ancient micro-corpus languages. Estimating the total number of attested ancient languages may be as difficult as estimating their corpus size. For example, Greek and Latin sources hand down an enormous amount of foreign-language glosses, the seriousness of which is not always certain.
== Preservation and curation ==

Preserving and maintaining ancient text corpora presents several challenges, including physical preservation, translation, and digitization. Many ancient texts have been lost over time, and those that survive may be damaged or fragmented. Translating ancient languages and scripts requires specialized expertise, and digitizing texts can be time-consuming and resource-intensive.

== Corpus linguistics ==

The field of corpus linguistics studies language as expressed in text corpora. This includes the analysis of word frequency, collocations, grammar, and semantics. Ancient text corpora provide a valuable resource for corpus linguistics research, enabling scholars to explore the evolution of language and culture over time.
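
As a rough illustration of the word-frequency and collocation analysis mentioned above, the following sketch counts word occurrences and adjacent word pairs in a tiny corpus. It is only a minimal example under stated assumptions: the function names (`tokenize`, `word_frequencies`, `bigram_counts`) and the two English sample sentences are invented for illustration, and a real ancient-language corpus would need script-specific tokenization rather than the simple regular expression used here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, illustrative rule)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(texts):
    """Count every occurrence of every word across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def bigram_counts(texts):
    """Count adjacent word pairs within each text, a very simple collocation measure."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical toy corpus, not taken from any real source.
sample_texts = [
    "the king built the temple",
    "the king restored the temple wall",
]

frequencies = word_frequencies(sample_texts)
bigrams = bigram_counts(sample_texts)

print(frequencies.most_common(3))
print(bigrams.most_common(2))
```

Counting raw occurrences in this way corresponds to the principle that every occurrence of a word is counted separately; it makes no attempt to collapse parallel transmissions of the same work.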

Artificial general intelligence

Artificial general intelligence (AGI) is a hypothetical type of artificial intelligence that matches or surpasses human capabilities across virtually all cognitive tasks. Beyond AGI, artificial superintelligence (ASI) would outperform the best human abilities across every domain by a wide margin. Unlike artificial narrow intelligence (ANI), whose competence is confined to well‑defined tasks, an AGI system can generalise knowledge, transfer skills between domains, and solve novel problems without task‑specific reprogramming. Creating AGI is a stated goal of technology companies such as OpenAI, Google, xAI, and Meta. A 2020 survey identified 72 active AGI research and development projects across 37 countries. AGI is a common topic in science fiction and futures studies. Contention exists over whether AGI represents an existential risk. Some AI experts and industry figures have stated that mitigating the risk of human extinction posed by AGI should be a global priority. Others find the development of AGI to be in too remote a stage to present such a risk. == Terminology == AGI is also known as strong AI, full AI, human-level AI, human-level intelligent AI, or general intelligent action. The term "artificial general intelligence" was used in 1997 by Mark Gubrud in a discussion of the implications of fully automated military production and operations. A mathematical formalism of AGI named AIXI was proposed in 2000 by Marcus Hutter, who defines intelligence as "an agent’s ability to achieve goals or succeed in a wide range of environments". This type of AGI has also been called "universal artificial intelligence". The term AGI was re-introduced and popularized by Shane Legg and Ben Goertzel around 2002. Some academic sources reserve the term "strong AI" for computer programs that will experience sentience or consciousness. In contrast, weak AI (or narrow AI) can solve a specific problem but lacks general cognitive abilities. 
Some academic sources use "weak AI" to refer more broadly to any programs that neither experience consciousness nor have a mind in the same sense as humans. Related concepts include artificial superintelligence and transformative AI. An artificial superintelligence (ASI) is a hypothetical type of AGI that is much more generally intelligent than humans, while the notion of transformative AI relates to AI having a large impact on society, for example, similar to the agricultural or industrial revolution. A framework for classifying AGI was proposed in 2023 by Google DeepMind researchers. They define five performance levels of AGI: emerging, competent, expert, virtuoso, and superhuman. For example, a competent AGI is defined as an AI that outperforms 50% of skilled adults in a wide range of non-physical tasks, and a superhuman AGI (i.e., an artificial superintelligence) is similarly defined but with a threshold of 100%. They consider large language models like ChatGPT or LLaMA 2 to be instances of emerging AGI (comparable to unskilled humans). Regarding the autonomy of AGI and associated risks, they define five levels: tool (fully in human control), consultant, collaborator, expert, and agent (fully autonomous).

== Characteristics ==

There is no single agreed-upon definition of intelligence as applied to computers. Computer scientist John McCarthy wrote in 2007: "We cannot yet characterize in general what kinds of computational procedures we want to call intelligent."

=== Intelligence traits ===

Researchers generally hold that a system is required to do all of the following to be regarded as an AGI:
* reason, use strategy, solve puzzles, and make judgments under uncertainty,
* represent knowledge, including common sense knowledge,
* plan,
* learn,
* communicate in natural language,
* if necessary, integrate these skills in completion of any given goal.

Many interdisciplinary approaches (e.g.
cognitive science, computational intelligence, and decision making) consider additional traits such as imagination (the ability to form novel mental images and concepts) and autonomy. Computer-based systems exhibiting these capabilities are now widespread, with modern large language models demonstrating computational creativity, automated reasoning, and decision support simultaneously across domains. === Physical traits === Other capabilities are considered desirable in intelligent systems, as they may affect intelligence or aid in its expression. These include: the ability to sense (e.g. see, hear, etc.), and the ability to act (e.g. move and manipulate objects, change location to explore, etc.) This includes the ability to detect and respond to hazard. === Tests for human-level AGI === Several tests meant to confirm human-level AGI have been considered. ==== Turing test ==== The Turing test was proposed by Alan Turing in his 1950 paper "Computing Machinery and Intelligence". This test involves a human judge engaging in natural language conversations with both a human and a machine designed to generate human-like responses. The machine passes the test if it can convince the judge that it is human a significant fraction of the time. Turing proposed this as a practical measure of machine intelligence, focusing on the ability to produce human-like responses rather than on the internal workings of the machine. The idea of the test is that the machine has to try and pretend to be a man, by answering questions put to it, and it will only pass if the pretence is reasonably convincing. A considerable portion of a jury, who should not be experts about machines, must be taken in by the pretence. In 2014, a chatbot named Eugene Goostman, designed to imitate a 13-year-old Ukrainian boy, reportedly passed a Turing Test event by convincing 33% of judges that it was human. 
However, this claim was met with significant skepticism from the AI research community, who questioned the test's implementation and its relevance to AGI. A 2025 pre‑registered, three‑party Turing‑test study by Cameron R. Jones and Benjamin K. Bergen showed that GPT-4.5 was judged to be the human in 73% of five‑minute text conversations—surpassing the 67% humanness rate of real confederates and meeting the researchers' criterion for having passed the test. ==== Ikea test ==== The "Ikea test", also known as the Flat Pack Furniture Test, involves an AI controlling a robot which attempts to assemble an Ikea flat-pack furniture product after having been shown the parts and instructions. As early as 2013, MIT's IkeaBot demonstrated fully autonomous multi-robot assembly of an IKEA Lack table in ten minutes, with no human intervention and no pre-programmed assembly instructions. The robots inferred the assembly sequence from the geometry of the parts alone. ==== Coffee test ==== Steve Wozniak proposed a test where a machine is required to enter an average American home and figure out how to make coffee. It must find the coffee machine, find the coffee, add water, find a mug, and brew the coffee by pushing the proper buttons. This test has been substantially approached across multiple systems. In January 2024, Figure AI's Figure 01 humanoid learned to operate a Keurig coffee machine autonomously after watching video demonstrations, using end-to-end neural networks to translate visual input into motor actions. In 2025, researchers at the University of Edinburgh published the ELLMER framework in Nature Machine Intelligence, demonstrating a robotic arm that interprets verbal instructions, analyses its surroundings, and autonomously makes coffee in dynamic kitchen environments — adapting to unforeseen obstacles in real time rather than following pre-programmed sequences. 
==== Suleyman's test ==== Mustafa Suleyman's test proposes giving an AI model US$100,000 and asking it to obtain US$1 million. ==== Use of video-games ==== Adams, et al. propose that the ability to learn and succeed in a wide range of video games can be used to test AI intelligence. This range would include games unknown to the AGI developers before the test is administered. === AI-complete problems === A problem is informally called "AI-complete" or "AI-hard" if it is believed that AGI would be needed to solve it, because the solution is beyond the capabilities of a purpose-specific algorithm. == History == === Classical AI === Modern AI research began in the mid-1950s. The first generation of AI researchers were convinced that artificial general intelligence was possible and that it would exist in just a few decades. AI pioneer Herbert A. Simon wrote in 1965: "machines will be capable, within twenty years, of doing any work a man can do". Their predictions were the inspiration for Stanley Kubrick and Arthur C. Clarke's fictional character HAL 9000, who embodied what AI researchers believed they could create by the year 2001. AI pioneer Marvin Minsky was a consultant on the project of making HAL 9000 as realistic as possible according to the consensus predictions of the time. He said in 1967, "Within a generation... the problem of

Fansly

Fansly is a subscription-based social media platform that allows content creators to monetize exclusive content, including photos, videos, live streams, and direct messages. Operated by Select Media LLC, the platform is headquartered in Baltimore, Maryland. While the platform hosts a variety of content genres, it is primarily known for adult content and is frequently compared to OnlyFans. == History == Fansly was launched in 2020 by Micheal Etelis under Select Media LLC, which was incorporated in February 2020. The platform also operates through CY Media LTD, registered in Kamares, Cyprus, established in May 2021. The company has remained privately held with no disclosed external funding rounds or official valuation, operating as a bootstrapped entity. Based on Fansly's social media presence, which was created in November 2020, the platform did not begin gaining traction until early 2021 when creators started to become concerned about potential content policy changes at OnlyFans. In August 2021, OnlyFans announced it would ban sexually explicit content effective October 2021, citing pressure from banks involved in its payment processing. Although OnlyFans reversed the decision six days later, the announcement triggered a massive influx of users to Fansly; the platform received nearly 4,000 new creator applications in a single hour, causing its servers to crash from the surge in traffic. By August 21, 2021, Fansly had reached 2.1 million users. == Features and business model == Fansly operates as a B2C marketplace, taking a 20% commission on all transactions conducted on the platform, with creators retaining the remaining 80%. This commission rate is the same as that charged by its main competitor, OnlyFans. A distinguishing feature of Fansly is its tiered subscription model, which allows creators to set multiple subscription levels at different price points, each offering different perks such as exclusive content, chat access, or custom requests. 
By contrast, OnlyFans historically relied on a single-tier subscription model. Revenue streams on the platform include recurring subscriptions, one-time pay-per-view content purchases, tips, paid messaging, and live-streaming fees. The platform also features an algorithmic "For You" feed that helps users discover new creators, addressing a limitation of competitors that lack internal content promotion mechanisms. Additional features include content watermarking, geolocation blocking to control where content is visible, two-factor authentication, community polls, 24-hour stories, and social media integration with platforms such as Twitter and Twitch. Payouts are processed within one to two business days and support multiple methods, including bank transfers, Skrill, Paxum, and cryptocurrency. In December 2025, Fansly expanded its live-streaming capabilities, introducing ticketed access, private list gating, configurable chat permissions, stream goals, and interactive device integration. == Controversies == === OnlyFans anti-competitive allegations === In August 2022, a series of lawsuits were filed in the United States alleging that OnlyFans had bribed employees of Meta Platforms to place Instagram accounts of creators who also sold content on competitor platforms, including Fansly, onto a terrorist blacklist. The lawsuits alleged that adult performers had traffic driven away from their Instagram accounts after being falsely tagged as terror-related. OnlyFans denied awareness of such activity. The plaintiffs withdrew the bribery claim in July 2023, and the case was dismissed in August 2023. === Privacy class action === In June 2025, Select Media LLC (operating as Fansly) was the subject of a digital privacy class action lawsuit filed in Massachusetts District Court. The lawsuit alleged that the platform secretly collected and shared users' sensitive viewing data with Google and other third parties without consent. 
The case was brought on behalf of an estimated class of over 10,000 users across multiple states.

False answer supervision

False answer supervision (FAS) is a form of VoIP fraud in which the duration billed to the caller is longer than the actual duration of the connection. FAS is usually performed by VoIP wholesalers in their softswitches on randomly selected calls. Adding a small number of extra billed seconds to many calls results in significant revenue for the VoIP wholesaler.

== Implementation of FAS ==

FAS fraud can be implemented in a softswitch in many different ways. These include:
* False billing of party A without calling party B. Usually a fake ringback tone, loopback audio or voicemail message is played.
* Start of billing before the actual answer of party B
* Extra billing after disconnection of party B

== Detection of FAS ==

FAS can be detected and blocked in a softswitch. Common methods are:
* Manual verification of call detail records: listening to voice recordings
* Identification of FAS types and using algorithms to automatically detect FAS
* RTP audio signal processing: detection of voice
* RTP audio signal processing: detection of silence
* RTP audio signal processing: detection of ringback tone
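
The three implementation types listed above suggest one simple way an automatic check could work: compare the billed interval in a call detail record with independently observed answer and disconnect times. The sketch below is a hypothetical illustration only. The `CallRecord` fields, the `classify_fas` function, the labels it returns, and the one-second tolerance are all assumptions made for this example and do not describe any particular softswitch or billing system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CallRecord:
    """Hypothetical call detail record; all times are seconds since call setup."""
    billing_start: float
    billing_end: float
    answered_at: Optional[float]      # None if party B never answered
    disconnected_at: Optional[float]  # None if no disconnect was observed


def classify_fas(rec: CallRecord, tolerance: float = 1.0) -> List[str]:
    """Return labels for the FAS patterns that this record appears to match."""
    findings = []
    if rec.answered_at is None:
        # Party A was billed although party B never answered.
        if rec.billing_end > rec.billing_start:
            findings.append("billed_without_answer")
        return findings
    if rec.answered_at - rec.billing_start > tolerance:
        # Billing started noticeably before party B answered.
        findings.append("early_billing_start")
    if (rec.disconnected_at is not None
            and rec.billing_end - rec.disconnected_at > tolerance):
        # Billing continued noticeably after party B disconnected.
        findings.append("late_billing_end")
    return findings


unanswered = classify_fas(CallRecord(0.0, 30.0, None, None))
padded = classify_fas(CallRecord(0.0, 65.0, 5.0, 60.0))
clean = classify_fas(CallRecord(5.0, 60.5, 5.0, 60.0))
print(unanswered, padded, clean)
```

In practice the observed answer and disconnect times would have to come from a source the billing party does not control, such as the RTP audio analysis mentioned in the detection list.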

Creator economy

The creator economy, also known as the influencer economy, is a platform-driven economy in which creators produce content, products, or services and distribute them directly to their audience through social media platforms and emerging technologies. This economic model is based on the ability of creators to build and maintain communities of users, monetizing their creative activity through multiple channels including advertising, sponsorships, product sales, crowdfunding, and subscription-based services. Creators include various professional categories such as social media influencers, YouTubers, bloggers, artists, online educators, podcasters, and independent professionals, who use platforms as infrastructure to reach their audience without necessarily relying on traditional intermediaries in the cultural and media industry. According to Goldman Sachs Research, the ongoing growth of the creator economy will likely benefit companies that possess a combination of factors, including a large global user base, access to substantial capital, robust AI-powered recommendation engines, versatile monetization tools, comprehensive data analytics, and integrated e-commerce options. Examples of creator economy software platforms include YouTube, TikTok, Instagram, Facebook, Twitch, Spotify, Substack, OnlyFans and Patreon.

== History ==

The term "creator" was coined by YouTube in 2011 to be used instead of "YouTube star", an expression that at the time could only apply to famous individuals on the platform. The term has since become omnipresent and is used to describe anyone creating any form of online content. A number of platforms such as TikTok, Snapchat, YouTube, and Facebook have set up funds with which to pay creators.

== Criticism ==

The large majority of content creators derive no monetary gain from their creations, with most of the benefits accruing to the platforms, which can make significant revenues from their uploads. As few as 0.1% of creators are able to earn a living through their channels.

Expectation propagation

Expectation propagation (EP) is a technique in Bayesian machine learning. EP finds approximations to a probability distribution. It uses an iterative approach that exploits the factorization structure of the target distribution. It differs from other Bayesian approximation approaches such as variational Bayesian methods.

More specifically, suppose we wish to approximate an intractable probability distribution $p(\mathbf{x})$ with a tractable distribution $q(\mathbf{x})$. Expectation propagation achieves this approximation by minimizing the Kullback–Leibler divergence $\mathrm{KL}(p \,\|\, q)$. Variational Bayesian methods minimize $\mathrm{KL}(q \,\|\, p)$ instead. If $q(\mathbf{x})$ is a Gaussian $\mathcal{N}(\mathbf{x} \mid \mu, \Sigma)$, then $\mathrm{KL}(p \,\|\, q)$ is minimized with $\mu$ and $\Sigma$ being equal to the mean of $p(\mathbf{x})$ and the covariance of $p(\mathbf{x})$, respectively; this is called moment matching.

== Applications ==

Expectation propagation via moment matching plays a vital role in approximation for indicator functions that appear when deriving the message passing equations for TrueSkill.
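
The moment-matching property stated above can be checked numerically in one dimension. The sketch below is only an illustration of that single step, not of the full iterative EP algorithm: it takes a two-component Gaussian mixture as a stand-in for $p$, computes its mean and variance, and compares $\mathrm{KL}(p \,\|\, q)$ for the moment-matched Gaussian with two perturbed Gaussians. The mixture parameters, the function names and the integration grid are arbitrary assumptions chosen for the example.

```python
import math


def mixture_moments(weights, means, variances):
    """Mean and variance of a 1-D Gaussian mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second - mean * mean


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def kl_p_q(weights, means, variances, q_mean, q_var, lo=-20.0, hi=20.0, n=4001):
    """Approximate KL(p || q) by the trapezoid rule, with p the mixture and q a Gaussian."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        p = sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        if p > 0.0:
            log_q = -0.5 * math.log(2 * math.pi * q_var) - 0.5 * (x - q_mean) ** 2 / q_var
            edge = 0.5 if i in (0, n - 1) else 1.0
            total += edge * p * (math.log(p) - log_q)
    return total * h


# Arbitrary example mixture standing in for the intractable distribution p.
weights, means, variances = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]

mu, var = mixture_moments(weights, means, variances)
kl_matched = kl_p_q(weights, means, variances, mu, var)
kl_shifted = kl_p_q(weights, means, variances, mu + 0.5, var)
kl_widened = kl_p_q(weights, means, variances, mu, 1.5 * var)

print(mu, var)
print(kl_matched, kl_shifted, kl_widened)
```

The moment-matched Gaussian gives the smallest of the three divergences, which is what the statement about $\mu$ and $\Sigma$ predicts for this example.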

Web testing

Web testing is software testing that focuses on web applications. Complete testing of a web-based system before going live can help address issues before the system is revealed to the public. Issues may include the security of the web application, the basic functionality of the site, its accessibility to disabled and fully able users, its ability to adapt to the multitude of desktops, devices, and operating systems, as well as readiness for expected traffic and number of users and the ability to survive a massive spike in user traffic, both of which are related to load testing.

== Web application performance tool ==

A web application performance tool (WAPT) is used to test web applications and web-related interfaces. These tools are used for performance, load and stress testing of web applications, web sites, web APIs, web servers and other web interfaces. A WAPT simulates virtual users that repeatedly request either recorded URLs or a specified URL, and allows the tester to specify the number of times or iterations that the virtual users repeat them. In this way, the tool is useful for checking for bottlenecks and performance leakage in the website or web application being tested.

A WAPT faces various challenges during testing and should be able to conduct tests for:
* Browser compatibility
* Operating system compatibility
* Windows application compatibility where required

A WAPT allows a user to specify how virtual users are involved in the testing environment, i.e. with an increasing, constant or periodic user load. Increasing the user load step by step is called RAMP, where virtual users are increased from 0 to hundreds. A constant user load maintains the specified user load at all times. A periodic user load increases and decreases the user load from time to time.

== Web security testing ==

Web security testing tells us whether the requirements of web-based applications are met when they are subjected to malicious input data. There is a web application security testing plug-in collection for Firefox.

== Web API testing ==

An application programming interface (API) exposes services to other software components, which can query the API. The API implementation is in charge of computing the service and returning the result to the component that sent the query. A part of web testing focuses on testing these web API implementations. GraphQL is a specific query and API language. It is the focus of tailored testing techniques. Search-based test generation yields good results when generating test cases for GraphQL APIs.
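
The three virtual-user load patterns described in the web application performance tool section (increasing, constant and periodic) can be sketched as simple functions that return the number of virtual users at a given test step. This is a hypothetical illustration rather than the interface of any particular tool: the function names, the linear ramp and the square-wave shape of the periodic load are all assumptions made for the example.

```python
def ramp_load(step, total_steps, max_users):
    """Increasing load: grow linearly from 0 users to max_users over total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the ramp interval
    return (max_users * step) // total_steps


def constant_load(step, users):
    """Constant load: the same number of users at every step."""
    return users


def periodic_load(step, low, high, period):
    """Periodic load: high for the first half of each period, low for the second half."""
    if period <= 0:
        raise ValueError("period must be positive")
    return high if (step % period) < period / 2 else low


ramp_schedule = [ramp_load(s, 5, 100) for s in range(6)]
constant_schedule = [constant_load(s, 40) for s in range(3)]
periodic_schedule = [periodic_load(s, 10, 50, 4) for s in range(6)]

print(ramp_schedule)
print(constant_schedule)
print(periodic_schedule)
```

A load generator would then start or stop virtual users at each step so that the active count follows the chosen schedule.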