Cross-language information retrieval

Cross-language information retrieval

Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query. The term "cross-language information retrieval" has many synonyms, of which the following are perhaps the most frequent: cross-lingual information retrieval, translingual information retrieval, multilingual information retrieval. The term "multilingual information retrieval" refers more generally both to technology for retrieval of multilingual collections and to technology which has been moved to handle material in one language to another. The term Multilingual Information Retrieval (MLIR) involves the study of systems that accept queries for information in various languages and return objects (text, and other media) of various languages, translated into the user's language. Cross-language information retrieval refers more specifically to the use case where users formulate their information need in one language and the system retrieves relevant documents in another. To do so, most CLIR systems use various translation techniques. CLIR techniques can be classified into different categories based on different translation resources: Dictionary-based CLIR techniques Parallel corpora based CLIR techniques Comparable corpora based CLIR techniques Machine translator based CLIR techniques CLIR systems have improved so much that the most accurate multi-lingual and cross-lingual adhoc information retrieval systems today are nearly as effective as monolingual systems. Other related information access tasks, such as media monitoring, information filtering and routing, sentiment analysis, and information extraction require more sophisticated models and typically more processing and analysis of the information items of interest. Much of that processing needs to be aware of the specifics of the target languages it is deployed in. Mostly, the various mechanisms of variation in human language pose coverage challenges for information retrieval systems: texts in a collection may treat a topic of interest but use terms or expressions which do not match the expression of information need given by the user. This can be true even in a mono-lingual case, but this is especially true in cross-lingual information retrieval, where users may know the target language only to some extent. The benefits of CLIR technology for users with poor to moderate competence in the target language has been found to be greater than for those who are fluent. Specific technologies in place for CLIR services include morphological analysis to handle inflection, decompounding or compound splitting to handle compound terms, and translations mechanisms to translate a query from one language to another. The first workshop on CLIR was held in Zürich during the SIGIR-96 conference. Workshops have been held yearly since 2000 at the meetings of the Cross Language Evaluation Forum (CLEF). Researchers also convene at the annual Text Retrieval Conference (TREC) to discuss their findings regarding different systems and methods of information retrieval, and the conference has served as a point of reference for the CLIR subfield. Early CLIR experiments were conducted at TREC-6, held at the National Institute of Standards and Technology (NIST) on November 19–21, 1997. Google Search had a cross-language search feature that was removed in 2013.

Symbolic artificial intelligence

In artificial intelligence, symbolic artificial intelligence (also known as classical artificial intelligence or logic-based artificial intelligence) is the term for the collection of all methods in artificial intelligence research that are based on high-level symbolic (human-readable) representations of problems, logic, and search. Symbolic AI used tools such as logic programming, production rules, semantic nets and frames, and it developed applications such as knowledge-based systems (in particular, expert systems), symbolic mathematics, automated theorem provers, ontologies, the semantic web, and automated planning and scheduling systems. The Symbolic AI paradigm led to important ideas in search, symbolic programming languages, agents, multi-agent systems, the semantic web, and the strengths and limitations of formal knowledge and reasoning systems. Symbolic AI was the dominant paradigm of AI research from the mid-1950s until the mid-1990s. Researchers in the 1960s and the 1970s were convinced that symbolic approaches would eventually succeed in creating a machine with artificial general intelligence and considered this the ultimate goal of their field. An early boom, with early successes such as the Logic Theorist and Samuel's Checkers Playing Program, led to unrealistic expectations and promises and was followed by the first AI Winter as funding dried up. A second boom (1969–1986) occurred with the rise of expert systems, their promise of capturing corporate expertise, and an enthusiastic corporate embrace. That boom, and some early successes, e.g., with XCON at DEC, was followed again by later disappointment. Problems with difficulties in knowledge acquisition, maintaining large knowledge bases, and brittleness in handling out-of-domain problems arose. Another, second, AI Winter (1988–2011) followed. Subsequently, AI researchers focused on addressing underlying problems in handling uncertainty and in knowledge acquisition. Uncertainty was addressed with formal methods such as hidden Markov models, Bayesian reasoning, and statistical relational learning. Symbolic machine learning addressed the knowledge acquisition problem with contributions including Version Space, Valiant's PAC learning, Quinlan's ID3 decision-tree learning, case-based learning, and inductive logic programming to learn relations. Neural networks, a subsymbolic approach, had been pursued from early days and reemerged strongly in 2012. Early examples are Rosenblatt's perceptron learning work, the backpropagation work of Rumelhart, Hinton and Williams, and work in convolutional neural networks by LeCun et al. in 1989. However, neural networks were not viewed as successful until about 2012: "Until Big Data became commonplace, the general consensus in the Al community was that the so-called neural-network approach was hopeless. Systems just didn't work that well, compared to other methods. ... A revolution came in 2012, when a number of people, including a team of researchers working with Hinton, worked out a way to use the power of GPUs to enormously increase the power of neural networks." Over the next several years, deep learning had spectacular success in handling vision, speech recognition, speech synthesis, image generation, and machine translation, though symbolic approaches continue to be useful in a few domains such as computer algebra systems and proof assistants. == History == A short history of symbolic AI to the present day follows below. Time periods and titles are drawn from Henry Kautz's 2020 AAAI Robert S. Engelmore Memorial Lecture and the longer Wikipedia article on the History of AI, with dates and titles differing slightly for increased clarity. === The first AI summer: irrational exuberance, 1948–1966 === Success at early attempts in AI occurred in three main areas: artificial neural networks, knowledge representation, and heuristic search, contributing to high expectations. This section summarizes Kautz's reprise of early AI history. ==== Approaches inspired by human or animal cognition or behavior ==== Cybernetic approaches attempted to replicate the feedback loops between animals and their environments. A robotic turtle, with sensors, motors for driving and steering, and seven vacuum tubes for control, based on a preprogrammed neural net, was built as early as 1948. This work can be seen as an early precursor to later work in neural networks, reinforcement learning, and situated robotics. An important early symbolic AI program was the Logic theorist, written by Allen Newell, Herbert Simon and Cliff Shaw in 1955–56, as it was able to prove 38 elementary theorems from Whitehead and Russell's Principia Mathematica. Newell, Simon, and Shaw later generalized this work to create a domain-independent problem solver, GPS (General Problem Solver). GPS solved problems represented with formal operators via state-space search using means-ends analysis. During the 1960s, symbolic approaches achieved great success at simulating intelligent behavior in structured environments such as game-playing, symbolic mathematics, and theorem-proving. AI research was concentrated in four institutions in the 1960s: Carnegie Mellon University, Stanford, MIT and (later) University of Edinburgh. Each one developed its own style of research. Earlier approaches based on cybernetics or artificial neural networks were abandoned or pushed into the background. Herbert Simon and Allen Newell studied human problem-solving skills and attempted to formalize them, and their work laid the foundations of the field of artificial intelligence, as well as cognitive science, operations research and management science. Their research team used the results of psychological experiments to develop programs that simulated the techniques that people used to solve problems. This tradition, centered at Carnegie Mellon University would eventually culminate in the development of the Soar architecture in the middle 1980s. ==== Heuristic search ==== In addition to the highly specialized domain-specific kinds of knowledge that we will see later used in expert systems, early symbolic AI researchers discovered another more general application of knowledge. These were called heuristics, rules of thumb that guide a search in promising directions: "How can non-enumerative search be practical when the underlying problem is exponentially hard? The approach advocated by Simon and Newell is to employ heuristics: fast algorithms that may fail on some inputs or output suboptimal solutions." Another important advance was to find a way to apply these heuristics that guarantees a solution will be found, if there is one, not withstanding the occasional fallibility of heuristics: "The A algorithm provided a general frame for complete and optimal heuristically guided search. A is used as a subroutine within practically every AI algorithm today but is still no magic bullet; its guarantee of completeness is bought at the cost of worst-case exponential time. ==== Early work on knowledge representation and reasoning ==== Early work covered both applications of formal reasoning emphasizing first-order logic, along with attempts to handle common-sense reasoning in a less formal manner. ===== Modeling formal reasoning with logic: the "neats" ===== Unlike Simon and Newell, John McCarthy felt that machines did not need to simulate the exact mechanisms of human thought, but could instead try to find the essence of abstract reasoning and problem-solving with logic, regardless of whether people used the same algorithms. His laboratory at Stanford (SAIL) focused on using formal logic to solve a wide variety of problems, including knowledge representation, planning and learning. Logic was also the focus of the work at the University of Edinburgh and elsewhere in Europe which led to the development of the programming language Prolog and the science of logic programming. ===== Modeling implicit common-sense knowledge with frames and scripts: the "scruffies" ===== Researchers at MIT (such as Marvin Minsky and Seymour Papert) found that solving difficult problems in vision and natural language processing required ad hoc solutions—they argued that no simple and general principle (like logic) would capture all the aspects of intelligent behavior. Roger Schank described their "anti-logic" approaches as "scruffy" (as opposed to the "neat" paradigms at CMU and Stanford). Commonsense knowledge bases (such as Doug Lenat's Cyc) are an example of "scruffy" AI, since they must be built by hand, one complicated concept at a time. === The first AI winter: crushed dreams, 1967–1977 === The first AI winter was a shock: During the first AI summer, many people thought that machine intelligence could be achieved in just a few years. The Defense Advance Research Projects Agency (DARPA) launched programs to support AI research to use AI to solve problems of national security; in particular, to automate the translation of Russian to English for inte

The Best Free AI Analytics Tool for Beginners

Trying to pick the best AI analytics tool? An AI analytics tool is software that uses machine learning to help you get more done — it scales effortlessly from a single task to thousands. The best picks balance beginner-friendly simplicity with the depth power users need, and they ship updates often. Whether you are a beginner or a pro, the right AI analytics tool slots into your workflow and pays for itself fast. This guide breaks down the top picks, their pros and cons, and who each one is best for.

Is an AI Pair Programmer Worth It in 2026?

Shopping for the best AI pair programmer? An AI pair programmer is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI pair programmer slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

Collocation

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated. There are about seven main types of collocations: adjective + noun, noun + noun (such as collective nouns), noun + verb, verb + noun, adverb + adjective, verbs + prepositional phrase (phrasal verbs), and verb + adverb. Collocation extraction is a computational technique that finds collocations in a document or corpus, using various computational linguistics elements resembling data mining. == Expanded definition == Collocations are partly or fully fixed expressions that become established through repeated context-dependent use. Such terms as crystal clear, middle management, nuclear family, and cosmetic surgery are examples of collocated pairs of words. Collocations can be in a syntactic relation (such as verb–object: make and decision), lexical relation (such as antonymy), or they can be in no linguistically defined relation. Knowledge of collocations is vital for the competent use of a language: a grammatically correct sentence will stand out as awkward if collocational preferences are violated. This makes collocation a common focus for language teaching. Corpus linguists specify a key word in context (KWIC) and identify the words immediately surrounding them, to illustrate the way words are used in practice. The processing of collocations involves a number of parameters, the most important of which is the measure of association, which evaluates whether the co-occurrence is purely by chance or statistically significant. Due to the non-random nature of language, most collocations are classed as significant, and the association scores are simply used to rank the results. Commonly used measures of association include mutual information, t scores, and log-likelihood. Rather than select a single definition, Gledhill proposes that collocation involves at least three different perspectives: co-occurrence, a statistical view, which sees collocation as the recurrent appearance in a text of a node and its collocates; construction, which sees collocation either as a correlation between a lexeme and a lexical-grammatical pattern, or as a relation between a base and its collocative partners; and expression, a pragmatic view of collocation as a conventional unit of expression, regardless of form. These different perspectives contrast with the usual way of presenting collocation in phraseological studies. Traditionally speaking, collocation is explained in terms of all three perspectives at once, in a continuum: == In dictionaries == In 1933, Harold Palmer's Second Interim Report on English Collocations highlighted the importance of collocation as a key to producing natural-sounding language, for anyone learning a foreign language. Thus from the 1940s onwards, information about recurrent word combinations became a standard feature of monolingual learner's dictionaries. As these dictionaries became "less word-centred and more phrase-centred", more attention was paid to collocation. This trend was supported, from the beginning of the 21st century, by the availability of large text corpora and intelligent corpus-querying software, making it possible to provide a more systematic account of collocation in dictionaries. Using these tools, dictionaries such as the Macmillan English Dictionary and the Longman Dictionary of Contemporary English included boxes or panels with lists of frequent collocations. There are also a number of specialized dictionaries devoted to describing the frequent collocations in a language. These include (for Spanish) Redes: Diccionario combinatorio del español contemporaneo (2004), (for French) Le Robert: Dictionnaire des combinaisons de mots (2007), and (for English) the LTP Dictionary of Selected Collocations (1997) and the Macmillan Collocations Dictionary (2010). == Statistically significant collocation == Student's t-test can be used to determine whether the occurrence of a collocation in a corpus is statistically significant. For a bigram w 1 w 2 {\displaystyle w_{1}w_{2}} , let P ( w 1 ) = # w 1 N {\displaystyle P(w_{1})={\frac {\#w_{1}}{N}}} be the unconditional probability of occurrence of w 1 {\displaystyle w_{1}} in a corpus with size N {\displaystyle N} , and let P ( w 2 ) = # w 2 N {\displaystyle P(w_{2})={\frac {\#w_{2}}{N}}} be the unconditional probability of occurrence of w 2 {\displaystyle w_{2}} in the corpus. The t-score for the bigram w 1 w 2 {\displaystyle w_{1}w_{2}} is calculated as: where x ¯ = # w i w j N {\displaystyle {\bar {x}}={\frac {\#w_{i}w_{j}}{N}}} is the sample mean of the occurrence of w 1 w 2 {\displaystyle w_{1}w_{2}} , # w 1 w 2 {\displaystyle \#w_{1}w_{2}} is the number of occurrences of w 1 w 2 {\displaystyle w_{1}w_{2}} , μ = P ( w i ) P ( w j ) {\displaystyle \mu =P(w_{i})P(w_{j})} is the probability of w 1 w 2 {\displaystyle w_{1}w_{2}} under the null-hypothesis that w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} appear independently in the text, and s 2 = x ¯ ( 1 − x ¯ ) ≈ x ¯ {\displaystyle s^{2}={\bar {x}}(1-{\bar {x}})\approx {\bar {x}}} is the sample variance. With a large N {\displaystyle N} , the t-test is equivalent to a Z-test.

Cyborg

A cyborg () is a being with both organic and biomechatronic body parts. It is a portmanteau of cybernetic and organism. The term was coined in 1960 by Manfred Clynes and Nathan S. Kline. In contrast to biorobots and androids, the term cyborg applies to a living organism that has undergone restoration of function or enhancements of abilities due to the integration of some artificial component or technology that relies on feedback. == Description and definition == Alternative names for a cyborg include cybernetic organism, cyber-organism, cyber-organic being, cybernetically enhanced organism, cybernetically augmented organism, technorganic being, techno-organic being, and techno-organism. Unlike bionics, biorobotics, or androids, a cyborg is an organism that has restored function or, especially, enhanced abilities due to the integration of some artificial component or technology that relies on some sort of feedback, for example: prostheses, artificial organs, implants or, in some cases, wearable technology. Cyborg technologies may enable or support collective intelligence. A related idea is the "augmented human". While cyborgs are commonly thought of as mammals, including humans, the term can apply to any organism. === Placement and distinctions === D. S. Halacy's Cyborg: Evolution of the Superman (1965) featured an introduction which spoke of a "new frontier" that was "not merely space, but more profoundly the relationship between 'inner space' to 'outer space' – a bridge...between mind and matter." In "A Cyborg Manifesto", Donna Haraway rejects the notion of rigid boundaries between humanity and technology, arguing that, as humans depend on more technology over time, humanity and technology have become too interwoven to draw lines between them. She believes that since we have allowed and created machines and technology to be so advanced, there should be no reason to fear what we have created, and cyborgs should be embraced because they are part of human identities. However, Haraway has also expressed concern over the contradictions of scientific objectivity and the ethics of technological evolution, and has argued that "There are political consequences to scientific accounts of the world." === Biosocial definition === According to some definitions of the term, the physical attachments that humans have with even the most basic technologies have already made them cyborgs. In a typical example, a human with an artificial cardiac pacemaker or implantable cardioverter-defibrillator would be considered a cyborg, since these devices measure voltage potentials in the body, perform signal processing, and can deliver electrical stimuli, using a synthetic feedback mechanism to keep that person alive. Implants, especially cochlear implants, that combine mechanical modification with any kind of feedback response are also cybernetic enhancements. Some theorists cite such modifications as contact lenses, hearing aids, smartphones, or intraocular lenses as examples of fitting humans with technology to enhance their biological capabilities. The emerging trend of implanting microchips inside the body (mainly the hands), to make financial operations like a contactless payment, or basic tasks like opening a door, has been erroneously marketed as more recent examples of cybernetic enhancement. The latter has not yet seen significant traction outside niche areas in Scandinavia and in actual function is little more than a pre-programmed Radio-frequency identification (RFID) microchip encased in glass that does not interact with the human body (it is the same technology used in the microchips injected into animals for ease of identification), thus not fitting the definition of a cybernetic implant. As cyborgs currently are on the rise, some theorists argue there is a need to develop new definitions of aging. For instance, a bio-techno-social definition of aging has been suggested. The term is also used to address human-technology mixtures in the abstract. This includes not only commonly used pieces of technology such as phones, computers, the Internet, and so on, but also artifacts that are not usually considered technology; for example, pen and paper, and speech and language. When augmented with these technologies and connected in communication with people in other times and places, a person becomes capable of more than they were before. An example is a computer, which gains power by using Internet protocols to connect with other computers. Another example is a social-media bot—either a bot-assisted human or a human-assisted-bot—used to target social media with likes and shares. Cybernetic technologies thus include highways, pipes, electrical wiring, buildings, electrical plants, libraries, and other infrastructural constructs. Bruce Sterling, in his Shaper/Mechanist universe, suggested an idea of an alternative cyborg called 'Lobster', which is made not by using internal implants, but by using an external shell (e.g. a powered exoskeleton). The computer game Deus Ex: Invisible War prominently features cyborgs called Omar, Russian for 'lobster'. === Evolutionary perspective === In 1994, Hans Hass formulated a scientific view of the human-machine hybrids he called "hypercells". They can expand their biological cell body with artificial artifacts and thus expand their performance body. The theory of hypercells or Homo proteus, as Hass called the human-machine hybrid to distinguish Homo sapiens, extends Charles Darwin's theory of evolution and deals with the course of evolution beyond humans. In his 2019 book Novacene, James Lovelock used the term "cyborgs" to refer to the next generation of beings who will become the "understanders of the future" and "lead the cosmos to self-knowledge". While acknowledging the organic component in Clynes' and Kline's definition, he proposed that these cyborgs "will have designed and built themselves from the artificial intelligence systems we have already constructed", and used the term cyborg "to emphasize that the new intelligent beings will have arisen, like us, from Darwinian evolution." == Origins == The concept of a man-machine mixture was widespread in science fiction before World War II. As early as 1843, Edgar Allan Poe described a man with extensive prostheses in the short story "The Man That Was Used Up". In 1911, Jean de La Hire introduced the Nyctalope, a science fiction hero who was perhaps the first literary cyborg, in Le Mystère des XV (later translated as The Nyctalope on Mars). Nearly two decades later, Edmond Hamilton presented space explorers with a mixture of organic and machine parts in his 1928 novel The Comet Doom. He later featured the talking, living brain of an old scientist, Simon Wright, floating in a transparent case, and in all the adventures of his famous hero, Captain Future. In 1944, in the short story "No Woman Born", C. L. Moore wrote of Deirdre, a dancer, whose body was burned completely and whose brain was placed in a faceless but beautiful and supple mechanical body. In 1960, the term "cyborg" was coined by Manfred E. Clynes and Nathan S. Kline to refer to their conception of an enhanced human being who could survive in extraterrestrial environments: For the exogenously extended organizational complex functioning as an integrated homeostatic system unconsciously, we propose the term 'Cyborg'. Their concept was the outcome of thinking about the need for an intimate relationship between human and machine as the new frontier of space exploration was beginning to develop. A designer of physiological instrumentation and electronic data-processing systems, Clynes was the chief research scientist in the Dynamic Simulation Laboratory at Rockland State Hospital in New York. The term first appears in print 5 months earlier when The New York Times reported on the "Psychophysiological Aspects of Space Flight Symposium" where Clynes and Kline first presented their paper: A cyborg is essentially a man-machine system in which the control mechanisms of the human portion are modified externally by drugs or regulatory devices so that the being can live in an environment different from the normal one. Thereafter, Hamilton would first use the term "cyborg" explicitly in the 1962 short story, "After a Judgment Day", to describe the "mechanical analogs" called "Charlies," explaining that "[c]yborgs, they had been called from the first one in the 1960s...cybernetic organisms." The 1972 novel Cyborg by Martin Caidin introduced the character of bionic government agent Steve Austin, and was adapted into the popular television series The Six Million Dollar Man, which ran from 1973 to 1978. In 2001, a book titled Cyborg: Digital Destiny and Human Possibility in the Age of the Wearable computer was published by Doubleday. Some of the ideas in the book were incorporated into the documentary film Cyberman that same year. == Cyborg tissues in engineering == Cyborg tissues structured with carbon nanotubes and plan

Ranking SVM

In machine learning, a ranking SVM is a variant of the support vector machine algorithm, which is used to solve certain ranking problems (via learning to rank). The ranking SVM algorithm was published by Thorsten Joachims in 2002. The original purpose of the algorithm was to improve the performance of an internet search engine. However, it was found that ranking SVM also can be used to solve other problems such as Rank SIFT. == Description == The ranking SVM algorithm is a learning retrieval function that employs pairwise ranking methods to adaptively sort results based on how 'relevant' they are for a specific query. The ranking SVM function uses a mapping function to describe the match between a search query and the features of each of the possible results. This mapping function projects each data pair (such as a search query and clicked web-page, for example) onto a feature space. These features are combined with the corresponding click-through data (which can act as a proxy for how relevant a page is for a specific query) and can then be used as the training data for the ranking SVM algorithm. Generally, ranking SVM includes three steps in the training period: It maps the similarities between queries and the clicked pages onto a certain feature space. It calculates the distances between any two of the vectors obtained in step 1. It forms an optimization problem which is similar to a standard SVM classification and solves this problem with the regular SVM solver. == Background == === Ranking method === Suppose C {\displaystyle \mathbb {C} } is a data set containing N {\displaystyle N} elements c i {\displaystyle c_{i}} . r {\displaystyle r} is a ranking method applied to C {\displaystyle \mathbb {C} } . Then the r {\displaystyle r} in C {\displaystyle \mathbb {C} } can be represented as a N × N {\displaystyle N\times N} binary matrix. If the rank of c i {\displaystyle c_{i}} is higher than the rank of c j {\displaystyle c_{j}} , i.e. r c i < r c j {\displaystyle r\ c_{i}