AI Video Tools

Explore the best AI Video Tools — independent reviews, comparisons, pricing and step-by-step how-to guides, curated by Aizhi.

  • ArcSoft ShowBiz

    ArcSoft ShowBiz

    ShowBiz is a video editor by ArcSoft for the Windows operating system. It can create VCD and DVDs and can also export to the formats AVI, MPEG, WMV, and MOV. ShowBiz also contains a DVD burning and menu building feature. As of 2003, it was one of the three most dominant bundled titles. == Reception == PC Magazine reviewer Jan Ozer states: "ArcSoft's ShowBiz has evolved into a competent editor that's generally more usable than Dazzle's MovieStar program, providing more configuration controls, better preview features, and a much greater range of fun effects." John Virata, senior editor of Digital Media Online, says in his three page review of ShowBiz DVD 2, "It is an easy editor to work with and has a logically laid out interface that takes you step by step through the video creation and DVD creation process"

    Read more →
  • Thompson's construction

    Thompson's construction

    In computer science, Thompson's construction algorithm, also called the McNaughton–Yamada–Thompson algorithm, is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). This NFA can be used to match strings against the regular expression. This algorithm is credited to Ken Thompson. Regular expressions and nondeterministic finite automata are two representations of formal languages. For instance, text processing utilities use regular expressions to describe advanced search patterns, but NFAs are better suited for execution on a computer. Hence, this algorithm is of practical interest, since it can compile regular expressions into NFAs. From a theoretical point of view, this algorithm is a part of the proof that they both accept exactly the same languages, that is, the regular languages. An NFA can be made deterministic by the powerset construction and then be minimized to get an optimal automaton corresponding to the given regular expression. However, an NFA may also be interpreted directly. To decide whether two given regular expressions describe the same language, each can be converted into an equivalent minimal deterministic finite automaton via Thompson's construction, powerset construction, and DFA minimization. If, and only if, the resulting automata agree up to renaming of states, the regular expressions' languages agree. == The algorithm == The algorithm works recursively by splitting an expression into its constituent subexpressions, from which the NFA will be constructed using a set of rules. More precisely, from a regular expression E, the obtained automaton A with the transition function Δ respects the following properties: A has exactly one initial state q0, which is not accessible from any other state. That is, for any state q and any letter a, $\Delta(q, a)$ does not contain q0. A has exactly one final state qf, which is not co-accessible from any other state.
That is, for any letter a, $\Delta(q_f, a) = \emptyset$. Let c be the number of concatenations in the regular expression E and let s be the number of symbols apart from parentheses, that is, |, *, a and ε. Then, the number of states of A is 2s − c (linear in the size of E). The number of transitions leaving any state is at most two. Since an NFA of m states and at most e transitions from each state can match a string of length n in time O(emn), a Thompson NFA can do pattern matching in linear time, assuming a fixed-size alphabet. === Rules === The following rules are depicted according to Aho et al. (2007), p. 122. In what follows, N(s) and N(t) are the NFA of the subexpressions s and t, respectively. The empty-expression ε is converted to a single ε-transition from the initial state to the final state. A symbol a of the input alphabet is converted to a single transition labelled a from the initial state to the final state. The union expression s|t is converted to an NFA whose initial state q goes via ε either to the initial state of N(s) or N(t). Their final states become intermediate states of the whole NFA and merge via two ε-transitions into the final state of the NFA. The concatenation expression st is converted to N(s) followed by N(t). The initial state of N(s) is the initial state of the whole NFA. The final state of N(s) becomes the initial state of N(t). The final state of N(t) is the final state of the whole NFA. The Kleene star expression s* is converted to an NFA in which an ε-transition connects initial and final state of the NFA with the sub-NFA N(s) in between. Another ε-transition from the inner final to the inner initial state of N(s) allows for repetition of expression s according to the star operator. The parenthesized expression (s) is converted to N(s) itself. With these rules, using the empty expression and symbol rules as base cases, it is possible to prove with structural induction that any regular expression may be converted into an equivalent NFA. == Example == Two examples are now given, a small informal one with the result, and a bigger with a step by step application of the algorithm.
=== Small Example === The picture below shows the result of Thompson's construction on (ε|a*b). The purple oval corresponds to a, the teal oval corresponds to a*, the green oval corresponds to b, the orange oval corresponds to a*b, and the blue oval corresponds to ε. === Application of the algorithm === As an example, the picture shows the result of Thompson's construction algorithm on the regular expression (0|(1(01*(00)*0)*1)*)* that denotes the set of binary numbers that are multiples of 3: { ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", ... }. The upper right part shows the logical structure (syntax tree) of the expression, with "." denoting concatenation (assumed to have variable arity); subexpressions are named a-q for reference purposes. The left part shows the nondeterministic finite automaton resulting from Thompson's algorithm, with the entry and exit state of each subexpression colored in magenta and cyan, respectively. An ε as transition label is omitted for clarity; unlabelled transitions are in fact ε transitions. The entry and exit state corresponding to the root expression q is the start and accept state of the automaton, respectively. The algorithm's steps are as follows: An equivalent minimal deterministic automaton is shown below. == Relation to other algorithms == Thompson's is one of several algorithms for constructing NFAs from regular expressions; an earlier algorithm was given by McNaughton and Yamada. Converse to Thompson's construction, Kleene's algorithm transforms a finite automaton into a regular expression. Glushkov's construction algorithm is similar to Thompson's construction, once the ε-transitions are removed. == Use in string pattern matching == Regular expressions are often used to specify patterns that software is then asked to match.
Generating an NFA by Thompson's construction, and using an appropriate algorithm to simulate it, it is possible to create pattern-matching software with performance that is $O(mn)$, where m is the length of the regular expression and n is the length of the string being matched. This is much better than is achieved by many popular programming-language implementations; however, it is restricted to purely regular expressions and does not support patterns for non-regular languages like backreferences.
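
    The rules and the simulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (parse, build, closure, matches) and the tuple-based syntax tree are invented for this example, the only operators supported are |, * and parentheses, and the empty expression must be written explicitly as the character ε.

```python
def parse(regex):
    """Parse into a tuple tree; precedence: union < concatenation < star."""
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def union():
        nonlocal pos
        node = concat()
        while peek() == "|":
            pos += 1
            node = ("union", node, concat())
        return node

    def concat():
        node = star()
        while peek() not in (None, "|", ")"):
            node = ("concat", node, star())
        return node

    def star():
        nonlocal pos
        node = atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    def atom():
        nonlocal pos
        ch = peek()
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected {ch!r} at position {pos}")
        pos += 1
        if ch == "(":
            node = union()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        return ("eps",) if ch == "ε" else ("sym", ch)

    tree = union()
    if pos != len(regex):
        raise ValueError(f"unexpected {regex[pos]!r} at position {pos}")
    return tree


def build(tree):
    """Return (eps, sym, start, final); every state has at most two exits."""
    eps, sym, counter = {}, {}, [0]

    def new_state():
        counter[0] += 1
        return counter[0] - 1

    def go(node, start):
        # Invariant: `start` has no outgoing transitions yet, and the
        # returned final state has none either.
        kind = node[0]
        if kind == "eps":
            final = new_state()
            eps[start] = [final]
        elif kind == "sym":
            final = new_state()
            sym[start] = (node[1], final)
        elif kind == "concat":
            # The final state of N(s) becomes the initial state of N(t).
            final = go(node[2], go(node[1], start))
        elif kind == "union":
            s1, s2 = new_state(), new_state()
            f1, f2 = go(node[1], s1), go(node[2], s2)
            final = new_state()
            eps[start], eps[f1], eps[f2] = [s1, s2], [final], [final]
        else:  # star
            inner = new_state()
            inner_final = go(node[1], inner)
            final = new_state()
            eps[start] = [inner, final]        # enter N(s) or skip it
            eps[inner_final] = [inner, final]  # repeat N(s) or leave
        return final

    start = new_state()
    return eps, sym, start, go(tree, start)


def closure(eps, states):
    """All states reachable from `states` through ε-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in eps.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def matches(regex, text):
    """Simulate the NFA directly by tracking the set of active states."""
    eps, sym, start, final = build(parse(regex))
    current = closure(eps, {start})
    for ch in text:
        moved = {sym[q][1] for q in current if q in sym and sym[q][0] == ch}
        current = closure(eps, moved)
    return final in current
```

    Calling matches("(ε|a*b)", "aab") exercises the small example from the text. Because the simulation keeps a set of states rather than backtracking, each input character costs at most one pass over the automaton, which is where the bound proportional to the product of pattern length and input length comes from.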

    Read more →
  • StarDict

    StarDict

    StarDict, developed by Hu Zheng (胡正), is a free GUI released under the GPL-3.0-or-later license for accessing StarDict dictionary files (a dictionary shell). It is the successor of StarDic, developed by Ma Su'an (馬蘇安), continuing its version numbers. According to StarDict's earlier homepage on SourceForge, the project was removed from SourceForge due to copyright infringement reports. It moved to Google Code and then back to SourceForge, while development now seemingly continues on GitHub. == Supported platforms == StarDict runs under Linux, Windows, FreeBSD, Maemo and Solaris. Dictionaries of the user's choice are installed separately. Dictionary files can be created by converting dict files. Several programs compatible with the StarDict dictionary format are available for different platforms. For the iPhone, iPod Touch and iPad, applications available in the App Store include GuruDic, TouchDict, weDict, Dictionary Universal, Alpus and others, as well as the free iStarDict, which is available for the Cydia Store. == Dictionaries available == Many FreeDict dictionaries can be converted to the StarDict format. These include, in particular, some older versions of Webster's dictionary and many dictionaries for various languages. == Features == While StarDict is in scan mode, results are displayed in a tooltip, allowing easy dictionary lookup. When combined with FreeDict, StarDict will quickly provide rough translations of foreign language websites. On September 25, 2006, an online version of StarDict began operation. This online version includes access to all the major dictionaries of StarDict, as well as Wikipedia in Chinese. Previous versions of StarDict were very similar to the PowerWord dictionary program, which is developed by a Chinese company, KingSoft.
Since version 2.4.2, however, StarDict has diverged from the design of PowerWord by increasing its search capabilities and adding lexicons in a variety of languages. This was assisted by the collaboration of many developers with the author. == sdcv == Evgeniy A. Dushistov produced a command line version of StarDict called sdcv. It employed all the dictionary files that belong to StarDict. It is written in C++ and licensed under the terms of the GNU General Public License. sdcv runs under Linux, FreeBSD, and Solaris. As in StarDict, dictionaries of the user's choice have to be installed separately. At the end of 2006, software developer Hu Zheng cited personal financial problems as an excuse to charge users for downloading dictionary files from his website, which temporarily aroused strong doubts and dissatisfaction in the Linux community. In the end, under the pressure of public opinion, the charging plan was forced to be canceled and ended hastily.

    Read more →
  • Flex (lexical analyzer generator)

    Flex (lexical analyzer generator)

    Flex (fast lexical analyzer generator) is a free and open-source software alternative to lex. It is a computer program that generates lexical analyzers (also known as "scanners" or "lexers"). It is frequently used as the lex implementation together with Berkeley Yacc parser generator on BSD-derived operating systems (as both lex and yacc are part of POSIX), or together with GNU bison (a version of yacc) in BSD ports and in Linux distributions. Unlike Bison, flex is not part of the GNU Project and is not released under the GNU General Public License, although a manual for Flex was produced and published by the Free Software Foundation. == History == Flex was written in C around 1987 by Vern Paxson, with the help of many ideas and much inspiration from Van Jacobson. The original version was by Jef Poskanzer. The fast table representation is a partial implementation of a design done by Van Jacobson. The implementation was done by Kevin Gong and Vern Paxson. == Example lexical analyzer == The example is a Flex scanner for the instructional programming language PL/0. The tokens recognized are: '+', '-', '*', '/', '=', '(', ')', ',', ';', '.', ':=', '<', '<=', '<>', '>', '>='; numbers: 0-9 {0-9}; identifiers: a-zA-Z {a-zA-Z0-9} and keywords: begin, call, const, do, end, if, odd, procedure, then, var, while. == Internals == These programs perform character parsing and tokenizing via the use of a deterministic finite automaton (DFA). A DFA is a theoretical machine accepting regular languages, and is equivalent to read-only right moving Turing machines. The syntax is based on the use of regular expressions. See also nondeterministic finite automaton. == Issues == === Time complexity === A Flex lexical analyzer usually has time complexity $O(n)$ in the length of the input. That is, it performs a constant number of operations for each input symbol. This constant is quite low: GCC generates 12 instructions for the DFA match loop.
Note that the constant is independent of the length of the token, the length of the regular expression and the size of the DFA. However, using the REJECT macro in a scanner with the potential to match extremely long tokens can cause Flex to generate a scanner with non-linear performance. This feature is optional. In this case, the programmer has explicitly told Flex to "go back and try again" after it has already matched some input. This will cause the DFA to backtrack to find other accept states. The REJECT feature is not enabled by default, and because of its performance implications its use is discouraged in the Flex manual. === Reentrancy === By default the scanner generated by Flex is not reentrant. This can cause serious problems for programs that use the generated scanner from different threads. To overcome this issue there are options that Flex provides in order to achieve reentrancy. A detailed description of these options can be found in the Flex manual. === Usage under non-Unix environments === Normally the generated scanner contains references to the unistd.h header file, which is Unix specific. To avoid generating code that includes unistd.h, %option nounistd should be used. Another issue is the call to isatty (a Unix library function), which can be found in the generated code. The %option never-interactive forces flex to generate code that does not use isatty. === Using flex from other languages === Flex can only generate code for C and C++. To use the scanner code generated by flex from other languages a language binding tool such as SWIG can be used. === Unicode support === Flex is limited to matching 1-byte (8-bit) binary values and therefore does not support Unicode. RE/flex and other alternatives do support Unicode matching. == Flex++ == flex++ is a similar lexical scanner for C++ which is included as part of the flex package. 
The generated code does not depend on any runtime or external library except for a memory allocator (malloc or a user-supplied alternative) unless the input also depends on it. This can be useful in embedded and similar situations where traditional operating system or C runtime facilities may not be available. The flex++ generated C++ scanner includes the header file FlexLexer.h, which defines the interfaces of the two C++ generated classes.
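
    To make the PL/0 token set listed under "Example lexical analyzer" concrete, here is a rough Python analogue. It is not Flex input and not generated by Flex: it swaps in Python's re module, which is a backtracking matcher rather than a table-driven DFA, and the names TOKEN_RE, KEYWORDS and tokenize, as well as the token kind labels, are assumptions made for this sketch.

```python
import re

KEYWORDS = {"begin", "call", "const", "do", "end", "if", "odd",
            "procedure", "then", "var", "while"}

# Two-character operators are listed before the one-character class so that
# ':=', '<=', '<>' and '>=' win over their prefixes. Flex would get the same
# effect from its longest-match rule; here the ordering is done by hand.
TOKEN_RE = re.compile(r"""
    (?P<number>[0-9]+)
  | (?P<ident>[a-zA-Z][a-zA-Z0-9]*)
  | (?P<op>:=|<=|<>|>=|[-+*/=(),;.<>])
  | (?P<space>\s+)
""", re.VERBOSE)


def tokenize(source):
    """Yield (kind, text) pairs; whitespace is skipped."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == "space":
            continue
        # Keywords share the identifier pattern and are separated afterwards.
        if kind == "ident" and text in KEYWORDS:
            kind = "keyword"
        yield kind, text
```

    Iterating over tokenize("x := 10") yields one pair per token. Classifying keywords after the match keeps the pattern short; a Flex specification would more typically give each keyword its own rule.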

    Read more →
  • Artificial Inventor Project

    Artificial Inventor Project

    The Artificial Inventor Project (AIP) is a global legal initiative headed by Professor Ryan Abbott dedicated to pursuing intellectual property (IP) rights for inventions and creative works generated autonomously by artificial intelligence (AI) systems without traditional human inventorship or authorship. The project coordinates a series of pro bono test cases worldwide, aiming to prompt law reform and public debate on how IP law should accommodate non-human creators. == History == In 2019, AIP filed patent applications in multiple jurisdictions, including the United States, United Kingdom, European Patent Office, Australia, Switzerland, and South Africa, naming the AI system DABUS (Device for the Autonomous Bootstrapping of Unified Sentience), created by Stephen Thaler, as the inventor. The aim was to challenge legal norms that require inventors to be natural persons and highlight pressing policy questions about AI-generated innovation and IP regimes. == Legal proceedings by jurisdiction == === Australia === In July 2021, a Federal Court of Australia judge (Beach J) ruled that AI can be considered an inventor under the Patents Act 1990, ordering IP Australia to reinstate the relevant patent. However, the full court then overturned this ruling on appeal and denied further review. === European Patent Office === The EPO Board of Appeal determined in 2022 that only a human inventor may be named, rendering DABUS‑based applications unacceptable. === South Africa === In 2021, a patent was granted listing DABUS as the inventor. As South Africa’s procedural system does not involve substantive inventorship review, the grant proceeded on formal grounds alone. === Switzerland === On 26 June 2025, the Swiss Federal Administrative Court ruled that artificial intelligence systems such as DABUS cannot be listed as inventors on patent applications. 
The court upheld the existing practice of the Swiss Federal Institute of Intellectual Property (IPI), affirming that only natural persons may be recognized as inventors under Swiss patent law. === United Kingdom === In December 2023, the UK Supreme Court unanimously held that AI systems cannot be legally recognized as inventors, affirming that "an inventor must be a person" under current British law. === United States === In Thaler v. Hirshfeld (2021), a U.S. federal court agreed with the USPTO that inventors must be natural persons, rejecting the DABUS application and setting a precedent consistent with existing statute and administrative policy. == Criticism and impact == The project has fueled substantial discourse. Critics caution that allowing AI inventorship may complicate notions of accountability and ownership. Proponents argue that legal recognition must evolve to avoid disincentivizing innovation produced by AI and to maintain honesty about the true source of invention.

    Read more →
  • AI Video Generators Reviews: What Actually Works in 2026

    AI Video Generators Reviews: What Actually Works in 2026

    Comparing the best AI video generator? An AI video generator is software that uses machine learning to help you get more done — it lowers the barrier so anyone can produce professional output. Privacy matters too: check whether your data trains the model and whether a no-log or enterprise tier is available. Whether you are a beginner or a pro, the right AI video generator slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

    Read more →
  • Maike Osborne

    Maike Osborne

    Maike Osborne (born Michael Osborne, 1982) is an Australian academic and scientist who serves as a professor of machine learning at the University of Oxford in the Machine Learning Research Group in the Department of Engineering Science. In 2016 she co-founded Mind Foundry, an artificial intelligence company, along with fellow professor Stephen Roberts. == Education == She has a BEng in Mechanical Engineering and a BSc in both Pure Mathematics and Physics from the University of Western Australia. She has a PhD in Machine Learning from the University of Oxford. == Career == Osborne has contributed to over 100 publications, and her work has received over 24,000 citations with an h-index of 46 according to Google Scholar. She has acted as principal or co-investigator for £10.6M of research funding. Her career has focused in particular on Bayesian approaches to AI and machine learning, named after the famous British statistician Thomas Bayes. Osborne's work has contributed to probabilistic numerics, with Osborne co-authoring the first textbook on the subject. In 2013, Osborne co-authored a paper alongside Swedish-German economist Carl Benedikt Frey called "The Future of Employment: How Susceptible are Jobs to Computerisation?". The paper has received over 13,000 citations and extensive media coverage. In 2023 Osborne gave oral evidence to the UK House of Commons Science and Technology Committee on the subject of the "Governance of Artificial Intelligence". Her testimony received significant coverage around her warnings of the threat of "rogue AI". == Honors == She is also an Official Fellow of Exeter College, and St Peter's College, Oxford, a Fellow of the ELLIS society, and a Faculty Member of the Oxford-Man Institute of Quantitative Finance. She joined the Oxford Martin School as Lead Researcher on the Oxford Martin Programme on Technology and Employment in 2015.
She is a Director of the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems.

    Read more →
  • How to Choose an AI Content Generator

    How to Choose an AI Content Generator

    Curious about the best AI content generator? An AI content generator is software that uses machine learning to help you get more done — it combines speed, accuracy, and an interface that just works. Hands-on testing shows real-world results vary, so a short free trial is the smartest way to decide. Whether you are a beginner or a pro, the right AI content generator slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.

    Read more →
  • CLAWS (linguistics)

    CLAWS (linguistics)

    The Constituent Likelihood Automatic Word-tagging System (CLAWS) is a program that performs part-of-speech tagging. It was developed in the 1980s at Lancaster University by the University Centre for Computer Corpus Research on Language. It has an overall accuracy rate of 96–97% with the latest version (CLAWS4) tagging around 100 million words of the British National Corpus. == History == A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'. Developed in the early 1980s, CLAWS was built to fill the ever-growing gap created by always-changing POS necessities. Originally created to add part-of-speech tags to the LOB corpus of British English, the CLAWS tagset has since been adapted to other languages as well, including Urdu and Arabic. Since its inception, CLAWS has been hailed for its functionality and adaptability. Still, it is not without flaws, and though it boasts an error-rate of only 1.5% when judged in major categories, CLAWS still remains with c.3.3% ambiguities unresolved. Ambiguity arises in cases such as with the word flies, and whether it should be classified as a noun or a verb. It's these ambiguities that will require the various upgrades and tagsets that CLAWS will endure. == Rules and processing == CLAWS uses a Hidden Markov model to determine the likelihood of sequences of words in anticipating each part-of-speech label. === Sample output === This excerpt from Bram Stoker's Dracula (1897) has been tagged using both the CLAWS C5 and C7 tagsets. This is what a CLAWS output will generally look like, with the most likely part-of-speech tag following each word. == Tagsets == === CLAWS1 tagset === The first tagset developed in CLAWS, CLAWS1 tagset, has 132 word tags. 
In terms of form and application, the C1 tagset is similar to the Brown Corpus tags. === CLAWS2 tagset === From 1983 to 1986, updated versions leading to CLAWS2 were part of a larger attempt to deal with aspects such as recognizing sentence breaks, in order to avoid the need for manual pre-processing of a text before the tags were applied, moving instead to optional manual post-editing to adjust the output of the automatic annotation, if needed. The CLAWS2 tagset has 166 word tags. === CLAWS4 tagset === CLAWS4 was used for the 100-million-word British National Corpus (BNC). A general-purpose grammatical tagger, it is a successor of the CLAWS1 tagger. In tagging the BNC, the many rounds of work that went into CLAWS4 focused on making the CLAWS program independent from the tagsets. For example, the BNC project used two tagset versions: "a main tagset (C5) with 62 tags with which the whole of the corpus has been tagged, and a larger (C7) tagset with 152 tags, which has been used to make a selected 'core' sample corpus of two million words." The latest version of CLAWS4 is offered by UCREL, a research center of Lancaster University. === CLAWS5 tagset === The CLAWS5 tagset, which was used for the BNC, has over 60 tags. === CLAWS6 tagset === The CLAWS6 tagset was used for the BNC sampler corpus and the COLT corpus. It has over 160 tags, including 13 determiner subtypes. === CLAWS7 tagset === The standard CLAWS7 tagset is the one currently in use. It differs from the CLAWS6 tagset only in the punctuation tags. === CLAWS8 tagset === The CLAWS8 tagset was extended from the C7 tagset with further distinctions in the determiner and pronoun categories, as well as 37 new auxiliary tags for forms of be, do, and have.

    Read more →
  • European Association for Machine Translation

    European Association for Machine Translation

    The European Association for Machine Translation is the European branch of the International Association for Machine Translation. It is a non-profit organisation and organises conferences and workshops on the subject of machine translation. It was registered in 1991 in Switzerland and is the only organisation of its type in Europe.

    Read more →
  • Additive smoothing

    Additive smoothing

    In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth count data, eliminating issues caused by certain values having 0 occurrences. Given a set of observation counts $\mathbf{x} = \langle x_1, x_2, \ldots, x_d \rangle$ from a $d$-dimensional multinomial distribution with $N$ trials, a "smoothed" version of the counts gives the estimator $$\hat{\theta}_i = \frac{x_i + \alpha}{N + \alpha d} \qquad (i = 1, \ldots, d),$$ where the smoothed count is $\hat{x}_i = N\hat{\theta}_i$, and the "pseudocount" α > 0 is a smoothing parameter, with α = 0 corresponding to no smoothing (this parameter is explained in § Pseudocount below). Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical probability (relative frequency) $x_i/N$ and the uniform probability $1/d$. Common choices for α are 0 (no smoothing), 1/2 (the Jeffreys prior), or 1 (Laplace's rule of succession), but the parameter may also be set empirically based on the observed data. From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter α as a prior distribution. In the special case where the number of categories is 2, this is equivalent to using a beta distribution as the conjugate prior for the parameters of the binomial distribution. == History == Laplace came up with this smoothing technique when he tried to estimate the chance that the sun will rise tomorrow. His rationale was that even given a large sample of days with the rising sun, we still can not be completely sure that the sun will still rise tomorrow (known as the sunrise problem).
== Pseudocount == A pseudocount is an amount (not generally an integer, despite its name) added to the number of observed cases in order to change the expected probability in a model of those data, when not known to be zero. It is so named because, roughly speaking, a pseudo-count of value $\alpha$ weighs into the posterior distribution similarly to each category having an additional count of $\alpha$. If the number of occurrences of each item $i$ is $x_i$ out of $N$ samples, the empirical probability of event $i$ is $$p_{i,\text{empirical}} = \frac{x_i}{N},$$ but the posterior probability when additively smoothed is $$p_{i,\alpha\text{-smoothed}} = \frac{x_i + \alpha}{N + \alpha d},$$ as if to increase each count $x_i$ by $\alpha$ a priori. Depending on the prior knowledge, which is sometimes a subjective value, a pseudocount may have any non-negative finite value. It may only be zero (or the possibility ignored) if impossible by definition, such as the possibility of a decimal digit of π being a letter, or a physical possibility that would be rejected and so not counted, such as a computer printing a letter when a valid program for π is run, or excluded and not counted because of no interest, such as if only interested in the zeros and ones. Generally, there is also a possibility that no value may be computable or observable in a finite time (see the halting problem). But at least one possibility must have a non-zero pseudocount, otherwise no prediction could be computed before the first observation. The relative values of pseudocounts represent the relative prior expected probabilities of their possibilities.
The sum of the pseudocounts, which may be very large, represents the estimated weight of the prior knowledge compared with all the actual observations (one for each) when determining the expected probability. In any observed data set or sample there is the possibility, especially with low-probability events and with small data sets, of a possible event not occurring. Its observed frequency is therefore zero, apparently implying a probability of zero. This oversimplification is inaccurate and often unhelpful, particularly in probability-based machine learning techniques such as artificial neural networks and hidden Markov models. By artificially adjusting the probability of rare (but not impossible) events so those probabilities are not exactly zero, zero-frequency problems are avoided. Also see Cromwell's rule. === Choice of pseudocount === ==== Weakly informative prior ==== One common approach is to add 1 to each observed number of events, including the zero-count possibilities. This is sometimes called Laplace's rule of succession. This approach is equivalent to assuming a uniform prior distribution over the probabilities for each possible event (spanning the simplex where each probability is between 0 and 1, and they all sum to 1). Using the Jeffreys prior approach, a pseudocount of one half should be added to each possible outcome. Pseudocounts should be set to one or one-half only when there is no prior knowledge at all – see the principle of indifference. However, given appropriate prior knowledge, the sum should be adjusted in proportion to the expectation that the prior probabilities should be considered correct, despite evidence to the contrary – see further analysis. Higher values are appropriate inasmuch as there is prior knowledge of the true values (for a mint-condition coin, say); lower values inasmuch as there is prior knowledge that there is probable bias, but of unknown degree (for a bent coin, say). 
==== Frequentist interval ==== One way to motivate pseudocounts, particularly for binomial data, is via a formula for the midpoint of an interval estimate, particularly a binomial proportion confidence interval. The best-known is due to Edwin Bidwell Wilson, in Wilson (1927): the midpoint of the Wilson score interval corresponding to $z$ standard deviations on either side is $$\frac{n_S + z^2/2}{n + z^2}.$$ Taking $z = 2$ standard deviations to approximate a 95% confidence interval ($z \approx 1.96$) yields a pseudocount of 2 for each outcome, so 4 in total, colloquially known as the "plus four rule": $$\frac{n_S + 2}{n + 4}.$$ This is also the midpoint of the Agresti–Coull interval (Agresti & Coull 1998). ==== Known incidence rates ==== Often the bias of an unknown trial population is tested against a control population with known parameters (incidence rates) $\boldsymbol{\mu} = \langle \mu_1, \mu_2, \ldots, \mu_d \rangle$. In this case the uniform probability $1/d$ should be replaced by the known incidence rate of the control population $\mu_i$ to calculate the smoothed estimator: $$\hat{\theta}_i = \frac{x_i + \mu_i \alpha d}{N + \alpha d} \qquad (i = 1, \ldots, d).$$ As a consistency check, if the empirical estimator happens to equal the incidence rate, i.e. $\mu_i = x_i/N$, the smoothed estimator is independent of $\alpha$ and also equals the incidence rate. == Applications == === Classification === Additive smoothing is commonly a component of naive Bayes classifiers.
=== Statistical language modelling === In a bag of words model of natural language processing and information retrieval, the data consists of the number of occurrences of each word in a document. Additive smoothing allows the assignment of non-zero probabilities to words which do not occur in the sample. Studies have shown that additive smoothing is more effective than other probability smoothing methods in several retrieval tasks such as language-model-based pseudo-relevance feedback and recommender systems.

    Read more →
  • Markovian discrimination

    Markovian discrimination

    Markovian discrimination is a class of spam filtering methods used in CRM114 and other spam filters to filter based on statistical patterns of transition probabilities between words or other lexical tokens in spam messages that would not be captured using simple bag-of-words naive Bayes spam filtering. == Markovian Discrimination vs. Bag-of-Words Discrimination == A bag-of-words model contains only a dictionary of legal words and their relative probabilities in spam and genuine messages. A Markovian model additionally includes the relative transition probabilities between words in spam and in genuine messages, where the relative transition probability is the likelihood that a given word will be written next, based on what the current word is. Put another way, a bag-of-words filter discriminates based on relative probabilities of single words alone regardless of phrase structure, while a Markovian word-based filter discriminates based on relative probabilities of either pairs of words, or, more commonly, short sequences of words. This allows the Markovian filter greater sensitivity to phrase structure. Neither naive Bayes nor Markovian filters are limited to the word level for tokenizing messages. They may also process letters, partial words, or phrases as tokens. In such cases, specific bag-of-words methods would correspond to general bag-of-tokens methods. Modelers can parameterize Markovian spam filters based on the relative probabilities of any such tokens' transitions appearing in spam or in legitimate messages. == Visible and Hidden Markov Models == There are two primary classes of Markov models, visible Markov models and hidden Markov models, which differ in whether the Markov chain generating token sequences is assumed to have its states fully determined by each generated token (the visible Markov models) or might also have additional state (the hidden Markov models). 
With a visible Markov model, each current token is modeled as if it contains the complete information about previous tokens of the message relevant to the probability of future tokens, whereas a hidden Markov model allows for more obscure conditional relationships. Since those more obscure conditional relationships are more typical of natural language messages including both genuine messages and spam, hidden Markov models are generally preferred over visible Markov models for spam filtering. Due to storage constraints, the most commonly employed model is a specific type of hidden Markov model known as a Markov random field, typically with a 'sliding window' or clique size ranging between four and six tokens.
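To make the contrast with bag-of-words filtering concrete, here is a toy sketch of a visible (bigram) Markov scorer in Python. It is not the algorithm used by CRM114; the function names, the additive smoothing of transition counts, and the tiny training messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def train_bigrams(messages):
    """Count word-to-word transitions in a list of messages."""
    pairs, firsts, vocab = Counter(), Counter(), set()
    for msg in messages:
        tokens = msg.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            pairs[(prev, cur)] += 1   # times cur followed prev
            firsts[prev] += 1         # times prev had any successor
    return pairs, firsts, vocab

def log_prob(tokens, model, vocab_size, alpha=1.0):
    """Sum of log transition probabilities, with additive smoothing."""
    pairs, firsts, _ = model
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (pairs[(prev, cur)] + alpha) / (firsts[prev] + alpha * vocab_size)
        total += math.log(p)
    return total

def spam_score(message, spam_model, ham_model):
    """Log-likelihood ratio; positive values lean towards spam."""
    tokens = message.lower().split()
    vocab_size = len(spam_model[2] | ham_model[2])
    return (log_prob(tokens, spam_model, vocab_size)
            - log_prob(tokens, ham_model, vocab_size))

spam = train_bigrams(["free money now", "free money today"])
ham = train_bigrams(["money is not free", "now is the time"])
print(spam_score("free money", spam, ham))   # positive
print(spam_score("money free", spam, ham))   # negative
```

The two test phrases contain the same words, so a pure bag-of-words score would treat them identically; only the transition model separates them by word order.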

    Read more →
  • Weak supervision

    Weak supervision

    Weak supervision (also known as semi-supervised learning) is a paradigm in machine learning whose relevance and notability increased with the advent of large language models, owing to the large amount of data required to train them. It is characterized by combining a small amount of human-labeled data (used exclusively in the more expensive and time-consuming supervised learning paradigm) with a large amount of unlabeled data (used exclusively in the unsupervised learning paradigm). In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, the learning problem can be seen as an exam, with the labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. == Problem == The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process may thus render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning. 
== Technique == More formally, semi-supervised learning assumes that a set of $l$ independently identically distributed examples $x_1, \dots, x_l \in X$ with corresponding labels $y_1, \dots, y_l \in Y$ and $u$ unlabeled examples $x_{l+1}, \dots, x_{l+u} \in X$ are processed. Semi-supervised learning combines this information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning. Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data $x_{l+1}, \dots, x_{l+u}$ only. The goal of inductive learning is to infer the correct mapping from $X$ to $Y$. It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably. == Assumptions == In order to make any use of unlabeled data, some relationship to the underlying distribution of data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions: === Continuity / smoothness assumption === Points that are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so few points are close to each other but in different classes. 
=== Cluster assumption === The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data that shares a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms. === Manifold assumption === The data lie approximately on a manifold of much lower dimension than the input space. In this case, learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold. The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom. For instance, human voice is controlled by a few vocal folds, and images of various facial expressions are controlled by a few muscles. In these cases, it is better to consider distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images, respectively. == History == The heuristic approach of self-training (also known as self-learning or self-labeling) is historically the oldest approach to semi-supervised learning, with examples of applications starting in the 1960s. The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s. Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995. == Methods == === Generative models === Generative approaches to statistical learning first seek to estimate $p(x|y)$, the distribution of data points belonging to each class. 
The probability $p(y|x)$ that a given point $x$ has label $y$ is then proportional to $p(x|y)p(y)$ by Bayes' rule. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about $p(x)$) or as an extension of unsupervised learning (clustering plus some labels). Generative models assume that the distributions take some particular form $p(x|y,\theta)$ parameterized by the vector $\theta$. If these assumptions are incorrect, the unlabeled data may actually decrease the accuracy of the solution relative to what would have been obtained from labeled data alone. However, if the assumptions are correct, then the unlabeled data necessarily improves performance. The unlabeled data are distributed according to a mixture of individual-class distributions. In order to learn the mixture distribution from the unlabeled data, it must be identifiable, that is, different parameters must yield different summed distributions. Gaussian mixture distributions are identifiable and commonly used for generative models. The parameterized joint distribution can be written as $p(x,y|\theta) = p(y|\theta)\,p(x|y,\theta)$ by using the chain rule. Each parameter vector $\theta$ is associated with a decision function $f_\theta(x) = \underset{y}{\operatorname{argmax}}\ p(y|x,\theta)$. 
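The chain-rule factorisation and the decision function just described can be illustrated with a small Python sketch. It assumes one-dimensional Gaussian class-conditional densities and a hand-picked parameter set `theta`; the function names and all numbers are hypothetical, chosen only to show the mechanics.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def joint(x, y, theta):
    """p(x, y | theta) = p(y | theta) * p(x | y, theta)."""
    prior, mean, std = theta[y]
    return prior * gaussian_pdf(x, mean, std)

def marginal(x, theta):
    """Mixture density p(x | theta), which unlabeled points follow."""
    return sum(joint(x, y, theta) for y in theta)

def posterior(x, y, theta):
    """p(y | x, theta) by Bayes' rule."""
    return joint(x, y, theta) / marginal(x, theta)

def decide(x, theta):
    """Decision function: the label with the largest posterior."""
    return max(theta, key=lambda y: joint(x, y, theta))

# Assumed parameters: label -> (class prior, mean, standard deviation).
theta = {"a": (0.5, 0.0, 1.0), "b": (0.5, 4.0, 1.0)}
print(decide(0.5, theta), decide(3.0, theta))
print(posterior(2.0, "a", theta))
```

Because the marginal is the same for every label, `decide` compares joint densities directly instead of normalised posteriors.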
The parameter is then chosen based on fit to both the labeled and unlabeled data, weighted by $\lambda$: $\underset{\Theta}{\operatorname{argmax}} \left( \log p(\{x_i, y_i\}_{i=1}^{l} | \theta) + \lambda \log p(\{x_i\}_{i=l+1}^{l+u} | \theta) \right)$ === Low-density separation === Another major class of methods attempts to place boundaries in regions with few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas support vector machines for supervised learning seek a decision boundary with maximal margin over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard hinge loss $(1 - y f(x))_+$ for labeled data, a loss function $(1 - |f(x)|)_+$ is introduced over the unlabeled data by letting $y = \operatorname{sign} f(x)$. TSVM then selects $f^*(x) = h^*(x) + b$ from a reproducing kernel Hilbert space $\mathcal{H}$ by minimizing the regularized empirical risk: $f^* = \underset{f}{\operatorname{argmin}} \left( \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \lambda_1 \|h\|_{\mathcal{H}}^2 + \lambda_2 \sum_{i=l+1}^{l+u} (1 - |f(x_i)|)_+ \right)$ An exact solution is intractable due to the non-convex term $(1 - |f(x)|)_+$ …
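The regularized risk above can be evaluated, though not minimized, with a short Python sketch. It assumes a linear decision function f(x) = w·x + b, so that the RKHS norm reduces to the squared Euclidean norm of w; the function names and the sample points are illustrative assumptions.

```python
def hinge(margin):
    """(1 - margin)_+ : zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)

def f(x, w, b):
    """Linear decision function f(x) = w . x + b (an assumed special case)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def tsvm_objective(w, b, labeled, unlabeled, lam1=1.0, lam2=1.0):
    """Value of the TSVM regularized empirical risk for given w and b.

    labeled:   list of (x, y) pairs with y in {-1, +1}
    unlabeled: list of x vectors
    """
    labeled_loss = sum(hinge(y * f(x, w, b)) for x, y in labeled)
    norm_sq = sum(wi * wi for wi in w)   # ||h||^2 for a linear kernel
    # Unlabeled points are penalized for falling inside the margin,
    # whichever side of the boundary they are on.
    unlabeled_loss = sum(hinge(abs(f(x, w, b))) for x in unlabeled)
    return labeled_loss + lam1 * norm_sq + lam2 * unlabeled_loss

labeled = [((2.0,), 1), ((-2.0,), -1)]
unlabeled = [(0.25,), (-3.0,)]
print(tsvm_objective((1.0,), 0.0, labeled, unlabeled))
```

The point at 0.25 sits inside the margin and contributes to the unlabeled term, which is what pushes the boundary towards low-density regions. Finding the minimizing w and b is the hard, non-convex part and is deliberately left out of this sketch.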

    Read more →
  • Brian D. Ripley

    Brian D. Ripley

    Brian David Ripley FRSE (born 29 April 1952) is a British statistician. From 1990, he was professor of applied statistics at the University of Oxford and also a professorial fellow at St Peter's College. He retired in August 2014 due to ill health. == Biography == Ripley has made contributions to the fields of spatial statistics and pattern recognition. His work on artificial neural networks in the 1990s helped to bring aspects of machine learning and data mining to the attention of statistical audiences. He emphasised the value of robust statistics in his books Pattern Recognition and Neural Networks and Modern Applied Statistics with S. Ripley helped develop the S-PLUS programming language and its open source derivative R. He co-authored two books based on S, S Programming and Modern Applied Statistics with S. Since mid-1997 he has been a member of the "R Core Team", and from 2000 to 2021 he was one of the most active committers to the R core. The package MASS is one of only fifteen "recommended packages" for R (out of more than 20,900 packages available as of June 2024). He was educated at the University of Cambridge, where he was awarded both the Smith's Prize (at the time awarded to the best graduate essay writer who had been an undergraduate at Cambridge in that cohort) and the Rollo Davidson Prize. The university also awarded him the Adams Prize in 1987 for an essay entitled Statistical Inference for Spatial Processes, later published as a book. He served on the faculty of Imperial College, London from 1976 until 1983, at which point he moved to the University of Strathclyde. == Authored books == Ripley, B. D. (1981) Spatial Statistics. Wiley, 252pp. ISBN 0-471-08367-4. Ripley, B. D. (1983) Stochastic Simulation. Wiley, ISBN 0-471-81884-4. Ripley, B. D. (1988) Statistical Inference for Spatial Processes. Cambridge University Press. ISBN 0-521-35234-7. Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press. 403pp. ISBN 0-521-46086-7. Venables, W. N. 
and Ripley, B. D. (2000) S Programming. Springer, 264pp. ISBN 978-0-387-98966-2. Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S (Fourth Edition; previous editions published as Modern Applied Statistics with S-PLUS in 1994, 1997 & 1999). Springer, 462pp. ISBN 978-0-387-95457-8.

    Read more →
  • How to Choose an AI Code Generator

    How to Choose an AI Code Generator

    Shopping for the best AI code generator? An AI code generator is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI code generator slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →