List of datasets for machine-learning research

List of datasets for machine-learning research

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce. Many organizations, including governments, publish and share their datasets, often using common metadata formats (such as Croissant). The datasets are classified, based on the licenses, into two groups: open data and non-open data. The datasets from various governmental-bodies are presented in List of open government data sites. The datasets are ported on open data portals. They are made available for searching, depositing and accessing through interfaces like Open API. The datasets are made available as various sorted types and subtypes. == List of sorting used for datasets == The data portal is classified based on its type of license. The open source license based data portals are known as open data portals which are used by many government organizations and academic institutions. == List of open data portals == == List of portals suitable for multiple types of applications == The data portal sometimes lists a wide variety of subtypes of datasets pertaining to many machine learning applications. == List of portals suitable for a specific subtype of applications == The data portals which are suitable for a specific subtype of machine learning application are listed in the subsequent sections. == Image data == == Text data == These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis. === Reviews === === News articles === === Messages === === Twitter and tweets === === Dialogues === === Legal === === Other text === == Sound data == These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis. === Speech === === Music === === Other sounds === == Signal data == Datasets containing electric signal information requiring some sort of signal processing for further analysis. === Electrical === === Motion-tracking === === Other signals === == Chemical data == Datasets from physical systems. === Chemical Reactions with transition states (TS) === === OpenReACT-CHON-EFH === OpenReACT-CHON-EFH (Open Reaction Dataset of Atomic ConfiguraTions comprising C, H, O and N with Energies, Forces and Hessians) is a 2025 open-access benchmark for machine-learning interatomic potentials. RTP set – 35,087 stationary-point geometries (reactant, transition state and product) drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G(d) level. IRC set – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points. NMS set – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations. The collection underpins the study Does Hessian Data Improve the Performance of Machine Learning Potentials? and was used to train and benchmark the machine-learning interatomic potentials reported therein. The dataset itself is distributed under a CC licence via Figshare. == Physical data == Datasets from physical systems. === High-energy physics === === Systems === === Astronomy === === Earth science === === Other physical === == Biological data == Datasets from biological systems. === Human === === Animal === === Fungi === === Plant === === Microbe === === Drug discovery === == Anomaly data == == Question answering data == This section includes datasets that deals with structured data. == Dialog or instruction prompted data == This section includes datasets that contains multi-turn text with at least two actors, a "user" and an "agent". The user makes requests for the agent, which performs the request. == Cybersecurity == == Climate and sustainability == == Code data == == Multivariate data == === Financial === === Weather === === Census === === Transit === === Internet === === Games === === Other multivariate === == Curated repositories of datasets == As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research. OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms. PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API. Metatext NLP: https://metatext.io/datasets web repository maintained by community, containing nearly 1000 benchmark datasets, and counting. Provides many tasks from classification to QA, and various languages from English, Portuguese to Arabic. Appen: Off The Shelf and Open Source Datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.

LIVAC Synchronous Corpus

LIVAC is an uncommon language corpus dynamically maintained since 1995. Different from other existing corpora, LIVAC has adopted a rigorous and regular "Windows" approach in processing and filtering massive media texts from representative Chinese speech communities such as Beijing, Hong Kong, Macau, Taipei, Singapore, Shanghai, as well as Guangzhou, and Shenzhen. The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Taiwan Strait news, as well as news on finance, sports and entertainment. By 2023, more than 3 billion characters of news media texts have been filtered, of which 700 million characters have been processed and analyzed and have yielded an expanding Pan-Chinese dictionary of 2.5 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has at the same time accumulated a large amount of accurate and meaningful statistical data on the Chinese language and on their diverse speech communities in the Pan-Chinese context, and the results show considerable and important long standing as well as evolving variations. The "Windows" approach is the most innovative feature of LIVAC and has enabled Pan-Chinese media texts to be quantitatively analyzed according to various attributes such as locations, time and subject domains. Thus, various types of comparative studies and applications in information technology as well as development of often related innovative applications have been possible. Moreover, LIVAC has allowed longitudinal developments to be taken into account, facilitating Key Word in Context (KWIC) search and comprehensive study of target words and their underlying concepts as well as linguistic structures over the past 25 years, based on the above mentioned variables of location, time and subject. Results from the extensive and accumulative data analysis contained in LIVAC have enabled the cultivation of textual databases of proper names, place names, organization names, new words, and bi-weekly and annual rosters of media figures. Related applications have included the establishment of verb and adjective databases, the formulation of sentiment indices, and related opinion mining, to measure and compare the popularity of global media figures in the Chinese media (LIVAC Annual Pan-Chinese Celebrity Rosters, later renamed as the Pan-Chinese Newsmaker Rosters). Notable among these are the decades long periodic reviews of the 25 years of annual pan-Chinese rosters since 2000 and compilation of new word databases (LIVAC Annual Pan-Chinese New Word Rosters). On this basis, the analysis of the emergence, diffusion and transformation of new words, and the publication of dictionaries of neologisms have been made possible. A recent focus is on the relative balance between disyllabic words and growing trisyllabic words in the Chinese language, and the comparative study of light verbs in three Chinese speech communities. as well as the link between the language use and use of language as a reflection of epochal change in China. A new LIVAC version 3.1 was launched in February 2024. == Corpus data processing == Accessing media texts, manual input, etc. Text unification including conversion from simplified to traditional Chinese characters, stored as Big5 and Unicode versions Automatic word segmentation Automatic alignment of parallel texts Manual verification, part-of-speech tagging Extraction of words and addition to regional sub-corpora Combination of regional sub-corpora to update the LIVAC corpus, and master lexical database == Labeling for data curation == Categories used include general terms and proper names, such as: general names, surnames, semi titles; geographical, organizations and commercial entities, etc.; time, prepositions, locations, etc.; stack-words; loanwords; case-word; numerals, etc. Construction of databases of proper names, place names, and specific terms, etc. Generate rosters: "new word rosters", "celebrity or media personality rosters", "place name rosters", compound words and matched words Other parts of speech tagging for sub-database, such as common nouns, numerals, numeral classifiers, different types of verbs, and of adjectives, pronouns, adverbs, prepositions, conjunctions, particles marking mood, onomatopoeia, interjection, etc. == Applications == Compilation of Pan-Chinese dictionaries or local dictionaries Information technology research, such as predictive Chinese text input for mobile phones, automatic speech to text conversion, opinion mining Comparative studies on linguistic and cultural developments in the Pan-Chinese regions, especially in a critical period of history in modern China. Language teaching and learning research, and speech-to-text conversion Customized service on linguistic research and lexical search for international corporations and government agencies The above applications are provided by the following functions: Word Segmentation Search Phrase Search Example Sentence Selection Multi-word Comparison Word Cloud

Karl Steinbuch

Karl W. Steinbuch (June 15, 1917 in Stuttgart-Bad Cannstatt – June 4, 2005 in Ettlingen) was a German computer scientist, cyberneticist, and electrical engineer. He was an early and influential researcher in German computer science, and was the developer of the Lernmatrix, an early implementation of artificial neural networks. From the late 1960s onwards the focus of his activity shifted from scientific research to right-wing political activism supporting the Neue Rechte. == Biography == Steinbuch joined the National Socialist German Students' League (NSDStB) and the Nazi Party. Steinbuch studied at the University of Stuttgart and in 1944 he received his PhD in physics. In 1948 he joined Standard Elektrik Lorenz (SEL, part of the ITT group) in Stuttgart, as a computer design engineer and later as a director of research and development, where he filed more than 70 patents. Steinbuch completed the first European fully transistorized computer, the ER 56 marketed by SEL. In 1958 he became professor and director of the Institute of Technology for information processing (ITIV) of the University of Karlsruhe, where he retired in 1980. In 1967 he began publishing books, in which he tried to influence German education policy. Together with books from colleagues like Jean Ziegler from Switzerland, Eric J. Hobsbawm from the UK, and John Naisbitt his books predicted what he regarded as the coming education disaster of the emerging civic lobby society. In 1957, together with Helmut Gröttrup, Steinbuch coined the term Informatik, the German word for computer science, which gave informatics, and the term kybernetische Anthropologie. == Awards and recognition == Wilhelm-Boelsche award - medal in Gold German non-fiction book award Gold medal award of the XXI. International Congresses on Aerospace Medicine Konrad Adenauer award of science Jakob Fugger award medal Medal of merit of the state of Baden-Wuerttemberg member, German Academy of Sciences Leopoldina member, International Academy of Science, Munich. grants from a state government grants program, named "Karl-Steinbuch-Stipendium" Steinbuch Centre for Computing at the Karlsruhe Institute of Technology named after him == Books == Steinbuch wrote several books and articles, including: 1957 Informatik: Automatische Informationsverarbeitung ("Informatics: automatic information processing"). 1963 Learning matrices and their applications (together with U. A. W. Piske) 1965 A critical comparison of two kinds of adaptive classification networks (together with Bernard Widrow) 1966 (1969): Die informierte Gesellschaft. Geschichte und Zukunft der Nachrichtentechnik (The informed society. History and Future of telecommunications) 1989: Die desinformierte Gesellschaft (The disinformed society) 1968: Falsch programmiert. Über das Versagen unserer Gesellschaft in der Gegenwart und vor der Zukunft und was eigentlich geschehen müßte. (as a bestseller listet in: Der Spiegel) (Programmed falsely. About our society's failure in the present and with respect to the future and what should be done.) 1969: Programm 2000. (as a bestseller listet in: Der Spiegel) 1971: Automat und Mensch. Auf dem Weg zu einer kybernetischen Anthropologie (Machine and Man. On the way to a cybernetic anthropology; 4th revised edition) 1971: Mensch Technik Zukunft. Probleme von Morgen (German non-fiction book award) (Man Technology Future. Problems of Tomorrow) 1973: Kurskorrektur (Correcting the Course) 1978: Maßlos informiert. Die Enteignung des Denkens (Excessively informed. The Deprivation of Thinking) 1984: Unsere manipulierte Demokratie. Müssen wir mit der linken Lüge leben? (Our Thought-controlled Democracy. Do we have to live with the leftist lie?)

Phraselator

The Phraselator is a weatherproof handheld language translation device developed by Applied Data Systems and VoxTec, a former division of the military contractor Marine Acoustics, located in Annapolis, Maryland, USA. It was designed to serve as a handheld computer device that translates English into one of 40 different languages. == The device == The Phraselator is a small speech translation PDA-sized device designed to aid in interpretation. The device does not produce synthesized speech like that utilized by Stephen Hawking; instead, it plays pre-recorded foreign language MP3 files. Users can select the phrase they wish to convey from an English list on the screen or speak into the device. It then uses speech recognition technology called DynaSpeak, developed by SRI International, to play the proper sound file. The accuracy of the speech recognition software is over 70 percent according to software developer Jack Buchanan. The device can also record replies for translation later. Pre-recorded phrases are stored on Secure Digital flash memory cards. A 128 MB card can hold up to 12,000 phrases in four or five languages. Users can download phrase modules from the official website, which contained over 300,000 phrases as of March 2005. Users can also construct their own custom phrase modules. Earlier devices were known to have run on an SA-1110 Strong Arm 206 MHz CPU with 32MB SDRAM and 32MB onboard Flash RAM. A newer model, the P2, was released in 2004 and developed according to feedback from U.S. soldiers. It translates one way from English to approximately 60 other languages. It has a directional microphone, a larger library of phrases and a longer battery life. The 2004 release was created by and utilizes a computer board manufactured by InHand Electronics, Inc. In the future, the device will be able to display pictures so users can ask questions such as "Have you seen this person?" Developer Ace Sarich notes that the device is inferior to human interpreter. Conclusions derived from a Nepal field test conducted by U.S. and Nepal based NGO Himalayan Aid in 2004 seemed to confirm Sarich's comparisons: The very concept of using a machine as a communication point between individuals seemed to actually encourage a more limited form of interaction between tester and respondent. Usually, when limited language skills are present between parties, the genuine struggle and desire to communicate acts as a display of good will – we openly display our weakness in this regard – and the result is a more relaxed and human encounter. This was not necessarily present with the Phraselator as all parties abandoned learning about each other and instead focused on learning how to work with the device. As a tool for bridging any cultural differences or communicating effectively at any length, the Phraselator would not be recommended. This device, at least in the form tested, would best be used in large-scale operations where there is no time for language training and there is a need to communicate fixed ideas, quickly, over the greatest distance by employing large amounts of unskilled users. Large humanitarian or natural disasters in remote areas of third-world countries might be an effective example. == Origin == The original idea for the device came from Lee Morin, a Navy doctor in Operation Desert Storm. To communicate with patients, he played Arabic audio files from his laptop. He informed Ace Sarich, the vice president of VoxTec, about the idea. VoxTec won a DARPA Small Business Innovation Research grant in early 2001 to develop a military-grade handheld phrase translator. During its development, the Phraselator was tested and evaluated by scientists from the Army Research Laboratory. The device was first field tested in Afghanistan in 2001. By 2002, about 500 Phraselators were built for soldiers around the world with another 250 ordered by the U.S. Special Forces. The device cost $2000 to develop and could convert spoken English into one of 200,000 recorded commands and questions in 30 languages. However, the device could only translate one-way. At the time, the only existing two-way voice translator that could convert speech back and forth between languages was the Audio Voice Translation Guide System, or TONGUES, which was developed by Carnegie Mellon University for Lockheed Martin. As part of a DARPA program known as the Spoken Language Communication and Translation System for Tactical Use, SRI International has further developed two-way translation software for use in Iraq called IraqComm in 2006 which contains a vocabulary of 40,000 English words and 50,000 words in Iraqi Arabic. == Notable users == The handheld translator was recently used by U.S. troops while providing relief to tsunami victims in early 2005. About 500 prototypes of the device were provided to U.S. military forces in Operation Enduring Freedom. Units loaded with Haitian dialects have been provided to U.S. troops in Haiti. Army military police have used it in Kandahar to communicate with POWs. In late 2004, the U.S. Navy began to augment some ships with a version of the device attached to large speakers in order to broadcast clear voice instructions up to 400 yards (370 m) away. Corrections officers and law enforcement in Oneida County, New York, have tested the device. Hospital emergency rooms and health departments have also evaluated it. Several Native American tribes such as the Choctaw Nation, the Ponca, and the Comanche Nation have also used the device to preserve their dying languages. Various law enforcement agencies, such as the Los Angeles Police Department, also use the phraselator in their patrol cars. == Awards == In March 2004, DARPA director Dr. Tony Tether presented the Small Business Innovative Research Award to the VoxTec division of Marine Acoustics at DARPATech 2004 in Anaheim, CA. The device was recently listed as one of "Ten Emerging Technologies That Will Change Your World" in MIT's Technology Review. == Pop culture == Software developer Jack Buchanan believes that building a device similar to the fictional universal translator seen in Star Trek would be harder than building the Enterprise. The device was mentioned in a list of "Top 10 Star Trek Tech" on Space.com.

Self-verifying finite automaton

In automata theory, a self-verifying finite automaton (SVFA) is a special kind of a nondeterministic finite automaton (NFA) with a symmetric kind of nondeterminism introduced by Hromkovič and Schnitger. Generally, in self-verifying nondeterminism, each computation path is concluded with any of the three possible answers: yes, no, and I do not know. For each input string, no two paths may give contradictory answers, namely both answers yes and no on the same input are not possible. At least one path must give answer yes or no, and if it is yes then the string is considered accepted. SVFA accept the same class of languages as deterministic finite automata (DFA) and NFA but have different state complexity. == Formal definition == An SVFA is represented formally by a 6-tuple, A=(Q, Σ, Δ, q0, Fa, Fr) such that (Q, Σ, Δ, q0, Fa) is an NFA, and Fa, Fr are disjoint subsets of Q. For each word w = a1a2 … an, a computation is a sequence of states r0,r1, …, rn, in Q with the following conditions: r0 = q0 ri+1 ∈ Δ(ri, ai+1), for i = 0, …, n−1. If rn ∈ Fa then the computation is accepting, and if rn ∈ Fr then the computation is rejecting. There is a requirement that for each w there is at least one accepting computation or at least one rejecting computation but not both. == Results == Each DFA is a SVFA, but not vice versa. Jirásková and Pighizzini proved that for every SVFA of n states, there exists an equivalent DFA of g ( n ) = Θ ( 3 n / 3 ) {\displaystyle g(n)=\Theta (3^{n/3})} states. Furthermore, for each positive integer n, there exists an n-state SVFA such that the minimal equivalent DFA has exactly g ( n ) {\displaystyle g(n)} states. Other results on the state complexity of SVFA were obtained by Jirásková and her colleagues.

Charge-coupled device

A charge-coupled device (CCD) is an integrated circuit containing an array of linked, or coupled, capacitors. Under the control of an external circuit, each capacitor can transfer its electric charge to a neighboring capacitor. CCD sensors are a major technology used in digital imaging. In a CCD image sensor, pixels are represented by p-doped metal–oxide–semiconductor (MOS) capacitors. These MOS capacitors, the basic building blocks of a CCD, are biased above the threshold for inversion when image acquisition begins, allowing the conversion of incoming photons into electron charges at the semiconductor-oxide interface; the CCD is then used to read out these charges. Although CCDs are not the only technology to allow for light detection, CCD image sensors are widely used in professional, medical, and scientific applications where high-quality image data are required. In applications with less exacting quality demands, such as consumer and professional digital cameras, active pixel sensors, also known as CMOS sensors (complementary MOS sensors), are generally used. However, the large quality advantage CCDs enjoyed early on has narrowed over time and since the late 2010s CMOS sensors are the dominant technology, having largely if not completely replaced CCD image sensors. == History == The basis for the CCD is the metal–oxide–semiconductor (MOS) structure, with MOS capacitors being the basic building blocks of a CCD, and a depleted MOS structure used as the photodetector in early CCD devices. In the late 1960s, Willard Boyle and George E. Smith at Bell Labs were researching MOS technology while working on semiconductor bubble memory. They realized that an electric charge was the analog of the magnetic bubble and that it could be stored on a tiny MOS capacitor. As it was fairly straightforward to fabricate a series of MOS capacitors in a row, they connected a suitable voltage to them so that the charge could be stepped along from one to the next. This led to the invention of the charge-coupled device by Boyle and Smith in 1969. They conceived of the design of what they termed, in their notebook, "Charge 'Bubble' Devices". The initial paper describing the concept in April 1970 listed possible uses as memory, a delay line, and an imaging device. The device could also be used as a shift register. The essence of the design was the ability to transfer charge along the surface of a semiconductor from one storage capacitor to the next. The first experimental device demonstrating the principle was a row of closely spaced metal squares on an oxidized silicon surface electrically accessed by wire bonds. It was demonstrated by Gil Amelio, Michael Francis Tompsett and George Smith in April 1970. This was the first experimental application of the CCD in image sensor technology, and used a depleted MOS structure as the photodetector. The first patent (U.S. patent 4,085,456) on the application of CCDs to imaging was assigned to Tompsett, who filed the application in 1971. The first working CCD made with integrated circuit technology was a simple 8-bit shift register, reported by Tompsett, Amelio and Smith in August 1970. This device had input and output circuits and was used to demonstrate its use as a shift register and as a crude eight pixel linear imaging device. Development of the device progressed at a rapid rate. By 1971, Bell researchers led by Michael Tompsett were able to capture images with simple linear devices. Several companies, including Fairchild Semiconductor, RCA and Texas Instruments, picked up on the invention and began development programs. Fairchild's effort, led by ex-Bell researcher Gil Amelio, was the first with commercial devices, and by 1974 had a linear 500-element device and a 2D 100 × 100 pixel device. Peter L. P. Dillon, a scientist at Kodak Research Labs, invented the first color CCD image sensor by overlaying a color filter array on this Fairchild 100 x 100 pixel Interline CCD starting in 1974. Steven Sasson, an electrical engineer working for the Kodak Apparatus Division, invented a digital still camera using this same Fairchild 100 × 100 CCD in 1975. The interline transfer (ILT) CCD device was proposed by L. Walsh and R. Dyck at Fairchild in 1973 to reduce smear and eliminate a mechanical shutter. To further reduce smear from bright light sources, the frame-interline-transfer (FIT) CCD architecture was developed by K. Horii, T. Kuroda and T. Kunii at Matsushita (now Panasonic) in 1981. The first KH-11 KENNEN reconnaissance satellite equipped with charge-coupled device array (800 × 800 pixels) technology for imaging was launched in December 1976. Under the leadership of Kazuo Iwama, Sony started a large development effort on CCDs involving a significant investment. Eventually, Sony managed to mass-produce CCDs for their camcorders. Before this happened, Iwama died in August 1982. Subsequently, a CCD chip was placed on his tombstone to acknowledge his contribution. The first mass-produced consumer CCD video camera, the CCD-G5, was released by Sony in 1983, based on a prototype developed by Yoshiaki Hagiwara in 1981. Early CCD sensors suffered from shutter lag. This was largely resolved with the invention of the pinned photodiode (PPD). It was invented by Nobukazu Teranishi, Hiromitsu Shiraki and Yasuo Ishihara at NEC in 1980. They recognized that lag can be eliminated if the signal carriers could be transferred from the photodiode to the CCD. This led to their invention of the pinned photodiode, a photodetector structure with low lag, low noise, high quantum efficiency and low dark current. It was first publicly reported by Teranishi and Ishihara with A. Kohono, E. Oda and K. Arai in 1982, with the addition of an anti-blooming structure. The new photodetector structure invented at NEC was given the name "pinned photodiode" (PPD) by B.C. Burkey at Kodak in 1984. In 1987, the PPD began to be incorporated into most CCD devices, becoming a fixture in consumer electronic video cameras and then digital still cameras. Since then, the PPD has been used in nearly all CCD sensors and then CMOS sensors. In January 2006, Boyle and Smith were awarded the National Academy of Engineering Charles Stark Draper Prize, and in 2009 they were awarded the Nobel Prize for Physics for their invention of the CCD concept. Michael Tompsett was awarded the 2010 National Medal of Technology and Innovation, for pioneering work and electronic technologies including the design and development of the first CCD imagers. He was also awarded the 2012 IEEE Edison Medal for "pioneering contributions to imaging devices including CCD Imagers, cameras and thermal imagers". == Basics of operation == In a CCD for capturing images, there is a photoactive region (an epitaxial layer of silicon), and a transmission region made out of a shift register (the CCD, properly speaking). An image is projected through a lens onto the capacitor array (the photoactive region), causing each capacitor to accumulate an electric charge proportional to the light intensity at that location. A one-dimensional array, used in line-scan cameras, captures a single slice of the image, whereas a two-dimensional array, used in video and still cameras, captures a two-dimensional picture corresponding to the scene projected onto the focal plane of the sensor. Once the array has been exposed to the image, a control circuit causes each capacitor to transfer its contents to its neighbor (operating as a shift register). The last capacitor in the array dumps its charge into a charge amplifier, which converts the charge into a voltage. By repeating this process, the controlling circuit converts the entire contents of the array in the semiconductor to a sequence of voltages. In a digital device, these voltages are then sampled, digitized, and usually stored in memory; in an analog device (such as an analog video camera), they are processed into a continuous analog signal (e.g. by feeding the output of the charge amplifier into a low-pass filter), which is then processed and fed out to other circuits for transmission, recording, or other processing. == Detailed physics of operation == === Charge generation === Before the MOS capacitors are exposed to light, they are biased into the depletion region; in n-channel CCDs, the silicon under the bias gate is slightly p-doped or intrinsic. The gate is then biased at a positive potential, above the threshold for strong inversion, which will eventually result in the creation of an n channel below the gate as in a MOSFET. However, it takes time to reach this thermal equilibrium: up to hours in high-end scientific cameras cooled at low temperature. Initially after biasing, the holes are pushed far into the substrate, and no mobile electrons are at or near the surface; the CCD thus operates in a non-equilibrium state called deep depletion. Then, when electron–hole pairs are generated in the depletion region, they are separated by the electric field, the elec

Simon Godsill

Simon John Godsill (born 2 December 1965) is professor of statistical signal processing at the University of Cambridge, and a professorial fellow at Corpus Christi College. He is also a member of the Centre for Science and Policy. His main area of research is Bayesian statistics and stochastic sampling methodologies, particularly particle filtering. == Education == Godsill obtained both undergraduate and Ph.D. degrees from the Department of Engineering at Cambridge University, whilst a member of Selwyn College. He obtained a first class degree in the Electrical and Information Sciences Tripos. The title of his 1993 Ph.D. thesis was "The Restoration of Degraded Audio Signals" and his Ph.D. supervisor was Peter Rayner, whom he shared with Michael Richard Lynch. == Career == Godsill has published over 250 articles in peer reviewed journals, along with the books Digital audio restoration: a statistical model based approach and Compressed sensing & sparse filtering. == Business interests == Godsill is currently a director of CEDAR Audio Ltd, a Cambridge-based company that applies Bayesian mathematics for purposes of noise reduction in audio data. In February 2005, the company received a Sci-Tech Academy Award (a 'Technical Oscar') for its services to the movie industry, and a stream of innovations appeared over the following years with corresponding recognition including induction into the Audio Technology Hall of Fame (2008), a Cinema Audio Society Award (2009). Godsill is also a director at Input Dynamics Ltd, a Cambridge-based company that applies Bayesian techniques to touch screen technology. Godsill is involved with the research effort at BMLL Technologies, a Cambridge spin-off working in the field of machine learning application in the financial sector.