AI Data Jobs

AI Data Jobs — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Patent visualisation

    Patent visualisation

    Patent visualisation is an application of information visualisation. The number of patents has been increasing, encouraging companies to consider intellectual property as a part of their strategy. Patent visualisation, like patent mapping, is used to quickly view a patent portfolio. Software dedicated to patent visualisation began to appear in 2000, for example Aureka from Aurigin (now owned by Thomson Reuters). Many patent and portfolio analytics platforms, such as Questel, Patent Forecast, PatSnap, Patentcloud, Relecura, and Patent iNSIGHT Pro, offer options to visualise specific data within patent documents by creating topic maps, priority maps, IP Landscape reports, etc. Software converts patents into infographics or maps, to allow the analyst to "get insight into the data" and draw conclusions. Also called patinformatics, it is the "science of analysing patent information to discover relationships and trends that would be difficult to see when working with patent documents on a one-and-one basis". Patents contain structured data (like publication numbers) and unstructured text (like title, abstract, claims and visual info). Structured data are processed by data-mining and unstructured data are processed with text-mining. == Data mining == The main step in processing structured information is data-mining, which emerged in the late 1980s. Data mining involves statistics, artificial intelligence, and machine learning. Patent data mining extracts information from the structured data of the patent document. These structured data are bibliographic fields such as location, date or status. === Structured fields === === Advantages === Data mining allows study of filing patterns of competitors and locates main patent filers within a specific area of technology. This approach can be helpful to monitor competitors' environments, moves and innovation trends and gives a macro view of a technology status. == Text-mining == === Principle === Text mining is used to search through unstructured text documents. This technique is widely used on the Internet, it has had success in bioinformatics and now in the intellectual property environment. Text mining is based on a statistical analysis of word recurrence in a corpus. An algorithm extracts words and expressions from title, summary and claims and gathers them by declension. "And" and "if" are labeled as non-information bearing words and are stored in the stopword list. Stoplists can be specialised in order to create an accurate analysis. Next, the algorithm ranks the words by weight, according to their frequency in the patent's corpus and the document frequency containing this word. The score for each word is calculated using a formula such as: W e i g h t = T e r m F r e q u e n c y D o c u m e n t F r e q u e n c y = F r e q u e n c y o f t h e w o r d o r e x p r e s s i o n i n t h e T e x t S e a N u m b e r o f d o c u m e n t s c o n t a i n i n g t h e e x p r e s s i o n o r w o r d {\displaystyle Weight={\frac {Term\ Frequency}{Document\ Frequency}}={\frac {Frequency\ of\ the\ word\ or\ expression\ in\ the\ Text\ Sea}{Number\ of\ documents\ containing\ the\ expression\ or\ word}}} A frequently used word in several documents has less weight than a word used frequently in a few patents. Words under a minimum weight are eliminated, leaving a list of pertinent words or descriptors. Each patent is associated to the descriptors found in the selected document. Further, in the process of clusterisation, these descriptors are used as subsets, in which the patent are regrouped or as tags to place the patents in predetermined categories, for example keywords from International Patent Classifications. Four text parts can be processed with text-mining : Title Abstract Claim Patent Full-Text Software offer different combinations but title, abstract and claim are generally the most used, providing a good balance between interferences and relevancy. === Advantages === Text-mining can be used to narrow a search or quickly evaluate a patent corpus. For instance, if a query produces irrelevant documents, a multi-level clustering hierarchy identifies them in order to delete them and refine the search. Text-mining can also be used to create internal taxonomies specific to a corpus for possible mapping. == Visualisations == Allying patent analysis and informatic tools offers an overview of the environment through value-added visualisations. As patents contain structured and unstructured information, visualisations fall in two categories. Structured data can be rendered with data mining in macrothematic maps and statistical analysis. Unstructured information can be shown in like clouds, cluster maps and 2D keyword maps. === Data mining visualisation === === Text mining visualisation === === Visualisation for both data-mining and text-mining === Mapping visualisations can be used for both text-mining and data-mining results. == Uses == What patent visualisation can highlight: Competitors Partners New innovations Technologic environment description Networks Field application: R&D strategy management Competitive intelligence Licensing Strategy

    Read more →
  • Oblivion (2013 film)

    Oblivion (2013 film)

    Oblivion is a 2013 American epic post-apocalyptic science fiction action film produced and directed by Joseph Kosinski from a screenplay by Karl Gajdusek and Michael deBruyn, starring Tom Cruise in the main role alongside Morgan Freeman, Olga Kurylenko, Andrea Riseborough, Nikolaj Coster-Waldau, and Melissa Leo in supporting roles. Based on Kosinski's unpublished Radical Comics graphic novel of the same name, the film pays homage to 1970s sci-fi, and is a "love story" set in 2077 on an Earth desolated by an alien war; a maintenance technician on the verge of completing his mission finds a woman who survived from a space ship crash, leading him to question his purpose and discover the truth about the war. Oblivion premiered in Buenos Aires on March 26, 2013, and was released in theaters by Universal Pictures on April 19. The film grossed $286 million worldwide on a production budget of $120 million and received mixed reviews from critics. == Plot == In 2017, aliens known as Scavengers attack Earth and destroy the Moon, triggering global natural disasters. Although humanity wins the war using nuclear weapons, Earth is left uninhabitable. Sixty years later, the remnants of humanity have relocated to a colony on Saturn's moon Titan, except for Unit 49—technician Jack and his communications officer Victoria—who are scheduled to join them in two weeks. The pair oversee hydro rigs that convert seawater into fusion energy for the Tet, the last remaining human colony ship in orbit. Though Jack and Victoria are romantically involved and have had their memories erased for security reasons, Jack experiences recurring dreams of an unknown woman. He also secretly visits a hidden, verdant valley where he has built a lakeside cabin and collects relics of Earth's past. While investigating a missing drone—autonomous, highly advanced, and heavily armed machines—Jack is nearly captured by Scavengers. Later, he discovers the Scavengers are transmitting a signal into space. A NASA pod crash-lands at the signal's coordinates, carrying five humans in suspended animation, including the woman from Jack's dreams. A drone arrives and destroys four of the pods, but Jack rescues the remaining one and brings the unconscious woman to Unit 49's base. After reviving her, Jack and Victoria learn that the woman, Julia, has been in stasis aboard the Odyssey spaceship since 2017. Julia insists on recovering the ship's flight recorder. However, she and Jack are captured by Scavengers and brought to the Raven Rock Mountain Complex. Their leader, Malcolm, reveals that the Scavengers are actually surviving humans. Malcolm needs Jack to reprogram a captured drone to deliver a nuclear bomb, built from Odyssey's reactor, to the Tet. Jack refuses, so Malcolm releases him and Julia, urging him to seek the truth in the radiation zone, which is supposedly deadly and off-limits. Julia helps Jack recall that she is his wife, and fragments of his memories begin to return. When they arrive back at Unit 49, a devastated Victoria informs Sally, the Tet's mission controller, that she and Jack are no longer an "effective team." A drone activates and kills Victoria. Jack and Julia destroy the drone, but crash their aircraft inside the radiation zone. There, they encounter another version of Jack—"Jack-52"—who arrives to repair the drone. Jack subdues him, but Julia is seriously injured in the fight. Jack impersonates his clone to infiltrate Unit 52, meets Victoria-52, and steals medical supplies for Julia. They rest at his cabin. At Raven Rock, Malcolm reveals the truth: humanity lost the war, and the Tet is an alien machine intelligence harvesting Earth's resources. After the Moon's destruction, the Tet deployed thousands of clones of astronaut Jack Harper—brainwashed into obedience—to exterminate the remaining humans. Malcolm had assumed these clones were inhuman until witnessing Jack show interest in a discarded book, hinting at lingering humanity. Jack reprograms the captured drone, but it is destroyed in a surprise attack by other drones, leaving Malcolm badly wounded. Jack and Julia resolve to deliver the bomb themselves; Julia enters a stasis pod. En route, Jack listens to the Odyssey's flight recorder, which reveals the original Jack Harper and Victoria were astronauts sent to explore Titan before being confronted by the Tet. The pair were captured, but not before Jack ejected the remaining crew—including Julia—in stasis pods to protect them. Jack gains access to the Tet by claiming he is delivering Julia, as previously instructed. However, the stasis pod contains a dying Malcolm. Jack and Malcolm detonate the bomb, destroying the Tet and themselves. Julia later awakens at the cabin. Three years later, Julia lives there and it is revealed she had a daughter with Jack. A group of Raven Rock survivors arrives, alongside Jack-52, who has begun regaining fragments of his own lost identity. == Cast == Tom Cruise as Jack Harper—Tech 49, a technician who works to repair drones on Earth and questions his mission. Originally, he was the American commander of a mission en route to Titan who was captured by the Tet and cloned to fight humanity. Cruise also plays Jack Harper—Tech 52, a clone who seeks out Julia after the destruction of the Tet. Morgan Freeman as Malcolm Beech, an American veteran soldier and leader of a large community of scavengers, the human survivors of the alien Tet's attacks. Olga Kurylenko as Julia Rusakova Harper, Jack's wife and a Russian crew member on the Odyssey, who was sent back towards Earth by her husband to protect her from the initial contact with the Tet. Andrea Riseborough as Victoria "Vika" Olsen, Jack's communications partner and housemate. Originally, she was the British co-pilot of Jack's mission to Titan who was captured and cloned to assist in the Tet's war on humanity. Riseborough also plays a clone of Vika who Jack misleads to obtain medical supplies. Nikolaj Coster-Waldau as Sergeant Sykes, the main military commander of Beech's community of scavengers who is skeptical of Jack at first. Melissa Leo as the Tet, an alien artificial intelligence seeking to acquire Earth's natural resources and wipe out humanity. Leo also plays Sally, the mission director of Jack and Julia's mission to Titan; her likeness was copied by the Tet to serve as its visual and auditory representation. Zoë Bell as Kara, a soldier and member of the scavengers. == Production == === Development === Joseph Kosinski started the movie process by beginning work on a graphic novel called Oblivion featuring his story. While the completion of this would be teased to the public and the concept was used to pitch the movie, it was never finished and Kosinski claims he never intended to, stating it was "just a stage in the project [of film development]". Arvid Nelson was billed as co-writer and Radical Comics was attached as publisher. The novel was never finished; Kosinski explaining: "the partnership with Radical Comics allowed me to continue working on the story by developing a series of images and continuing to refine the story more over a period of years. Then I basically used all that development as a pitch kit to the studio. So even though we really never released it as an illustrated novel the story is being told as a film, which was always the intention." Walt Disney Pictures, which produced Kosinski's previous film Tron: Legacy (2010), acquired the Oblivion film adaptation rights from Radical Comics and Kosinski after a heated auction in August 2010. The film was a directing vehicle for Kosinski, with Barry Levine producing, and Jesse Berger executive producing. Other studios that made bids on the film were Paramount Pictures, 20th Century Fox, and Universal Pictures. Disney subsequently released the rights after realizing the PG-rated film they envisioned, in line with their family-oriented reputation, would require too many story changes. Universal, which had also bid for the original rights, then bought them from Kosinski and Radical and authorized a PG-13 film version. The film's script was originally written by Kosinski and William Monahan and underwent a first rewrite by Karl Gajdusek. When the film passed into Universal's hands, a final rewrite was done by Michael Arndt, under the pen name "Michael deBruyn". Universal was particularly appreciative of the script, saying, "It's one of the most beautiful scripts we've ever come across." The Bubble Ship operated by Cruise's main character, Jack 49, was inspired by the Bell 47 helicopter (often colloquially referred to as a "bubble cockpit" helicopter), a utilitarian 1947 vehicle with a transparent round canopy that Kosinski saw in the lobby of the Museum of Modern Art in Manhattan, and which he likened to a dragonfly. Daniel Simon, who previously worked with Kosinski as the lead vehicle designer on Tron: Legacy, was tasked with creating the Bubble Ship from this basis, incorporating elements evocative of an advanced fighter

    Read more →
  • Computer-automated design

    Computer-automated design

    Design Automation usually refers to electronic design automation, or Design Automation which is a Product Configurator. Extending Computer-Aided Design (CAD), automated design and Computer-Automated Design (CAutoD) are more concerned with a broader range of applications, such as automotive engineering, civil engineering, composite material design, control engineering, dynamic system identification and optimization, financial systems, industrial equipment, mechatronic systems, steel construction, structural optimisation, and the invention of novel systems. The concept of CAutoD perhaps first appeared in 1963, in the IBM Journal of Research and Development, where a computer program was written. to search for logic circuits having certain constraints on hardware design to evaluate these logics in terms of their discriminating ability over samples of the character set they are expected to recognize. More recently, traditional CAD simulation is seen to be transformed to CAutoD by biologically-inspired machine learning, including heuristic search techniques such as evolutionary computation, and swarm intelligence algorithms. == Guiding designs by performance improvements == To meet the ever-growing demand of quality and competitiveness, iterative physical prototyping is now often replaced by 'digital prototyping' of a 'good design', which aims to meet multiple objectives such as maximised output, energy efficiency, highest speed and cost-effectiveness. The design problem concerns both finding the best design within a known range (i.e., through 'learning' or 'optimisation') and finding a new and better design beyond the existing ones (i.e., through creation and invention). This is equivalent to a search problem in an almost certainly, multidimensional (multivariate), multi-modal space with a single (or weighted) objective or multiple objectives. == Normalized objective function: cost vs. fitness == Using single-objective CAutoD as an example, if the objective function, either as a cost function J ∈ [ 0 , ∞ ) {\displaystyle J\in [0,\infty )} , or inversely, as a fitness function f ∈ ( 0 , 1 ] {\displaystyle f\in (0,1]} , where f = J 1 + J {\displaystyle f={\tfrac {J}{1+J}}} , is differentiable under practical constraints in the multidimensional space, the design problem may be solved analytically. Finding the parameter sets that result in a zero first-order derivative and that satisfy the second-order derivative conditions would reveal all local optima. Then comparing the values of the performance index of all the local optima, together with those of all boundary parameter sets, would lead to the global optimum, whose corresponding 'parameter' set will thus represent the best design. However, in practice, the optimization usually involves multiple objectives and the matters involving derivatives are a lot more complex. == Dealing with practical objectives == In practice, the objective value may be noisy or even non-numerical, and hence its gradient information may be unreliable or unavailable. This is particularly true when the problem is multi-objective. At present, many designs and refinements are mainly made through a manual trial-and-error process with the help of a CAD simulation package. Usually, such a posteriori learning or adjustments need to be repeated many times until a ‘satisfactory’ or ‘optimal’ design emerges. == Exhaustive search == In theory, this adjustment process can be automated by computerised search, such as exhaustive search. As this is an exponential algorithm, it may not deliver solutions in practice within a limited period of time. == Search in polynomial time == One approach to virtual engineering and automated design is evolutionary computation such as evolutionary algorithms. === Evolutionary algorithms === To reduce the search time, the biologically-inspired evolutionary algorithm (EA) can be used instead, which is a (non-deterministic) polynomial algorithm. The EA based multi-objective "search team" can be interfaced with an existing CAD simulation package in a batch mode. The EA encodes the design parameters (encoding being necessary if some parameters are non-numerical) to refine multiple candidates through parallel and interactive search. In the search process, 'selection' is performed using 'survival of the fittest' a posteriori learning. To obtain the next 'generation' of possible solutions, some parameter values are exchanged between two candidates (by an operation called 'crossover') and new values introduced (by an operation called 'mutation'). This way, the evolutionary technique makes use of past trial information in a similarly intelligent manner to the human designer. The EA based optimal designs can start from the designer's existing design database, or from an initial generation of candidate designs obtained randomly. A number of finely evolved top-performing candidates will represent several automatically optimized digital prototypes. There are websites that demonstrate interactive evolutionary algorithms for design. allows you to evolve 3D objects online and have them 3D printed. allows you to do the same for 2D images.

    Read more →
  • Shyster (expert system)

    Shyster (expert system)

    SHYSTER is a legal expert system developed at the Australian National University in Canberra in 1993. It was written as the doctoral dissertation of James Popple under the supervision of Robin Stanton, Roger Clarke, Peter Drahos, and Malcolm Newey. A full technical report of the expert system, and a book further detailing its development and testing have also been published. SHYSTER emphasises its pragmatic approach, and posits that a legal expert system need not be based upon a complex model of legal reasoning in order to produce useful advice. Although SHYSTER attempts to model the way in which lawyers argue with cases, it does not attempt to model the way in which lawyers decide which cases to use in those arguments. SHYSTER is of a general design, permitting its operation in different legal domains. It was designed to provide advice in areas of case law that have been specified by a legal expert using a bespoke specification language. Its knowledge of the law is acquired, and represented, as information about cases. It produces its advice by examining, and arguing about, the similarities and differences between cases. It derives its name from Shyster: a slang word for someone who acts in a disreputable, unethical, or unscrupulous way, especially in the practice of law and politics. == Methods == SHYSTER is a specific example of a general category of legal expert systems, broadly defined as systems that make use of artificial intelligence (AI) techniques to solve legal problems. Legal AI systems can be divided into two categories: legal retrieval systems and legal analysis systems. SHYSTER belongs to the latter category of legal analysis systems. Legal analysis systems can be further subdivided into two categories: judgment machines and legal expert systems. SHYSTER again belongs to the latter category of legal expert systems. A legal expert system, as Popple uses the term, is a system capable of performing at a level expected of a lawyer: "AI systems which merely assist a lawyer in coming to legal conclusions or preparing legal arguments are not here considered to be legal expert systems; a legal expert system must exhibit some legal expertise itself." Designed to operate in more than one legal domain, and be of specific use to the common law of Australia, SHYSTER accounts for statute law, case law, and the doctrine of precedent in areas of private law. Whilst it accommodates statute law, it is primarily a case-based system, in contradistinction to rule-based systems like MYCIN. More specifically, it was designed in a manner enabling it to be linked with a rule-based system to form a hybrid system. Although case-based reasoning possesses an advantage over rule-based systems by the elimination of complex semantic networks, it suffers from intractable theoretical obstacles: without some further theory it cannot be predicted what features of a case will turn out to be relevant. Users of SHYSTER therefore require some legal expertise. Richard Susskind argues that "jurisprudence can and ought to supply the models of law and legal reasoning that are required for computerized [sic] implementation in the process of building all expert systems in law". Popple, however, believes jurisprudence is of limited value to developers of legal expert systems. He posits that a lawyer must have a model of the law (maybe unarticulated) which includes assumptions about the nature of law and legal reasoning, but that model need not rest on basic philosophical foundations. It may be a pragmatic model, developed through experience within the legal system. Many lawyers perform their work with little or no jurisprudential knowledge, and there is no evidence to suggest that they are worse, or better, at their jobs than lawyers well-versed in jurisprudence. The fact that many lawyers have mastered the process of legal reasoning, without having been immersed in jurisprudence, suggests that it may indeed be possible to develop legal expert systems of good quality without jurisprudential insight. As a pragmatic legal expert system SHYSTER is the embodiment of this belief. A further example of SHYSTER’s pragmatism is its simple knowledge representation structure. This structure was designed to facilitate specification of different areas of case law using a specification language. Areas of case law are specified in terms of the cases and attributes of importance in those areas. SHYSTER weights its attributes and checks for dependence between them. In order to choose cases upon which to construct its opinions, SHYSTER calculates distances between cases and uses these distances to determine which of the leading cases are nearest to the instant case. To this end SHYSTER can be seen to adopt and expand upon nearest neighbor search methods used in pattern recognition. These nearest cases are used to produce an argument (based on similarities and differences between the cases) about the likely outcome in the instant case. This argument relies on the doctrine of precedent; it assumes that the instant case will be decided the same way as was the nearest case. SHYSTER then uses information about these nearest cases to construct a report. The report that SHYSTER generates makes a prediction and justifies that prediction by reference only to cases and their similarities and differences: the calculations that SHYSTER performs in coming to its opinion do not appear in that opinion. Safeguards are employed to warn users if SHYSTER doubts the veracity of its advice. == Results == SHYSTER was tested in four different and disparate areas of case law. Four specifications were written, each representing an area of Australian law: an aspect of the law of trover; the meaning of "authorization [sic]" in copyright law of Australia; the categorisation of employment contracts; and the implication of natural justice in administrative decision-making. SHYSTER was evaluated under five headings: its usefulness, its generality, the quality of its advice, its limitations, and possible enhancements that could be made to it. Despite its simple knowledge representation structure, it has shown itself capable of producing good advice, and its simple structure has facilitated the specification of different areas of law. Appreciating the difficulties encountered by legal expert systems developers in adequately representing legal knowledge can assist in appreciating the shortcomings of digital rights management technologies. Some academics believe future digital rights management systems may become sophisticated enough to permit exceptions to copyright law. To this end SHYSTER's attempt to model "authorization [sic]" in the Copyright Act can be viewed as pioneering work in this field. The term "authorization [sic]" is undefined in the Copyright Act. Consequently, a number of cases have been before the courts seeking answers as to what conduct amounts to authorisation. The main contexts in which the issue has arisen are analogous to permitted exceptions to copyright currently prevented by most digital rights management technologies: "home taping of recorded materials, photocopying in educational institutions and performing works in public". When applied to one case concerning compact cassettes, SHYSTER successfully agreed that Amstrad did not authorise the infringement. 'shyster-myci'n Popple highlighted the most obvious avenue of future research using SHYSTER as the development of a rule-based system, and the linking together of that rule-based system with the existing case-based system to form a hybrid system. This intention was eventually realised by Thomas O’Callaghan, the creator of SHYSTER-MYCIN: a hybrid legal expert system first presented at ICAIL '03, 24–28 June 2003 in Edinburgh, Scotland. MYCIN is an existing medical expert system, which was adapted for use with SHYSTER. MYCIN’s controversial "certainty factor" is not used in SHYSTER-MYCIN. The reason for this is the difficulty in scientifically establishing how certain a fact is in a legal domain. The rule-based approach of the MYCIN part is used to reason with the provisions of an Act of Parliament only. This hybrid system enables the case-based system (SHYSTER) to determine open textured concepts when required by the rule-based system (MYCIN). The ultimate conclusion of this joint endeavour is that a hybrid approach is preferred in the creation of legal expert systems where "it is appropriate to use rule-based reasoning when dealing with statutes, and…case-based reasoning when dealing with cases".

    Read more →
  • Linux color management

    Linux color management

    Linux color management has the same goal as the color management systems (CMS) for other operating systems, which is to achieve the best possible color reproduction throughout an imaging workflow from its source (camera, video, scanner, etc.), through imaging software (Digikam, darktable, RawTherapee, GIMP, Krita, Scribus, etc.), and finally onto an output medium (monitor, video projector, printer, etc.). In particular, color management attempts to enable color consistency across media and throughout a color-managed workflow. Linux color management relies on the use of accurate ICC (International Color Consortium) and DCP (DNG Color Profile) profiles describing the behavior of input and output devices, and color-managed applications that are aware of these profiles. These applications perform gamut conversions between device profiles and color spaces. Gamut conversions, based on accurate device profiles, are the essence of color management. Historically, color management was not an initial design consideration of the X Window System on which much of Linux graphics support rests, and thus color-managed workflows have been somewhat more challenging to implement on Linux than on other OS's such as Microsoft Windows or macOS. This situation is now being progressively remedied, and color management under Linux, while functional, has not yet acquired mature status. Although it is now possible to obtain a consistent color-managed workflow under Linux, certain problems still remain: The absence of a central user control panel for color settings. Some hardware devices for color calibration lack Linux drivers, firmware or accessory data. Since ICC color profiles are written to an open specification, they are compatible across operating systems. Hence, a profile produced on one OS should work on any other OS given the availability of the necessary software to read it and perform the gamut conversions. This can be used as a workaround for the lack of support for certain spectrophotometers or colorimeters under Linux: one can simply produce a profile on a different OS and then use it in a Linux workflow. Additionally, certain hardware, such as most printers and certain monitors, can be calibrated under another OS and then used in a fully color-managed workflow on Linux. The popular Ubuntu Linux distribution added initial color management in the 11.10 release (the "Oneiric Ocelot" release). == Requirements for a color-managed workflow == Accurate device profiles obtained with source or output characterization software. Correctly loaded video card lookup tables (LUTs) (or monitor profiles that do not require LUT adjustments). Color-managed applications that are configured to use a correct monitor profile and input/output profiles, with support for control over the rendering intent and black point compensation. Calibration and profiling requires: for input devices (scanner, camera, etc.) a color target which the profiling software will compare to the manufacturer-provided color values of the target. or for output devices (monitor, printer, etc.) a reading with a specific device (spectrophotometer, colorimeter or spectrocolorimeter) of the color patch values and comparing the measured values against the values originally sent for output. === Monitor calibration and profiling === One of the critical elements in any color-managed workflow is the monitor, because, at one step or another, handling and making color adaptation through imaging software is required for most images, thus the ability of the monitor to present accurate colors is crucial. Monitor color management consists of calibration and profiling. The first step, calibration, is done by adjusting the monitor controls and the output of the graphics card (via calibration curves) to match user-definable characteristics, such as brightness, white point and gamma. The calibration settings are stored in a .cal file. The second step, profiling (characterization), involves measuring the calibrated display's response and recording it in a color profile. The profile is stored in an .icc file ("ICC file"). For convenience, the calibration settings are usually stored together with the profile in the ICC file. Note that .icm files are identical to .icc files - the difference is only in the name. Seeing correct colors requires using a monitor profile-aware application, together with the same calibration used when profiling the monitor. Calibration alone does not yield accurate colors. If a monitor was calibrated before it was profiled, the profile will only yield correct colors when used on the monitor with the same calibration (the same monitor control adjustments and the same calibration curves loaded into the video card's lookup table). macOS has built-in support for loading calibration curves and installing a system-wide color profile. Windows 7 onward allows loading calibration curves, though this functionality must be enabled manually. Linux and older versions of Windows require using a standalone LUT loader. === Device profiles === ICC profiles are cross-platform and can thus be created on other operating systems and used under Linux. Monitor profiles, however, require some additional attention. Since a monitor profile depends both on the monitor itself and on the video card, a monitor profile should only be used with the same monitor and video card with which it was created. The monitor settings should not be adjusted after creating the profile. In addition, since most calibration software use LUT adjustments during calibration, the corresponding LUTs must be loaded every time the display server (X11, Wayland) is started (e.g. with every graphical login). In the unlikely case of a colorimeter being unsupported by Linux, a profile created under Windows or macOS can be used under Linux. === Display-channel lookup tables === There are two approaches to loading display channel LUTs: Create a profile that does not modify video card LUTs and thus does not require LUTs be loaded later on. Ideally, this approach would rely on DDC-capable monitors—the internal monitor settings of which are set via calibration software. Unfortunately, monitors capable of making these adjustments through DDC are not common and are generally expensive. There is only one calibration software on Linux that can interact with a DDC monitor. For mainstream monitors, a couple of options exist: BasICColor software, which works with most colorimeters on the market, allows one to adjust display output via the monitor interface, and then to choose a "Profile, do not calibrate" option. By doing this, one can create a profile that does not require video card LUT adjustments. For EyeOne devices, EyeOne Match allows the user to calibrate to "Native" gamma and white point targets, which results in the LUT adjustment curves displayed after the calibration as a simple, linear 1:1 mapping (a straight line from corner to corner). Both BasICColor and EyeOne Match do not presently run under Linux but they are capable of creating a profile that does not require LUT adjustments. Use an LUT loader to actually load the LUT adjustments contained within the profile prepared during calibration. According to the documentation, these loaders do not modify the video card LUT by itself, but achieve the same type of adjustment by modifying the X server gamma ramp. Loaders are available for Linux distributions that use X.org or XFree86—the two most popular X servers on Linux. Other X servers are not guaranteed to work with the currently available loaders. There are two LUT loaders available for Linux: Xcalib is one such loader, and although it is a command-line utility, it is quite easy to use. dispwin is a part of Argyll CMS. If, for any reason, the LUT cannot be loaded, it is still recommended to go through the initial stages of calibration where a user is asked by calibration software to make some manual adjustments to the monitor, as this will often improve display linearity and also provide information on its color temperature. This is especially recommended for CRT monitors. === Color-managed applications === In ICC-aware applications, it is important to make sure the correct profiles are assigned to devices, mainly to the monitor and the printer. Some Linux applications can auto-detect the monitor profile, while others requires that it is specified manually. Although there is no designated place to store device profiles on Linux, /usr/share/color/icc/ has become the de facto standard. Most applications running under WINE have not been fully tested for color accuracy. While 8-bpp programs can have some color resolution difficulties due to depth conversion errors, colors in higher-depth applications should be accurate, as long as those programs perform their gamut conversions based on the same monitor profile as that used for loading the LUT, granted that the corresponding LUT adjustments are loaded. == List of color-managed applications == darktabl

    Read more →
  • List of Tesla Autopilot crashes

    List of Tesla Autopilot crashes

    Tesla Autopilot, a Level 2 advanced driver assistance system (ADAS), was released in October 2015 and the first fatal crashes involving the system occurred less than one year later. The fatal crashes attracted attention from news publications and United States government agencies, including the National Transportation Safety Board (NTSB) and National Highway Traffic Safety Administration (NHTSA), which has argued the Tesla Autopilot death rate is higher than the reported estimates. In addition to fatal crashes, there have been many nonfatal ones. Causes behind the incidents include the ADAS failing to recognize other vehicles, insufficient Autopilot driver engagement, and violating the operational design domain. As of October 2025, there have been hundreds of nonfatal incidents involving versions of Autopilot and sixty-five reported fatalities, fifty-four of which NHTSA investigations or expert testimony later verified and two that NHTSA's Office of Defect Investigations determined as happening during the engagement of Full Self-Driving (FSD) after 2022. Collectively, these cases culminated in a general recall in December 2023 of all vehicles equipped with Autopilot, which Tesla claims it resolved by an over-the-air software update. Immediately after closing its investigation in April 2024, NHTSA opened a recall query to determine the effectiveness of the recall. == Notable fatal crashes == === Handan, Hebei, China (January 20, 2016) === On January 20, 2016, Gao Yaning, the driver of a Tesla Model S in Handan, Hebei, China, was killed when his car crashed into a stationary truck. The Tesla was following a car in the far left lane of a multi-lane highway; the car in front moved to the right lane to avoid a truck stopped on the left shoulder, and the Tesla, which the driver's father believes was in Autopilot mode, did not slow before colliding with the stopped truck. According to footage captured by a dashboard camera, the stationary street sweeper on the left side of the expressway partially extended into the far left lane, and the driver did not appear to respond to the unexpected obstacle. Initially, Yaning was held responsible for the collision by local traffic police and, in September 2016, his family filed a lawsuit in July against the Tesla dealer who sold the car. The family's lawyer stated the suit was intended "to let the public know that self-driving technology has some defects. We are hoping Tesla when marketing its products, will be more cautious. Do not just use self-driving as a selling point for young people." Tesla released a statement which said they "have no way of knowing whether or not Autopilot was engaged at the time of the crash" since the car telemetry could not be retrieved remotely due to damage caused by the crash. In 2018, the lawsuit was stalled because telemetry was recorded locally to a SD card and was not able to be given to Tesla, who provided a decoding key to a third party for independent review. Tesla stated that "while the third-party appraisal is not yet complete, we have no reason to believe that Autopilot on this vehicle ever functioned other than as designed." Chinese media later reported that the family sent the information from that card to Tesla, which admitted Autopilot was engaged two minutes before the crash. Tesla since then removed the term "Autopilot" from its Chinese website. === Williston, Florida, US (May 7, 2016) === On May 7, 2016, Tesla driver Joshua Brown was killed in a crash with an 18-wheel tractor-trailer in Williston, Florida. By late June 2016, the NHTSA opened a formal investigation into the fatal autonomous accident, working with the Florida Highway Patrol. According to the NHTSA, preliminary reports indicate the crash occurred when the tractor-trailer made a left turn in front of the 2015 Tesla Model S at an intersection on a non-controlled access highway, and the car failed to apply the brakes. The car continued to travel after passing under the truck's trailer. The Tesla was eastbound in the rightmost lane of US 27, and the westbound tractor-trailer was turning left at the intersection with NE 140th Court, approximately 1 mi (1.6 km) west of Williston; the posted speed limit is 65 mph (105 km/h). The diagnostic log of the Tesla indicated it was traveling at a speed of 74 mi/h (119 km/h) when it collided with and traveled under the trailer, which was not equipped with a side underrun protection system. A reconstruction of the accident estimated the driver would have had approximately 10.4 seconds to detect the truck and take evasive action. The underride collision sheared off the Tesla's greenhouse, destroying everything above the beltline, and caused fatal injuries to the driver. In the approximately nine seconds after colliding with the trailer, the Tesla traveled another 886.5 feet (270.2 m) and came to rest after colliding with two chain-link fences and a utility pole. The NHTSA's preliminary evaluation was opened to examine the design and performance of any automated driving systems in use at the time of the crash, which involves a population of an estimated 25,000 Model S cars. On July 8, 2016, the NHTSA requested Tesla Inc. to hand over to the agency detailed information about the design, operation and testing of its Autopilot technology. The agency also requested details of all design changes and updates to Autopilot since its introduction, and Tesla's planned updates scheduled for the next four months. According to Tesla, "neither autopilot nor the driver noticed the white side of the tractor-trailer against a brightly lit sky, so the brake was not applied." The car attempted to drive full speed under the trailer, "with the bottom of the trailer impacting the windshield of the Model S". Tesla also stated that this was Tesla's first known Autopilot-related death in over 130 million miles (208 million km) driven by its customers while Autopilot was activated. According to Tesla there is a fatality every 94 million miles (150 million km) among all type of vehicles in the U.S. It is estimated that billions of miles will need to be traveled before Tesla Autopilot can claim to be safer than humans with statistical significance. Researchers say that Tesla and others need to release more data on the limitations and performance of automated driving systems if self-driving cars are to become safe and understood enough for mass-market use. The truck's driver told the Associated Press that he could hear a Harry Potter movie playing in the crashed car, and said the car was driving so quickly that "he went so fast through my trailer I didn't see him. [The film] was still playing when he died and snapped a telephone pole a quarter-mile down the road." According to the Florida Highway Patrol, they found in the wreckage an aftermarket portable DVD player. (It is not possible to watch videos on the Model S touchscreen display while the car is moving.) A laptop computer was recovered during the post-crash examination of the wreck, along with an adjustable vehicle laptop mount attached to the front passenger's seat frame. The NHTSA concluded the laptop was probably mounted, and the driver may have been distracted at the time of the crash. In January 2017, the NHTSA Office of Defects Investigations (ODI) released a preliminary evaluation, finding that the driver in the crash had seven seconds to see the truck and identifying no defects in the Autopilot system; the ODI also found that the Tesla car crash rate dropped by 40 percent after Autosteer installation, but later also clarified that it did not assess the effectiveness of this technology or whether it was engaged in its crash rate comparison. The NHTSA Special Crash Investigation team published its report in January 2018. According to the report, for the drive leading up to the crash, the driver engaged Autopilot for 37 minutes and 26 seconds, and the system provided 13 "hands not detected" alerts, to which the driver responded after an average delay of 16 seconds. The report concluded "Regardless of the operational status of the Tesla's ADAS technologies, the driver was still responsible for maintaining ultimate control of the vehicle. All evidence and data gathered concluded that the driver neglected to maintain complete control of the Tesla leading up to the crash." In July 2016, the NTSB announced it had opened a formal investigation into the fatal accident while Autopilot was engaged. The NTSB is an investigative body that only has the power to make policy recommendations. An agency spokesman said, "It's worth taking a look and seeing what we can learn from that event, so that as that automation is more widely introduced we can do it in the safest way possible." The NTSB opens annually about 25 to 30 highway investigations. In September 2017, the NTSB released its report, determining that "the probable cause of the Williston, Florida, crash was the truck driver's failure to yield the right of way to the car, combine

    Read more →
  • Generative literature

    Generative literature

    Generative literature is poetry or fiction that is automatically generated, often using computers. It is a genre of electronic literature, and also related to generative art. John Clark's Latin Verse Machine (1830–1843) is probably the first example of mechanised generative literature, while Christopher Strachey's love letter generator (1952) is the first digital example. With the large language models (LLMs) of the 2020s, generative literature is becoming increasingly common. == Definitions == Hannes Bajohr defines generative literature as literature involving "the automatic production of text according to predetermined parameters, usually following a combinatory, sometimes aleatory logic, and it emphasizes the production rather than the reception of the work (unlike, say, hypertext)." In his book Electronic Literature, Scott Rettberg connects generative literature to avant-garde literary movements like Dada, Surrealism, Oulipo and Fluxus. Bajohr argues that conceptual art is also an important reference. == Paradigms of generative literature == Bajohr describes two main paradigms of generative literature: the sequential paradigm, where the text generation is "executed as a sequence of rule-steps" and employs linear algorithms, and the connectionist paradigm, which is based on neural nets. The latter leads to what Bajohr calls a algorithmic empathy: "a non-anthropocentric empathy aimed not at the psychological states of the artists but at understanding the process of the work’s material production." == Poetry generation == The first examples of automated generative literature are poetry: John Clark's mechanical Latin Verse Machine (1830–1843) produced lines of hexameter verse in Latin, and Christopher Strachey's love letter generator (1952), programmed on the Manchester Mark 1 computer, generated short, satirical love letters. Examples of generative poetry using artificial neural networks include David Jhave Johnston's ReRites. == Narrative generation == Story generators have often followed specific narratological theories of how stories are constructed. An early example is Grimes' Fairy Tales, the "first to take a grammar-based approach and the first to operationalize Propp's famous model." Mike Sharples and Rafael Peréz y Peréz's book Story Machines gives a detailed history of story generation. Storyland by Nanette Wylde is an example of generative narrative. Jonathan Baillehache compares Storyland to Surrealist writing. Baillehache states, "When compared to earlier uses of chance operation in literature, a piece like this one resembles some of the automatic writings produced by André Breton and Philippe Soupault in their collective work The Magnetic Fields. . . The difference between Nanette Wylde’s Storyland and Breton and Soupault’s Magnetic Fields is that the former is produced according to a computational algorithm involving randomizers and user interaction, and the latter by two free-wheeling human subjects."

    Read more →
  • Vagueness

    Vagueness

    In linguistics and philosophy, a vague predicate is one which gives rise to borderline cases. For example, the English adjective "tall" is vague since it is not clearly true or false for someone of middling height. By contrast, the word "prime" is not vague since every number is definitively either prime or not. Vagueness is commonly diagnosed by a predicate's ability to give rise to the sorites paradox. Vagueness is separate from ambiguity, in which an expression has multiple denotations. For instance the word "bank" is ambiguous since it can refer either to a river bank or to a financial institution, but there are no borderline cases between both interpretations. Vagueness is a major topic of research in philosophical logic, where it serves as a potential challenge to classical logic. Work in formal semantics has sought to provide a compositional semantics for vague expressions in natural language. Work in philosophy of language has addressed implications of vagueness for the theory of meaning, while metaphysicists have considered whether reality itself is vague. == Importance == The concept of vagueness has philosophical importance. Suppose one wants to come up with a definition of "right" in the moral sense. One wants a definition to cover actions that are clearly right and exclude actions that are clearly wrong, but what does one do with the borderline cases? Surely, there are such cases. Some philosophers say that one should try to come up with a definition that is itself unclear on just those cases. Others say that one has an interest in making his or her definitions more precise than ordinary language, or his or her ordinary concepts, themselves allow; they recommend one advances precising definitions. === In law === Vagueness is also a problem which arises in law, and in some cases, judges have to arbitrate regarding whether a borderline case does, or does not, satisfy a given vague concept. Examples include disability (how much loss of vision is required before one is legally blind?), human life (at what point from conception to birth is one a legal human being, protected for instance by laws against murder?), adulthood (most familiarly reflected in legal ages for driving, drinking, voting, consensual sex, etc.), race (how to classify someone of mixed racial heritage), etc. Even such apparently unambiguous concepts such as biological sex can be subject to vagueness problems, not just from transsexuals' gender transitions but also from certain genetic conditions which can give an individual mixed male and female biological traits (see intersex). In the common law system, vagueness is a possible legal defence against by-laws and other regulations. The legal principle is that delegated power cannot be used more broadly than the delegator intended. Therefore, a regulation may not be so vague as to regulate areas beyond what the law allows. Any such regulation would be "void for vagueness" and unenforceable. This principle is sometimes used to strike down municipal by-laws that forbid "explicit" or "objectionable" contents from being sold in a certain city; courts often find such expressions to be too vague, giving municipal inspectors discretion beyond what the law allows. In the US this is known as the vagueness doctrine and in Europe as the principle of legal certainty. === In science === Many scientific concepts are of necessity vague, for instance species in biology cannot be precisely defined, owing to unclear cases such as ring species. Nonetheless, the concept of species can be clearly applied in the vast majority of cases. As this example illustrates, to say that a definition is "vague" is not necessarily a criticism. Consider those animals in Alaska that are the result of breeding huskies and wolves: are they dogs? It is not clear: they are borderline cases of dogs. This means one's ordinary concept of doghood is not clear enough to let us rule conclusively in this case. == Approaches == The philosophical question of what the best theoretical treatment of vagueness is—which is closely related to the problem of the paradox of the heap, a.k.a. sorites paradox—has been the subject of much philosophical debate. === Fuzzy logic === One theoretical approach is that of fuzzy logic, developed by American mathematician Lotfi Zadeh. Fuzzy logic proposes a gradual transition between "perfect falsity", for example, the statement "Bill Clinton is bald", to "perfect truth", for, say, "Patrick Stewart is bald". In ordinary logics, there are only two truth-values: "true" and "false". The fuzzy perspective differs by introducing an infinite number of truth-values along a spectrum between perfect truth and perfect falsity. Perfect truth may be represented by "1", and perfect falsity by "0". Borderline cases are thought of as having a "truth-value" anywhere between 0 and 1 (for example, 0.6). Advocates of the fuzzy logic approach have included K. F. Machina (1976) and Dorothy Edgington (1993). === Supervaluationism === Another theoretical approach is known as "supervaluationism". This approach has been defended by Kit Fine and Rosanna Keefe. Fine argues that borderline applications of vague predicates are neither true nor false, but rather are instances of "truth value gaps". He defends an interesting and sophisticated system of vague semantics, based on the notion that a vague predicate might be "made precise" in many alternative ways. This system has the consequence that borderline cases of vague terms yield statements that are neither true, nor false. Given a supervaluationist semantics, one can define the predicate "supertrue" as meaning "true on all precisifications". This predicate will not change the semantics of atomic statements (e.g. "Frank is bald", where Frank is a borderline case of baldness), but does have consequences for logically complex statements. In particular, the tautologies of sentential logic, such as "Frank is bald or Frank is not bald", will turn out to be supertrue, since on any precisification of baldness, either "Frank is bald" or "Frank is not bald" will be true. Since the presence of borderline cases seems to threaten principles like this one (excluded middle), the fact that supervaluationism can "rescue" them is seen as a virtue. === Subvaluationism === Subvaluationism is the logical dual of supervaluationism, and has been defended by Dominic Hyde (2008) and Pablo Cobreros (2011). Whereas the supervaluationist characterises truth as 'supertruth', the subvaluationist characterises truth as 'subtruth', or "true on at least some precisifications". Subvaluationism proposes that borderline applications of vague terms are both true and false. It thus has "truth-value gluts". According to this theory, a vague statement is true if it is true on at least one precisification and false if it is false under at least one precisification. If a vague statement comes out true under one precisification and false under another, it is both true and false. Subvaluationism ultimately amounts to the claim that vagueness is a truly contradictory phenomenon. Of a borderline case of "bald man" it would be both true and false to say that he is bald, and both true and false to say that he is not bald. === Epistemicist view === A fourth approach, known as "the epistemicist view", has been defended by Timothy Williamson (1994), R. A. Sorensen (1988) and (2001), and Nicholas Rescher (2009). They maintain that vague predicates do, in fact, draw sharp boundaries, but that one cannot know where these boundaries lie. One's confusion about whether some vague word does or does not apply in a borderline case is due to one's ignorance. For example, in the epistemicist view, there is a fact of the matter, for every person, about whether that person is old or not old; some people are ignorant of this fact. === As a property of objects === One possibility is that one's words and concepts are perfectly precise, but that objects themselves are vague. Consider Peter Unger's example of a cloud (from his famous 1980 paper, "The Problem of the Many"): it is not clear where the boundary of a cloud lies; for any given bit of water vapor, one can ask whether it is part of the cloud or not, and for many such bits, one will not know how to answer. Hence, perhaps such a term as 'cloud' is not itself vague, but rather precisely denotes a vague object. This strategy has occasionally been poorly received; most notably, in Gareth Evans' short paper "Can There Be Vague Objects?" (1978), wherein an argument is examined which appears to show that vague identity-statements are impossible (i.e., result in logical incoherence). David Lewis explains that the reader is intended to conclude, with Evans, that—since there clearly are, in fact, meaningful vague identities—any purported proof to the contrary cannot be right; and as the proof relies upon the premise that vague terms precisely denote vague objects, but fails under the view that vague terms reflect a merel

    Read more →
  • Cost-sensitive machine learning

    Cost-sensitive machine learning

    Cost-sensitive machine learning is an approach within machine learning that considers varying costs associated with different types of errors. This method diverges from traditional approaches by introducing a cost matrix, explicitly specifying the penalties or benefits for each type of prediction error. The inherent difficulty which cost-sensitive machine learning tackles is that minimizing different kinds of classification errors is a multi-objective optimization problem. == Overview == Cost-sensitive machine learning optimizes models based on the specific consequences of misclassifications, making it a valuable tool in various applications. It is especially useful in problems with a high imbalance in class distribution and a high imbalance in associated costs Cost-sensitive machine learning introduces a scalar cost function in order to find one (of multiple) Pareto optimal points in this multi-objective optimization problem (similar to the Weighted sum model) == Cost Matrix == The cost matrix is a crucial element within cost-sensitive modeling, explicitly defining the costs or benefits associated with different prediction errors in classification tasks. Represented as a table, the matrix aligns true and predicted classes, assigning a cost value to each combination. For instance, in binary classification, it may distinguish costs for false positives and false negatives. The utility of the cost matrix lies in its application to calculate the expected cost or loss. The formula, expressed as a double summation, utilizes joint probabilities: Expected Loss = ∑ i ∑ j P ( Actual i , Predicted j ) ⋅ Cost Actual i , Predicted j {\displaystyle {\text{Expected Loss}}=\sum _{i}\sum _{j}P({\text{Actual}}_{i},{\text{Predicted}}_{j})\cdot {\text{Cost}}_{{\text{Actual}}_{i},{\text{Predicted}}_{j}}} Here, P ( Actual i , Predicted j ) {\displaystyle P({\text{Actual}}_{i},{\text{Predicted}}_{j})} denotes the joint probability of actual class i {\displaystyle i} and predicted class j {\displaystyle j} , providing a nuanced measure that considers both the probabilities and associated costs. This approach allows practitioners to fine-tune models based on the specific consequences of misclassifications, adapting to scenarios where the impact of prediction errors varies across classes. == Applications == === Fraud Detection === In the realm of data science, particularly in finance, cost-sensitive machine learning is applied to fraud detection. By assigning different costs to false positives and false negatives, models can be fine-tuned to minimize the overall financial impact of misclassifications. === Medical Diagnostics === In healthcare, cost-sensitive machine learning plays a role in medical diagnostics. The approach allows for customization of models based on the potential harm associated with misdiagnoses, ensuring a more patient-centric application of machine learning algorithms. == Challenges == A typical challenge in cost-sensitive machine learning is the reliable determination of the cost matrix which may evolve over time. == Literature == Cost-Sensitive Machine Learning. USA, CRC Press, 2011. ISBN 9781439839287 Abhishek, K., Abdelaziz, D. M. (2023). Machine Learning for Imbalanced Data: Tackle Imbalanced Datasets Using Machine Learning and Deep Learning Techniques. (n.p.): Packt Publishing. ISBN 9781801070881

    Read more →
  • Algorithmic accountability

    Algorithmic accountability

    Algorithmic accountability refers to the allocation of responsibility for the consequences of real-world actions influenced by algorithms used in decision-making processes. Ideally, algorithms should be designed to eliminate bias from their decision-making outcomes. This means they ought to evaluate only relevant characteristics of the input data, avoiding distinctions based on attributes that are generally inappropriate in social contexts, such as an individual's ethnicity in legal judgments. However, adherence to this principle is not always guaranteed, and there are instances where individuals may be adversely affected by algorithmic decisions. Responsibility for any harm resulting from a machine's decision may lie with the algorithm itself or with the individuals who designed it, particularly if the decision resulted from bias or flawed data analysis inherent in the algorithm's design. == Algorithm usage == Algorithms are widely utilized across various sectors of society that incorporate computational techniques in their control systems. These applications span numerous industries, including but not limited to medical, transportation, and payment services. In these contexts, algorithms perform functions such as: Approving or denying credit card applications; Approving or denying immigrant visas; Determining which taxpayers will be audited on their income taxes; Managing systems that control self-driving cars on a highway; Scoring individuals as potential criminals for use in legal proceedings; Search engines that match and rank database and internet search results; Recommendation systems that filter which news, entertainment, or purchase items are featured in a feed; Market-making algorithms that match sellers and buyers, such as in transportation (ride-hailing) or financial platforms. However, the implementation of these algorithms can be complex and opaque. Generally, algorithms function as "black boxes," meaning that the specific processes an input undergoes during execution are often not transparent, with users typically only seeing the resulting output. This lack of transparency raises concerns about potential biases within the algorithms, as the parameters influencing decision-making may not be well understood. The outputs generated can lead to perceptions of bias, especially if individuals in similar circumstances receive different results. According to Nicholas Diakopoulos: But these algorithms can make mistakes. They have biases. Yet they sit in opaque black boxes, their inner workings, their inner “thoughts” hidden behind layers of complexity. We need to get inside that black box, to understand how they may be exerting power on us, and to understand where they might be making unjust mistakes == Wisconsin Supreme Court case == Algorithms are prevalent across various fields and significantly influence decisions that affect the population at large. Their underlying structures and parameters often remain unknown to those impacted by their outcomes. A notable case illustrating this issue is a recent ruling by the Wisconsin Supreme Court concerning "risk assessment" algorithms used in criminal justice. The court determined that scores generated by such algorithms, which analyze multiple parameters from individuals, should not be used as a determining factor for arresting an accused individual. Furthermore, the court mandated that all reports submitted to judges must include information regarding the accuracy of the algorithm used to compute these scores. This ruling is regarded as a noteworthy development in how society should manage software that makes consequential decisions, highlighting the importance of reliability, particularly in complex settings like the legal system. The use of algorithms in these contexts necessitates a high degree of impartiality in processing input data. However, experts note that there is still considerable work to be done to ensure the accuracy of algorithmic results. Questions about the transparency of data processing continue to arise, which raises issues regarding the appropriateness of the algorithms and the intentions of their designers. == Controversies == A notable instance of potential algorithmic bias is highlighted in an article by The Washington Post regarding the ride-hailing service Uber. An analysis of collected data revealed that estimated waiting times for users varied based on the neighborhoods in which they resided. Key factors influencing these discrepancies included the predominant ethnicity and average income of the area. Specifically, neighborhoods with a majority white population and higher economic status tended to have shorter waiting times, while those with more diverse ethnic compositions and lower average incomes experienced longer waits. It’s important to clarify that this observation reflects a correlation identified in the data, rather than a definitive cause-and-effect relationship. No value judgments are made regarding the behavior of the Uber app in these cases. In TechCrunch website, Hemant Taneja wrote: Concern about “black box” algorithms that govern our lives has been spreading. New York University’s Information Law Institute hosted a conference on algorithmic accountability, noting: “Scholars, stakeholders, and policymakers question the adequacy of existing mechanisms governing algorithmic decision-making and grapple with new challenges presented by the rise of algorithmic power in terms of transparency, fairness, and equal treatment.” Yale Law School’s Information Society Project is studying this, too. “Algorithmic modeling may be biased or limited, and the uses of algorithms are still opaque in many critical sectors,” the group concluded. == Possible solutions == Discussions among experts have sought viable solutions to understand the operations of algorithms, often referred to as "black boxes." It is generally proposed that companies responsible for developing and implementing these algorithms should ensure their reliability by disclosing the internal processes of their systems. Hemant Taneja, writing for TechCrunch, emphasizes that major technology companies, such as Google, Amazon, and Uber, must actively incorporate algorithmic accountability into their operations. He suggests that these companies should transparently monitor their own systems to avoid stringent regulatory measures. One potential approach is the introduction of regulations in the tech sector to enforce oversight of algorithmic processes. However, such regulations could significantly impact software developers and the industry as a whole. It may be more beneficial for companies to voluntarily disclose the details of their algorithms and decision-making parameters, which could enhance the trustworthiness of their solutions. Another avenue discussed is the possibility of self-regulation by the companies that create these algorithms, allowing them to take proactive steps in ensuring accountability and transparency in their operations. In TechCrunch website, Hemant Taneja wrote: There’s another benefit — perhaps a huge one — to software-defined regulation. It will also show us a path to a more efficient government. The world’s legal logic and regulations can be coded into software and smart sensors can offer real-time monitoring of everything from air and water quality, traffic flows and queues at the DMV. Regulators define the rules, technologist create the software to implement them and then AI and ML help refine iterations of policies going forward. This should lead to much more efficient, effective governments at the local, national and global levels.

    Read more →
  • Proof assistant

    Proof assistant

    In computer science and mathematical logic, a proof assistant or interactive theorem prover is a software tool to assist with the development of formal proofs by human–machine collaboration. This involves some sort of interactive proof editor, or other interface, with which a human can guide the search for proofs, the details of which are stored in, and some steps provided by, a computer. A recent effort within this field is making these tools use artificial intelligence to automate the formalization of ordinary mathematics. == Automated proof checking == Automated proof checking is the process of using software for checking proofs for correctness. It is one of the most developed fields in automated reasoning. Automated proof checking differs from automated theorem proving in that automated proof checking simply mechanically checks the formal workings of an existing proof, instead of trying to develop new proofs or theorems itself. Because of this, the task of automated proof verification is much simpler than that of automated theorem proving, allowing automated proof checking software to be much simpler than automated theorem proving software. Because of this small size, some automated proof checking systems can have less than a thousand lines of core code, and are thus themselves amenable to both hand-checking and automated software verification. The Mizar system, HOL Light, and Metamath are examples of automated proof checking systems. Automated proof checking can be done either as a batch operation, or interactively, as part of an interactive theorem proving system. == History == Automath, which was developed by Nicolaas Govert de Bruijn starting in 1967, is often considered the first proof checker and the first system to utilize the Curry–Howard correspondence between programs and proofs. Automath was used by L.S. van Benthem Jutting in 1977 to formalize Landau's Foundations of Analysis, which was the first formalization of the real numbers. In 1973, Robert Boyer and J Moore published Proving Theorems about LISP Functions which aimed to verify programs, not mathematics. Their theorem prover is now known as ACL2. In the 1970s, Edinburgh LCF introduced the idea of using a functional programming language as the metalanguage for a theorem prover, and led to the HOL family of proof assistants. The 1990s saw the rise of Rocq, (then known as Coq), which has been used for many large-scale formalization projects. Since the late 2010s, Lean, a proof assistant strongly influenced by Rocq, has become another popular choice, especially for formalizing mathematics. == System comparison == ACL2 – a programming language, a first-order logical theory, and a theorem prover (with both interactive and automatic modes) in the Boyer–Moore tradition. HOL theorem provers – A family of tools ultimately derived from the LCF theorem prover. In these systems, the logical core is a library of their programming language. Theorems represent new elements of the language and can only be introduced via "strategies" which guarantee logical correctness. Strategy composition gives users the ability to produce significant proofs with relatively few interactions with the system. Members of the family include: HOL4 – The "primary descendant", still under active development. Support for both Moscow ML and Poly/ML. Has a BSD-style license. HOL Light – A thriving "minimalist fork". OCaml based. ProofPower – Went proprietary, then returned to open source. Based on Standard ML. IMPS, An Interactive Mathematical Proof System. Isabelle is an interactive theorem prover where other systems can be encoded. Isabelle/HOL is its most popular instance, whose foundation is close to that of the HOL prover. Other instances include Isabelle/ZF and Isabelle/FOL. The main code-base is BSD-licensed, but the Isabelle distribution bundles many add-on tools with different licenses. Jape – Java based. Lean is both an interactive theorem prover and a functional, dependently-typed programming language. It is based on the calculus of inductive constructions with non-cumulative universes. Since version 4 (released in 2023), it is self-hosting. It can be used to formalise mathematics (and has a large, coherent library for formal mathematics), but also for software and hardware verification. LEGO Matita – A light system based on the calculus of inductive constructions. MINLOG – A proof assistant based on first-order minimal logic. Mizar – A proof assistant based on first-order logic, in a natural deduction style, and Tarski–Grothendieck set theory. PhoX – A proof assistant based on higher-order logic which is eXtensible. Prototype Verification System (PVS) – a proof language and system based on higher-order logic. Rocq (formerly named Coq) – A popular interactive theorem prover based on the calculus of inductive constructions. Theorem Proving System (TPS) and ETPS – Interactive theorem provers also based on simply typed lambda calculus, but based on an independent formulation of the logical theory and independent implementation. == User interfaces == A commonly used front-end for proof assistants was the Emacs-based Proof General, developed at the University of Edinburgh. Nowadays, many provers include their own editor. Rocq includes RocqIDE, which is based on OCaml/Gtk. Isabelle includes Isabelle/jEdit, which is based on jEdit and the Isabelle/Scala infrastructure for document-oriented proof processing. More recently, Visual Studio Code extensions have been developed for Rocq, Isabelle by Makarius Wenzel, and for Lean 4 by the leanprover developers. == Formalization extent == Freek Wiedijk has been keeping a ranking of proof assistants by the amount of formalized theorems out of a list of 100 well-known theorems. As of September 2025, only six systems have formalized proofs of more than 70% of the theorems, namely Isabelle, HOL Light, Lean, Rocq, Metamath and Mizar. == Notable formalized proofs == The following is a list of notable proofs that have been formalized within proof assistants.

    Read more →
  • Anna Ridler

    Anna Ridler

    Anna Ridler (born 1985) is an artist who works with machine learning, handmade archives and moving image. She builds her own datasets to expose the labour and ideology embedded in the systems that organise knowledge. Her work is held in the permanent collections of the Whitney Museum of American Art, the Victoria and Albert Museum, M+ and ZKM Center for Art and Media Karlsruhe, and has been exhibited widely at cultural institutions including Tate Modern, Barbican Centre, Centre Pompidou, The Photographers' Gallery, Taipei Fine Arts Museum, MIT Museum, Kunsthaus Graz, ZKM Center for Art and Media Karlsruhe and Ars Electronica. == Biography == Born in London in 1985, Ridler spent her childhood raised between Atlanta, Georgia and the United Kingdom. She obtained a Bachelor of Arts in English Literature and Language from Oxford University in 2007 and a Master of Arts in Information Experience Design from the Royal College of Art in 2017. == Art practice == Ridler's practice uses technology, and in particular machine learning, to investigate how naming, classification and financial speculation determine what can be seen and what is erased. A core element of Ridler's work lies in the creation of handmade data sets through a laborious process of selecting and classifying images and text. By creating her own data sets, Ridler is able to uncover and expose underlying themes and concepts while also inverting the usual process of scraping pre-classified images found in large databases on the Internet. She began working with machine learning as an artistic material in 2017, at a moment when the technology required building every dataset by hand; that constraint became the foundation of the practice. Her interests are in drawing, machine learning, data collection, storytelling and technology. == Work == Some of Ridler's most notable works to date fall within her ‘tulip series’ which explores the hysteria around tulip mania and compares it to the speculation and bubbles surrounding cryptocurrencies. The series is expressed in three forms: a photographic dataset in Myriad (Tulips), 2018; two iterations of machine generated videos in Mosaic Virus (2018) and Mosaic Virus (2019); and a website with an accompanied functioning decentralized application in Bloemenveiling (2019). === Myriad (Tulips) (2018) === I wanted to draw together ideas around capitalism, value, and the tangible and intangible nature of speculation, and collapse from two very different yet surprisingly similar moments in history. Myriad (Tulips) (2018) is an installation of ten thousand hand-labeled photographs forming a dataset of unique tulips. The ten thousand, or myriad of, photographs were taken by Ridler over the course of three months, roughly the length of a tulip season, spent in Utrecht. Each photograph is carefully affixed one by one with magnets to a specially painted black wall in a laborious process to form a seemingly precise grid. Myriad (Tulips) (2018) has been exhibited in AI: More than Human, Barbican Centre, London, UK (May 16 - August 26, 2019); Error—The Art of Imperfection, Ars Electronica Export, Berlin, Germany (November 17, 2018 – March 3, 2019); Peer to Peer, Shanghai Centre of Photography, Shanghai, China (December 8 - February 9, 2020). The work was featured in Bloomberg, It’s Nice That, and Hyperallergic. For Myriad (Tulips), Ridler was nominated for a Beazley Design of the Year award for her presentation of an alternative perspective on how to engage with artificial intelligence; demonstrating a departure from ownership and control of major corporations to a more personalized process of constructing and conceptualizing from the ground-up. === Mosaic Virus (2018, 2019) === Mosaic Virus (2018) is a single screen video installation displaying a grid of continually evolving tulips in bloom. For Mosaic Virus (2019) Ridler used three screens. The appearance of the tulips is controlled by artificial intelligence using fluctuations in the price of bitcoin. The stripes on the tulips' petals reflect the value of the cryptocurrency. Ridler draws parallels with the tulip mania of the 17th century; representing the hysteria and speculation around crypto-currencies. The work takes its name from the mosaic virus which caused stripes in tulip petals, subsequently increasing their desirability and leading to speculative prices. Ridler trained a general adversarial network (GAN) on the set of ten thousand photographs of individual tulips from her work Myriad (Tulips). She used a technique called spectral normalization to improve the output. The work was exhibited in Error—The Art of Imperfection, Ars Electronica Export, Berlin, Germany (November 17, 2018 – March 3, 2019). === Bloemenveiling (2019) === Bloemenveiling (2019) is an auction of artificial-intelligence-generated tulips on the blockchain in the form of a functioning decentralized application: http://bloemenveiling.bid. Ridler collaborated with senior research scientist at DeepMind, David Pfau to investigate whether blockchain could be used as a means of finding poetic substance within it. The piece interrogates the way technology drives human desire and economic dynamics by creating artificial scarcity. In the work, short moving image pieces of tulips created by generative adversarial networks are sold at auction using smart contracts on the Ethereum network. Each time a tulip is sold, thousands of computers around the world all work to verify the transaction, checking each other's work against each other. While the artificial intelligence behind the moving image pieces has the potential to generate infinite flowers, the enormous distributed network is used, at great environmental cost, to introduce scarcity to an otherwise limitless resource. Bloemenveiling was exhibited in Entangled Realities, HEK Basel, Basel, Switzerland in 2019. == Solo exhibitions == Anna Ridler, Circadian Bloom, ZKM Center for Art and Media, Karlsruhe, (2023) Anna Ridler, Time Blooms, Buk Seoul Museum of Art, Seoul, (2025) Anna Ridler, Trace Remains, Galerie Nagel Draxler, Cologne, (2026) Anna Ridler, Laws of Ordered Form, The Photographers' Gallery, London (2020); The Abstraction of Nature, Aksioma, Ljubljana (2020) == Awards and recognition == European Union EMAP Fellow (2018) DARE Art Prize (2018–2019) Featured in Thames & Hudson, Digital Art (1960s–Now) Featured in British Art: The Last 15 Years ABS Digital Artist of the Year (2025)

    Read more →
  • Document classification

    Document classification

    Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification. The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach. == "Content-based" versus "request-based" classification == Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries, that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification it could be the number of times given words appears in a document. Request-oriented classification (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier asks themself: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230). Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as policy-based classification: The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach. == Classification versus indexing == Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning subjects to documents ("subject indexing") but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf., Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore, assigning a subject term to a document in an index is equivalent to assigning that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents). == Automatic document classification (ADC) == Automatic document classification tasks can be divided into three sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information, and semi-supervised document classification, where parts of the documents are labeled by the external mechanism. There are several software products under various license models available. === Techniques === Automatic document classification techniques include: Artificial neural network Concept Mining Decision trees such as ID3 or C4.5 Expectation maximization (EM) Instantaneously trained neural networks Latent semantic indexing Multiple-instance learning Naive Bayes classifier Natural language processing approaches Rough set-based classifier Soft set-based classifier Support vector machines (SVM) K-nearest neighbour algorithms tf–idf == Applications == Classification techniques have been applied to spam filtering, a process which tries to discern E-mail spam messages from legitimate emails email routing, sending an email sent to a general address to a specific address or mailbox depending on topic language identification, automatically determining the language of a text genre classification, automatically determining the genre of a text readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system sentiment analysis, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. health-related classification using social media in public health surveillance article triage, selecting articles that are relevant for manual literature curation, for example as is being done as the first step to generate manually curated annotation databases in biology

    Read more →
  • Rumelhart Prize

    Rumelhart Prize

    The David E. Rumelhart Prize for Contributions to the Theoretical Foundations of Human Cognition was founded in 2001 in honor of the cognitive scientist David Rumelhart to introduce the equivalent of a Nobel Prize for cognitive science. It is awarded annually to "an individual or collaborative team making a significant contemporary contribution to the theoretical foundations of human cognition". The annual award is presented at the Cognitive Science Society meeting, where the recipient gives a lecture and receives a check for $100,000. At the conclusion of the ceremony, the next year's award winner is announced. The award is funded by the Robert J. Glushko and Pamela Samuelson Foundation. The Rumelhart Prize committee is independent of the Cognitive Science Society. However, the society provides a large and interested audience for the awards. == Selection Committee == As of 2022, the selection committee for the prize consisted of: Richard Cooper (chair) Dedre Gentner Robert J. Glushko Tania Lombrozo Steven T. Piantadosi Jesse Snedeker == Recipients ==

    Read more →
  • Model collapse

    Model collapse

    Model collapse, also known by other names such as "AI inbreeding", "AI cannibalism", "Habsburg AI", and "model autophagy disorder" or "MAD" is a phenomenon noted in artificial intelligence studies, where machine learning models gradually degrade due to errors coming from uncurated synthetic data, or due to training on the outputs of another model such as prior versions of itself. It is unclear to what extent the phenomenon threatens the long-term development of such models, and some techniques have been proposed to mitigate the effect. == Characteristics == Shumailov et al. coined the term to describe two specific stages to the degradation of machine learning models: early model collapse and late model collapse: In early model collapse, the model begins losing information about the tails of the distribution – mostly affecting minority data. Later work highlighted that early model collapse is hard to notice, since overall performance may appear to improve, while the model loses performance on minority data. In late model collapse, the model loses a significant proportion of its performance, confusing concepts and losing most of its variance. == Mechanism == Using synthetic data as training data can lead to issues with the quality and reliability of the trained model. Model collapse occurs for three main reasons: functional approximation errors sampling errors learning errors Importantly, it happens in even the simplest of models, where not all of the error sources are present. In more complex models the errors often compound, leading to faster collapse. == Disagreement over real-world impact == Some researchers and commentators on model collapse warn that the phenomenon could fundamentally threaten future generative AI development: As AI-generated data is shared on the Internet, it will inevitably end up in future training datasets, which are often crawled from the Internet. If training on "slop" (large quantities of unlabeled synthetic data) inevitably leads to model collapse, this could therefore pose a difficult problem. However, recently, other researchers have disagreed with this argument, showing that if synthetic data accumulates alongside human-generated data, model collapse is avoided. The researchers argue that data accumulating over time is a more realistic description of reality than deleting all existing data every year, and that the real-world impact of model collapse may not be as catastrophic as feared. An alternative branch of the literature investigates the use of machine learning detectors and watermarking to identify model generated data and filter it out. == Mathematical models of the phenomenon == === 1D Gaussian model === In 2024, a first attempt has been made at illustrating collapse for the simplest possible model — a single dimensional normal distribution fit using unbiased estimators of mean and variance, computed on samples from the previous generation. To make this more precise, we say that original data follows a normal distribution X 0 ∼ N ( μ , σ 2 ) {\displaystyle X^{0}\sim {\mathcal {N}}(\mu ,\sigma ^{2})} , and we possess M 0 {\displaystyle M_{0}} samples X j 0 {\displaystyle X_{j}^{0}} for j ∈ { 1 , … , M 0 } {\displaystyle j\in {\{\,1,\dots ,M_{0}\,{}\}}} . Denoting a general sample X j i {\displaystyle X_{j}^{i}} as sample j ∈ { 1 , … , M i } {\displaystyle j\in {\{\,1,\dots ,M_{i}\,{}\}}} at generation i {\displaystyle i} , then the next generation model is estimated using the sample mean and variance: μ i + 1 = 1 M i ∑ j X j i ; σ i + 1 2 = 1 M i − 1 ∑ j ( X j i − μ i + 1 ) 2 . {\displaystyle \mu _{i+1}={\frac {1}{M_{i}}}\sum _{j}X_{j}^{i};\quad \sigma _{i+1}^{2}={\frac {1}{M_{i}-1}}\sum _{j}(X_{j}^{i}-\mu _{i+1})^{2}.} Leading to a conditionally normal next generation model X j i + 1 | μ i + 1 , σ i + 1 ∼ N ( μ i + 1 , σ i + 1 2 ) {\displaystyle X_{j}^{i+1}|\mu _{i+1},\;\sigma _{i+1}\sim {\mathcal {N}}(\mu _{i+1},\sigma _{i+1}^{2})} . In theory, this is enough to calculate the full distribution of X j i {\displaystyle X_{j}^{i}} . However, even after the first generation, the full distribution is no longer normal: It follows a variance-gamma distribution. To continue the analysis, instead of writing the probability density function at each generation, it is possible to explicitly construct them in terms of independent random variables using Cochran's theorem. To be precise, μ 1 {\displaystyle \mu _{1}} and σ 1 {\displaystyle \sigma _{1}} are independent, with μ 1 ∼ N ( μ , σ 2 M 0 ) {\displaystyle \mu _{1}\sim {\mathcal {N}}\left(\mu ,{\frac {\sigma ^{2}}{M_{0}}}\right)} and ( M 0 − 1 ) σ 1 2 ∼ σ 2 Γ ( M 0 − 1 2 , 1 2 ) {\displaystyle (M_{0}-1)\,\sigma _{1}^{2}\sim \sigma ^{2}\,\Gamma \left({\frac {M_{0}-1}{2}},{\frac {1}{2}}\right)} , following a Gamma distribution. Denoting with Z {\displaystyle Z} Gaussian random variables distributed according to N ( 0 , 1 ) {\displaystyle {\mathcal {N}}(0,1)} and with S i {\displaystyle S^{i}} random variables distributed with 1 M i − 1 − 1 Γ ( M i − 1 − 1 2 , 1 2 ) {\displaystyle {\frac {1}{M_{i-1}-1}}\Gamma \left({\frac {M_{i-1}-1}{2}},{\frac {1}{2}}\right)} , it turns out to be possible to write samples at each generation as X j 0 = μ + σ Z j 0 , {\textstyle X_{j}^{0}=\mu +\sigma Z_{j}^{0},} X j 1 = μ + σ M 0 Z 1 + σ S 1 Z j 1 , {\textstyle X_{j}^{1}=\mu +{\frac {\sigma }{\sqrt {M_{0}}}}Z^{1}+\sigma {\sqrt {S^{1}}}Z_{j}^{1},} and more generally X j n = μ + σ M 0 Z 1 + σ M 1 S 1 Z 2 + ⋯ + σ M n − 1 S 1 × ⋯ × S n − 1 Z n + σ S 1 × ⋯ × S n Z j n . {\displaystyle X_{j}^{n}=\mu +{\frac {\sigma }{\sqrt {M_{0}}}}Z^{1}+{\frac {\sigma }{\sqrt {M_{1}}}}{\sqrt {S^{1}}}Z^{2}+\dots +{\frac {\sigma }{\sqrt {M_{n-1}}}}{\sqrt {S^{1}\times \dots \times S^{n-1}}}Z^{n}+\sigma {\sqrt {S^{1}\times \dots \times S^{n}}}Z_{j}^{n}.} Note, that these are not joint distributions, as Z n {\displaystyle Z^{n}} and S n {\displaystyle S^{n}} depend directly on Z j n − 1 {\displaystyle Z_{j}^{n-1}} , but when considering X j n {\displaystyle X_{j}^{n}} on its own the formula above provides all the information about the full distribution. To analyse the model collapse, we can first calculate variance and mean of samples at generation n {\displaystyle n} . This would tell us what kind of distributions we expect to arrive at after n {\displaystyle n} generations. It is possible to find its exact value in closed form, but the mean and variance of the square root of gamma distribution are expressed in terms of gamma functions, making the result quite clunky. Following, it is possible to expand all results to second order in each of 1 / M i {\displaystyle 1/M_{i}} , assuming each sample size to be large. It is then possible to show that 1 σ 2 Var ⁡ ( X j n ) = 1 M 0 + 1 M 1 + ⋯ + 1 M n − 1 + 1 + O ( M i − 2 ) . {\displaystyle {\frac {1}{\sigma ^{2}}}\operatorname {Var} (X_{j}^{n})={\frac {1}{M_{0}}}+{\frac {1}{M_{1}}}+\dots +{\frac {1}{M_{n-1}}}+1+{\mathcal {O}}\left(M_{i}^{-2}\right).} And if all sample sizes M i = M {\displaystyle M_{i}=M} are constant, this diverges linearly as n → ∞ {\displaystyle n\to \infty } : Var ⁡ ( X j n ) = σ 2 ( 1 + n M ) ; E ( X j n ) = μ . {\displaystyle \operatorname {Var} (X_{j}^{n})=\sigma ^{2}\left(1+{\frac {n}{M}}\right);\quad \mathbb {E} (X_{j}^{n})=\mu .} This is the same scaling as for a single dimensional Gaussian random walk. However, divergence of the variance of X j n {\displaystyle X_{j}^{n}} does not directly provide any information about the corresponding estimates of μ n + 1 {\displaystyle \mu _{n+1}} and σ n + 1 {\displaystyle \sigma _{n+1}} , particularly how different they are from the original μ {\displaystyle \mu } and σ {\displaystyle \sigma } . It turns out to be possible to calculate the distance between the true distribution and the approximated distribution at step n + 1 {\displaystyle n+1} , using the Wasserstein-2 distance (which is also sometimes referred to as risk): E [ W 2 2 ( N ( μ , σ 2 ) , N ( μ n + 1 , σ n + 1 2 ) ) ] = 3 2 σ 2 ( 1 M 0 + 1 M 1 + ⋯ + 1 M n ) + O ( M i − 2 ) , {\displaystyle \mathbb {E} \left[\mathbb {W} _{2}^{2}\left({\mathcal {N}}(\mu ,\sigma ^{2}),{\mathcal {N}}(\mu _{n+1},\sigma _{n+1}^{2})\right)\right]={\frac {3}{2}}\sigma ^{2}\left({\frac {1}{M_{0}}}+{\frac {1}{M_{1}}}+\dots +{\frac {1}{M_{n}}}\right)+{\mathcal {O}}\left(M_{i}^{-2}\right),} Var ⁡ [ W 2 2 ( N ( μ , σ 2 ) , N ( μ n + 1 , σ n + 1 2 ) ) ] = 1 2 σ 4 ( 3 M 0 2 + 3 M 1 2 + ⋯ + 3 M n 2 + ∑ i ≠ j 4 M i M j ) + O ( M i − 3 ) . {\displaystyle \operatorname {Var} \left[\mathbb {W} _{2}^{2}\left({\mathcal {N}}(\mu ,\sigma ^{2}),{\mathcal {N}}(\mu _{n+1},\sigma _{n+1}^{2})\right)\right]={\frac {1}{2}}\sigma ^{4}\left({\frac {3}{M_{0}^{2}}}+{\frac {3}{M_{1}^{2}}}+\dots +{\frac {3}{M_{n}^{2}}}+\sum _{i\neq j}{\frac {4}{M_{i}M_{j}}}\right)+{\mathcal {O}}\left(M_{i}^{-3}\right).} This directly shows why model collapse occurs in this simple model. Due to errors from re-sampling the approximated distribution, each generation ends up corresponding to a

    Read more →