AI Data Facility

AI Data Facility — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Solomonoff's theory of inductive inference

    Solomonoff's theory of inductive inference

    Solomonoff's theory of inductive inference proves that, under its common sense assumptions (axioms), the best possible scientific model is the shortest algorithm that generates the empirical data under consideration. In addition to the choice of data, other assumptions are that, to avoid the post-hoc fallacy, the programming language must be chosen prior to the data and that the environment being observed is generated by an unknown algorithm. This is also called a theory of induction. Due to its basis in the dynamical (state-space model) character of Algorithmic Information Theory, it encompasses statistical as well as dynamical information criteria for model selection. It was introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some universal prior, that is, a prior that assigns a positive probability to any computable theory. Solomonoff proved that this induction is incomputable (or more precisely, lower semi-computable), but noted that "this incomputability is of a very benign kind", and that it "in no way inhibits its use for practical prediction" (as it can be approximated from below more accurately with more computational resources). It is only "incomputable" in the benign sense that no scientific consensus is able to prove that the best current scientific theory is the best of all possible theories. However, Solomonoff's theory does provide an objective criterion for deciding among the current scientific theories explaining a given set of observations. Solomonoff's induction naturally formalizes Occam's razor by assigning larger prior credences to theories that require a shorter algorithmic description. == Origin == === Philosophical === The theory is based in philosophical foundations, and was founded by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and the Principle of Multiple Explanations. All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence builds upon this to calculate the expected value of an action. === Principle === Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand, recall that Bayesianism derives the posterior probability P [ T | D ] {\displaystyle \mathbb {P} [T|D]} of a theory T {\displaystyle T} given data D {\displaystyle D} by applying Bayes rule, which yields P [ T | D ] = P [ D | T ] P [ T ] P [ D | T ] P [ T ] + ∑ A ≠ T P [ D | A ] P [ A ] {\displaystyle \mathbb {P} [T|D]={\frac {\mathbb {P} [D|T]\mathbb {P} [T]}{\mathbb {P} [D|T]\mathbb {P} [T]+\sum _{A\neq T}\mathbb {P} [D|A]\mathbb {P} [A]}}} where theories A {\displaystyle A} are alternatives to theory T {\displaystyle T} . For this equation to make sense, the quantities P [ D | T ] {\displaystyle \mathbb {P} [D|T]} and P [ D | A ] {\displaystyle \mathbb {P} [D|A]} must be well-defined for all theories T {\displaystyle T} and A {\displaystyle A} . In other words, any theory must define a probability distribution over observable data D {\displaystyle D} . Solomonoff's induction essentially boils down to demanding that all such probability distributions be computable. Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions. Solomonoff's induction then allows to make probabilistic predictions of future data F {\displaystyle F} , by simply obeying the laws of probability. Namely, we have P [ F | D ] = E T [ P [ F | T , D ] ] = ∑ T P [ F | T , D ] P [ T | D ] {\displaystyle \mathbb {P} [F|D]=\mathbb {E} _{T}[\mathbb {P} [F|T,D]]=\sum _{T}\mathbb {P} [F|T,D]\mathbb {P} [T|D]} . This quantity can be interpreted as the average predictions P [ F | T , D ] {\displaystyle \mathbb {P} [F|T,D]} of all theories T {\displaystyle T} given past data D {\displaystyle D} , weighted by their posterior credences P [ T | D ] {\displaystyle \mathbb {P} [T|D]} . === Mathematical === The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of probability) thus the probabilities must roughly decrease as we enumerate the infinite set of all programs, otherwise S will be strictly greater than one. To be more precise, for every ϵ {\displaystyle \epsilon } > 0, there is some length l such that the probability of all programs longer than l is at most ϵ {\displaystyle \epsilon } . This does not, however, preclude very long programs from having very high probability. Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion. == Mathematical guarantees == === Solomonoff's completeness === The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the Kolmogorov complexity of the (stochastic) data generating process. The errors can be measured using the Kullback–Leibler divergence or the square of the difference between the induction's prediction and the probability assigned by the (stochastic) data generating process. === Solomonoff's uncomputability === Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that computability and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this is derived from a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, by choosing the computable environment that negates the computable induction's prediction. This fact can be regarded as an instance of the no free lunch theorem. == Modern applications == === Artificial intelligence === Though Solomonoff's inductive inference is not computable, several AIXI-derived algorithms approximate it in order to make it run on a modern computer. The more computing power they are given, the closer their predictions are to the predictions of inductive inference (their mathematical limit is Solomonoff's inductive inference). Another direction of inductive inference is based on E. Mark Gold's model of learning in the limit from 1967 and has developed since then more and more models of learning. The general scenario is the following: Given a class S of computable functions, is there a learner (that is, recursive functional) which for any input of the form (f(0),f(1),...,f(n)) outputs a hypothesis (an index e with respect to a previously agreed on acceptable numbering of all computable functions; the indexed function may be required consistent with the given values of f). A learner M learns a function f if almost all its hypotheses are the same index e, which generates the function f; M learns S if M learns every f in S. Basic results are that all recursively enumerable classes of functions are learnable while the class REC of all computable functions is not learnable. Many related models have been considered and also the learning of classes of recursively enumerable sets from positive data is a topic studied from Gold's pioneering paper in 1967 onwards. A far reaching extension of the Gold’s approach is developed by Schmidhuber's theory of generalized Kolmogorov complexities, which are kinds of super-recursive algorithms.

    Read more →
  • Termcap

    Termcap

    Termcap (terminal capability) is a legacy software library and database used on Unix-like computers that enables programs to use display computer terminals in a terminal-independent manner, which greatly simplifies the process of writing portable text mode applications. It was superseded by the terminfo database used by ncurses, tput, and other programs. A termcap database can describe the capabilities of hundreds of different display terminals. This allows programs to have character-based display output, independent of the type of terminal. On-screen text editors such as vi and Emacs are examples of programs that may use termcap. Other programs are listed in the Termcap category. Access to the termcap database was usually provided by separate libraries, e.g. GNU Termcap. Examples of what the database describes: how many columns wide the display is what string to send to move the cursor to an arbitrary position (including how to encode the row and column numbers) how to scroll the screen up one or several lines how much padding is needed for such a scrolling operation. == History == Bill Joy wrote the first termcap library in 1978 for the Berkeley Unix operating system; it has since been ported to most Unix and Unix-like environments, even OS-9. Joy's design was reportedly influenced by the design of the terminal data store in the earlier Incompatible Timesharing System. == Data model == Termcap databases consist of one or more descriptions of terminals. === Indices === Each description must contain the canonical name of the terminal. It may also contain one or more aliases for the name of the terminal. The canonical name or aliases are the keys by which the library searches the termcap database. === Data values === The description contains one or more capabilities, which have conventional names. The capabilities are typed: boolean, numeric and string. The termcap library has no predetermined type for each capability name. It determines the types of each capability by the syntax: string capabilities have an "=" between the capability name and its value, numeric capabilities have a "#" between the capability name and its value, and boolean capabilities have no associated value (they are always true if specified). Applications which use termcap do expect specific types for the commonly used capabilities, and obtain the values of capabilities from the termcap database using library calls that return successfully only when the database contents matches the assumed type. === Hierarchy === Termcap descriptions can be constructed by including the contents of one description in another, suppressing capabilities from the included description or overriding or adding capabilities. No matter what storage model is used, the termcap library constructs the terminal description from the requested description, including, suppressing or overriding at the time of the request. == Storage model == Termcap data is stored as text, making it simple to modify. The text can be retrieved by the termcap library from files or environment variables. === Environment variables === The TERM environment variable contains the terminal type name. The TERMCAP environment variable may contain a termcap database. It is most often used to store a single termcap description, set by a terminal emulator to provide the terminal's characteristics to the shell and dependent programs. The TERMPATH environment variable is supported by newer termcap implementations and defines a search path for termcap files. === Flat file === The original (and most common) implementation of the termcap library retrieves data from a flat text file. Searching a large termcap file, e.g., 500 kB, can be slow. To aid performance, a utility such as reorder is used to put the most frequently used entries near the beginning of the file. === Hashed database === 4.4BSD based implementations of termcap store the terminal description in a hashed database (e.g., something like Berkeley DB version 1.85). These store two types of records: aliases which point to the canonical entry, and the canonical entry itself. The text of the termcap entry is stored literally. == Limitations and extensions == The original termcap implementation was designed to use little memory: the first name is two characters, to fit in 16 bits capability names are two characters descriptions are limited to 1023 characters. only one termcap entry with its definitions can be included, and must be at the end. Newer implementations of the termcap interface generally do not require the two-character name at the beginning of the entry. Capability names are still two characters in all implementations. The tgetent function used to read the terminal description uses a buffer whose size must be large enough for the data, and is assumed to be 1024 characters. Newer implementations of the termcap interface may relax this constraint by allowing a null pointer in place of the fixed buffer, or by hiding the data which would not fit, e.g., via the ZZ capability in NetBSD termcap. The terminfo library interface also emulates the termcap interface, and does not actually use the fixed-size buffer. The terminfo library's emulation of termcap allows multiple other entries to be included without restricting the position. A few other newer implementations of the termcap library may also provide this ability, though it is not well documented. == Obsolete features == A special capability, the "hz" capability, was defined specifically to support the Hazeltine 1500 terminal, which had the unfortunate characteristic of using the ASCII tilde character ('~') as a control sequence introducer. In order to support that terminal, not only did code that used the database have to know about using the tilde to introduce certain control sequences, but it also had to know to substitute another printable character for any tildes in the displayed text, since a tilde in the text would be interpreted by the terminal as the start of a control sequence, resulting in missing text and screen garbling. Additionally, attribute markers (such as start and end of underlining) themselves took up space on the screen. Comments in the database source code often referred to this as "Hazeltine braindamage". Since the Hazeltine 1500 was a widely used terminal in the late 1970s, it was important for applications to be able to deal with its limitations.

    Read more →
  • Shape table

    Shape table

    Shape tables are a feature of the Apple II ROMs which allows for manipulation of small images encoded as a series of vectors. An image (or shape) can be drawn in the high-resolution graphics mode—with scaling and rotation—via software routines in the ROM. Shape tables are supported via Applesoft BASIC and from machine code in the "Programmer's Aid" package that was bundled with the original Integer BASIC ROMs for that computer. Applesoft's high-resolution graphics routines were not optimized for speed, so shape tables were not typically used for performance-critical software such as games, which were typically written in assembly language and used pre-shifted bitmap shapes. Shape tables were used primarily for static shapes and sometimes for fancy text; Beagle Bros offered a number of fonts in Font Mechanic as Applesoft shape tables. == Technical details == The vectors of a two-dimensional graphic, each encoding a direction from the previous pixel along with a flag indicating whether the new pixel should be illuminated or not, were encoded up to three in a byte. These were stored in a table via the Monitor or the POKE command. From there, the graphic could be referenced by number (a table could contain up to 255 shapes), and built-in Applesoft routines permitted scaling, rotating, and drawing or erasing the shape. An XOR mode was also available to allow the shape to be visible on any color background; this had the advantage, also, of allowing the shape to be easily erased by redrawing it. Apple did not provide any utilities for creating shape tables; they had to be created by hand, usually by plotting on graph paper, then calculating the hexadecimal values and entering them into the computer. Beagle Bros created a shape table editing program, which eliminated the "number crunching", called Apple Mechanic, and a related program, Font Mechanic.

    Read more →
  • Image

    Image

    An image or picture is a visual representation. An image can be two-dimensional, such as a drawing, painting, or photograph, or three-dimensional, such as a carving or sculpture. Images may be displayed through other media, including a projection on a surface, activation of electronic signals, or digital displays; they can also be reproduced through mechanical means, such as photography, printmaking, or photocopying. Images can also be animated through digital or physical processes. In the context of signal processing, an image is a distributed amplitude of color(s). In optics, the term image (or optical image) refers specifically to the reproduction of an object formed by light waves coming from the object. A volatile image exists or is perceived only for a short period. This may be a reflection of an object by a mirror, a projection of a camera obscura, or a scene displayed on a cathode-ray tube. A fixed image, also called a hard copy, is one that has been recorded on a material object, such as paper or textile. A mental image exists in an individual's mind as something one remembers or imagines. The subject of an image does not need to be real; it may be an abstract concept such as a graph or function or an imaginary entity. For a mental image to be understood outside of an individual's mind, however, there must be a way of conveying that mental image through the words or visual productions of the subject. == Characteristics == === Two-dimensional images === The broader sense of the word 'image' also encompasses any two-dimensional figure, such as a map, graph, pie chart, painting, or banner. In this wider sense, images can also be rendered manually, such as by drawing, the art of painting, or the graphic arts (such as lithography or etching). Additionally, images can be rendered automatically through printing, computer graphics technology, or a combination of both methods. A two-dimensional image does not need to use the entire visual system to be a visual representation. An example of this is a grayscale ("black and white") image, which uses the visual system's sensitivity to brightness across all wavelengths without taking into account different colors. A black-and-white visual representation of something is still an image, even though it does not fully use the visual system's capabilities. On the other hand, some processes can be used to create visual representations of objects that are otherwise inaccessible to the human visual system. These include microscopy for the magnification of minute objects, telescopes that can observe objects at great distances, X-rays that can visually represent the interior structures of the human body (among other objects), magnetic resonance imaging (MRI), positron emission tomography (PET scans), and others. Such processes often rely on detecting electromagnetic radiation that occurs beyond the light spectrum visible to the human eye and converting such signals into recognizable images. === Three-dimensional images === Aside from sculpture and other physical activities that can create three-dimensional images from solid material, some modern techniques, such as holography, can create three-dimensional images that are reproducible but intangible to human touch. Some photographic processes can now render the illusion of depth in an otherwise "flat" image, but "3-D photography" (stereoscopy) or "3-D film" are optical illusions that require special devices such as eyeglasses to create the illusion of depth. === Moving images === "Moving" two-dimensional images are actually illusions of movement perceived when still images are displayed in sequence, each image lasting less, and sometimes much less, than a fraction of a second. The traditional standard for the display of individual frames by a motion picture projector has been 24 frames per second (FPS) since at least the commercial introduction of "talking pictures" in the late 1920s, which necessitated a standard for synchronizing images and sounds. Even in electronic formats such as television and digital image displays, the apparent "motion" is actually the result of many individual lines giving the impression of continuous movement. This phenomenon has often been described as "persistence of vision": a physiological effect of light impressions remaining on the retina of the eye for very brief periods. Even though the term is still sometimes used in popular discussions of movies, it is not a scientifically valid explanation. Other terms emphasize the complex cognitive operations of the brain and the human visual system. "Flicker fusion", the "phi phenomenon", and "beta movement" are among the terms that have replaced "persistence of vision", though no one term seems adequate to describe the process. == Cultural and other uses == Image-making seems to have been common to virtually all human cultures since at least the Paleolithic era. Prehistoric examples of rock art—including cave paintings, petroglyphs, rock reliefs, and geoglyphs—have been found on every inhabited continent. Many of these images seem to have served various purposes: as a form of record-keeping; as an element of spiritual, religious, or magical practice; or even as a form of communication. Early writing systems, including hieroglyphics, ideographic writing, and even the Roman alphabet, owe their origins in some respects to pictorial representations. === Meaning and signification === Images of any type may convey different meanings and sensations for individual viewers, regardless of whether the image's creator intended them. An image may be taken simply as a more or less "accurate" copy of a person, place, thing, or event. It may represent an abstract concept, such as the political power of a ruler or ruling class, a practical or moral lesson, an object for spiritual or religious veneration, or an object—human or otherwise—to be desired. It may also be regarded for its purely aesthetic qualities, rarity, or monetary value. Such reactions can depend on the viewer's context. A religious image in a church may be regarded differently than the same image mounted in a museum. Some might view it simply as an object to be bought or sold. Viewers' reactions will also be guided or shaped by their education, class, race, and other contexts. The study of emotional sensations and their relationship to any given image falls into the categories of aesthetics and the philosophy of art. While such studies inevitably deal with issues of meaning, another approach to signification was suggested by the American philosopher, logician, and semiotician Charles Sanders Peirce. "Images" are one type of the broad category of "signs" proposed by Peirce. Although his ideas are complex and have changed over time, the three categories of signs that he distinguished stand out: The "icon," which relates to an object by resemblance to some quality of the object. A painted or photographed portrait is an icon by virtue of its resemblance to the painting's or photograph's subject. A more abstract representation, such as a map or diagram, can also be an icon. The "index," which relates to an object by some real connection. For example, smoke may be an index of fire, or the temperature recorded on a thermometer may be an index of a patient's illness or health. The "symbol," which lacks direct resemblance or connection to an object but whose association is arbitrarily assigned by the creator or dictated by cultural and historical habit, convention, etc. The color red, for example, may connote rage, beauty, prosperity, political affiliation, or other meanings within a given culture or context; the Swedish film director Ingmar Bergman claimed that his use of the color in his 1972 film Cries and Whispers came from his personal visualization of the human soul. A single image may exist in all three categories at the same time. The Statue of Liberty provides an example. While there have been countless two-dimensional and three-dimensional "reproductions" of the statue (i.e., "icons" themselves), the statue itself exists as an "icon" by virtue of its resemblance to a human woman (or, more specifically, previous representations of the Roman goddess Libertas or the female model used by the artist Frederic-Auguste Bartholdi). an "index" representing New York City or the United States of America in general due to its placement in New York Harbor, or with "immigration" from its proximity to the immigration center at Ellis Island. a "symbol" as a visualization of the abstract concept of "liberty" or "freedom" or even "opportunity" or "diversity". === Critiques of imagery === The nature of images, whether three-dimensional or two-dimensional, created for a specific purpose or only for aesthetic pleasure, has continued to provoke questions and even condemnation at different times and places. In his dialogue, The Republic, the Greek philosopher Plato described our apparent reality as a copy of a higher order of universal forms.

    Read more →
  • Record sealing

    Record sealing

    Record sealing is the process of making public records inaccessible to the public. In many cases, a person with a sealed record gains the legal right to deny or not acknowledge anything to do with the arrest and the legal proceedings from the case itself. Records are commonly sealed in a number of situations: Sealed birth records (typically after adoption or determination of paternity) Juvenile criminal records may be sealed Other types of cases involving juveniles may be sealed, anonymized, or pseudonymized ("impounded"); e.g., child sex offense or custody cases Cases using witness protection information may be partly sealed Cases involving trade secrets Cases involving state secrets == Filing under seal in US court == Normally, records should not be filed under seal without a court permission. However, FRCP 5.2 requires that sensitive text – like Social Security number, Taxpayer Identification Number, birthday, bank accounts, and children’s names – should be redacted off the filings made with the court and accompanying exhibits. A person making a redacted filing can file an unredacted copy under seal, or the Court can choose to order later that an additional filing be made under seal without redaction. Alternately, the filing party may ask the court’s permission to file some exhibits completely under seal. When the document is filed "under seal", it should have a clear indication for the court clerk to file it separately – most often by stamping words "Filed Under Seal" on the bottom of each page. Person making filing should also provide instructions to the court clerk that the document needs to be filed "under seal". Courts often have specific requirements to these filings in their Local Rules. == Difference from expungement == Expungement, which is a physical destruction, namely a complete erasure of one's criminal records, and therefore usually carries a higher standard, differs from record sealing, which is only to restrict the public's access to records, so that only certain law enforcement agencies or courts, under special circumstances, will have access to them. A record seal will greatly improve the chance of employment, as employers will not have access to damning records. There are occasions, like expungement, where one can truthfully state under oath that they have never been convicted before. Most of the time, a record seal has more relaxed requirements than an expungement. If an expungement is not allowed with a case, then sealing a record may be the best bet. Different states have different terms for what constitutes sealing of a record. == Cybersecurity incidents involving sealed records == Several cybersecurity incidents have demonstrated that sealed court documents are not always secure in practice, with vulnerabilities and data breaches exposing sensitive information. In January 2021, following the SolarWinds cyber attack, the U.S. Bankruptcy Court United States District Court for the District of Nevada announced that its Case Management/Electronic Case Files CM/ECF system had been potentially compromised. The judiciary stated that additional safeguards were being implemented to protect filings, and that the review of the incident and its impact was ongoing. Reports noted that the breach raised concerns about exposure of highly sensitive and sealed documents submitted through the CM/ECF system. In 2023, security researcher Jason Parker, following a tip from an activist, identified flaws in online court systems that exposed sealed records including confidential testimony and medical records through publicly accessible portals. In 2024, a cyber intrusion targeting attorneys in a civil case involving Representative Matt Gaetz led to the unauthorized access and leak of sealed depositions and related records. The breach exposed confidential testimony and financial records, some of which were later reported by news outlets, raising concerns about the security of electronically stored legal materials and the handling of sealed filings. In 2025, multiple reports confirmed that the federal judiciary's CM/ECF and PACER (law) filing system was compromised, exposing sealed indictments, confidential informant information, and other sensitive filings. Some courts temporarily reverted to paper-based filing to mitigate the risks of further disclosure. The FBI later confirmed that the breach had exposed sealed records, and investigators suspected foreign state actors were involved. == GAO publications referencing sealed records == Closed Criminal Plea and Sentencing Proceedings (1983) – Reviewed Department of Justice policies on closing plea and sentencing hearings. GAO noted that sealed transcripts should be unsealed once the reasons for closure no longer applied. Information on Plea Agreements and Settlements in Defense Procurement Fraud Cases (1992) – Examined outcomes of procurement fraud prosecutions. GAO observed that in some instances the results were sealed from public access. Military Recruiting: More Needs to Be Done to Better Screen Applicants and Detect Fraud (1999) – Investigated fraudulent enlistments in the armed forces. The report highlighted that sealed juvenile records often prevented recruiters from discovering prior offenses. Social Security Numbers: Governments Could Do More to Reduce Display in Public Records (2004) – Analyzed risks associated with SSN availability in state and local records. GAO pointed out that some categories of records, such as adoption proceedings, were sealed and less likely to expose identifiers. Social Security Numbers: Stronger Safeguards Needed to Protect Privacy (2005 testimony) – Testimony before Congress reiterating concerns over SSN exposure in public records, while noting that sealed categories (e.g., adoption) were exceptions. U.S. Supreme Court: Policies and Perspectives on Video and Audio Coverage of Appellate Court Proceedings (2016) – Surveyed appellate court policies on courtroom media coverage. The report acknowledged distinctions between public filings, confidential submissions, and sealed materials. Evictions: National Data Are Limited and Challenging to Collect (2024) – Examined nationwide eviction data. GAO reported that in some states eviction records may be sealed or expunged, limiting researchers' ability to compile datasets. DOD Fraud Risk Management: Enhanced Data and Collaboration Could Improve Efforts (2024) – Reviewed Department of Defense fraud-risk management. GAO noted that some adjudicative records in its dataset were sealed, restricting completeness of oversight data.

    Read more →
  • Molecular graphics

    Molecular graphics

    Molecular graphics is the discipline and philosophy of studying molecules and their properties through graphical representation. IUPAC limits the definition to representations on a "graphical display device". Ever since Dalton's atoms and Kekulé's benzene, there has been a rich history of hand-drawn atoms and molecules, and these representations have had an important influence on modern molecular graphics. Colour molecular graphics are often used on chemistry journal covers artistically. == History == Prior to the use of computer graphics in representing molecular structure, Robert Corey and Linus Pauling developed a system for representing atoms or groups of atoms from hard wood on a scale of 1 inch = 1 angstrom connected by a clamping device to maintain the molecular configuration. These early models also established the CPK coloring scheme that is still used today to differentiate the different types of atoms in molecular models (e.g. carbon = black, oxygen = red, nitrogen = blue, etc). This early model was improved upon in 1966 by W.L. Koltun and are now known as Corey-Pauling-Koltun (CPK) models. The earliest efforts to produce models of molecular structure was done by Project MAC using wire-frame models displayed on a cathode ray tube in the mid 1960s. In 1965, Carroll Johnson distributed the Oak Ridge thermal ellipsoid plot (ORTEP) that visualized molecules as a ball-and-stick model with lines representing the bonds between atoms and ellipsoids to represent the probability of thermal motion. Thermal ellipsoid plots quickly became the de facto standard used in the display of X-ray crystallography data, and are still in wide use today. The first practical use of molecular graphics was a simple display of the protein myoglobin using a wireframe representation in 1966 by Cyrus Levinthal and Robert Langridge working at Project MAC. Among the milestones in high-performance molecular graphics was the work of Nelson Max in "realistic" rendering of macromolecules using reflecting spheres. Initially much of the technology concentrated on high-performance 3D graphics. During the 1970s, methods for displaying 3D graphics using cathode ray tubes were developed using continuous tone computer graphics in combination with electro-optic shutter viewing devices. The first devices used an active shutter 3D system, generating different perspective views for the left and right channel to provide the illusion of three-dimensional viewing. Stereoscopic viewing glasses were designed using lead lanthanum zirconate titanate (PLZT) ceramics as electronically controlled shutter elements. Active 3D glasses require batteries and work in concert with the display to actively change the presentation by the lenses to the wearer's eyes. Many modern 3D glasses use a passive, polarized 3D system that enables the wearer to visualize 3D effects based on their own perception. Passive 3D glasses are more common today since they are less expensive. The requirements of macromolecular crystallography also drove molecular graphics because the traditional techniques of physical model-building could not scale. The first two protein structures solved by molecular graphics without the aid of the Richards' Box were built with Stan Swanson's program FIT on the Vector General graphics display in the laboratory of Edgar Meyer at Texas A&M University: First Marge Legg in Al Cotton's lab at A&M solved a second, higher-resolution structure of staph. nuclease (1975) and then Jim Hogle solved the structure of monoclinic lysozyme in 1976. A full year passed before other graphics systems were used to replace the Richards' Box for modelling into density in 3-D. Alwyn Jones' FRODO program (and later "O") were developed to overlay the molecular electron density determined from X-ray crystallography and the hypothetical molecular structure. === Timeline === == Types == === Ball-and-stick models === In the ball-and-stick model, atoms are drawn as small sphered connected by rods representing the chemical bonds between them. === Space-filling models === In the space-filling model, atoms are drawn as solid spheres to suggest the space they occupy, in proportion to their van der Waals radii. Atoms that share a bond overlap with each other. === Surfaces === In some models, the surface of the molecule is approximated and shaded to represent a physical property of the molecule, such as electronic charge density. === Ribbon diagrams === Ribbon diagrams are schematic representations of protein structure and are one of the most common methods of protein depiction used today. The ribbon shows the overall path and organization of the protein backbone in 3D, and serves as a visual framework on which to hang details of the full atomic structure, such as the balls for the oxygen atoms bound to the active site of myoglobin in the adjacent image. Ribbon diagrams are generated by interpolating a smooth curve through the polypeptide backbone. α-helices are shown as coiled ribbons or thick tubes, β-strands as arrows, and non-repetitive coils or loops as lines or thin tubes. The direction of the polypeptide chain is shown locally by the arrows, and may be indicated overall by a colour ramp along the length of the ribbon.

    Read more →
  • User-defined function

    User-defined function

    A user-defined function (UDF) is a function provided by the user of a program or environment, in a context where the usual assumption is that functions are built into the program or environment. UDFs are usually written for the requirement of its creator. == BASIC language == In some old implementations of the BASIC programming language, user-defined functions are defined using the "DEF FN" syntax. More modern dialects of BASIC are influenced by the structured programming paradigm, where most or all of the code is written as user-defined functions or procedures, and the concept becomes practically redundant. == COBOL language == In the COBOL programming language, a user-defined function is an entity that is defined by the user by specifying a FUNCTION-ID paragraph. A user-defined function must return a value by specifying the RETURNING phrase of the procedure division header and they are invoked using the function-identifier syntax. See the ISO/IEC 1989:2014 Programming Language COBOL standard for details. As of May 2022, the IBM Enterprise COBOL for z/OS 6.4 (IBM COBOL) compiler contains support for user-defined functions. == Databases == In relational database management systems, a user-defined function provides a mechanism for extending the functionality of the database server by adding a function, that can be evaluated in standard query language (usually SQL) statements. The SQL standard distinguishes between scalar and table functions. A scalar function returns only a single value (or NULL), whereas a table function returns a (relational) table comprising zero or more rows, each row with one or more columns. User-defined functions in SQL are declared using the CREATE FUNCTION statement. For example, a user-defined function that converts Celsius to Fahrenheit (a temperature scale used in USA) might be declared like this: Once created, a user-defined function may be used in expressions in SQL statements. For example, it can be invoked where most other intrinsic functions are allowed. This also includes SELECT statements, where the function can be used against data stored in tables in the database. Conceptually, the function is evaluated once per row in such usage. For example, assume a table named Elements, with a row for each known chemical element. The table has a column named BoilingPoint for the boiling point of that element, in Celsius. The query would retrieve the name and the boiling point from each row. It invokes the CtoF user-defined function as declared above in order to convert the value in the column to a value in Fahrenheit. Each user-defined function carries certain properties or characteristics. The SQL standard defines the following properties: Language - defines the programming language in which the user-defined function is implemented; examples include SQL, C, C# and Java. Parameter style - defines the conventions that are used to pass the function parameters and results between the implementation of the function and the database system (only applicable if language is not SQL). Specific name - a name for the function that is unique within the database. Note that the function name does not have to be unique, considering overloaded functions. Some SQL implementations require that function names are unique within a database, and overloaded functions are not allowed. Determinism - specifies whether the function is deterministic or not. The determinism characteristic has an influence on the query optimizer when compiling a SQL statement. SQL-data access - tells the database management system whether the function contains no SQL statements (NO SQL), contains SQL statements but does not access any tables or views (CONTAINS SQL), reads data from tables or views (READS SQL DATA), or actually modifies data in the database (MODIFIES SQL DATA). User-defined functions should not be confused with stored procedures. Stored procedures allow the user to group a set of SQL commands. A procedure can accept parameters and execute its SQL statements depending on those parameters. A procedure is not an expression and, thus, cannot be used like user-defined functions. Some database management systems allow the creation of user defined functions in languages other than SQL. Microsoft SQL Server, for example, allows the user to use .NET languages including C# for this purpose. DB2 and Oracle support user-defined functions written in C or Java programming languages. === SQL Server 2000 === There are three types of UDF in Microsoft SQL Server 2000: scalar functions, inline table-valued functions, and multistatement table-valued functions. Scalar functions return a single data value (not a table) with RETURNS clause. Scalar functions can use all scalar data types, with exception of timestamp and user-defined data types. Inline table-valued functions return the result set of a single SELECT statement. Multistatement table-valued functions return a table, which was built with many TRANSACT-SQL statements. User-defined functions can be invoked from a query like built‑in functions such as OBJECT_ID, LEN, DATEDIFF, or can be executed through an EXECUTE statement like stored procedures. Performance Notes: User-defined functions are subroutines made of one or more Transact-SQL statements that can be used to encapsulate code for reuse. It takes zero or more arguments and evaluates a return value. Has both control-flow and DML statements in its body similar to stored procedures. Does not allow changes to any Global Session State, like modifications to database or external resource, such as a file or network. Does not support output parameter. DEFAULT keyword must be specified to pass the default value of parameter. Errors in UDF cause UDF to abort which, in turn, aborts the statement that invoked the UDF. === Apache Hive === Apache Hive defines, in addition to the regular user-defined functions (UDF), also user-defined aggregate functions (UDAF) and table-generating functions (UDTF). Hive enables developers to create their own custom functions with Java. === Apache Doris === Apache Doris, an open-source real-time analytical database, allows external users to contribute their own UDFs written in C++ to it.

    Read more →
  • Real-time computer graphics

    Real-time computer graphics

    Real-time computer graphics or real-time rendering is the sub-field of computer graphics focused on producing and analyzing images in real time. The term can refer to anything from rendering an application's graphical user interface (GUI) to real-time image analysis, but is most often used in reference to interactive 3D computer graphics, typically using a graphics processing unit (GPU). One example of this concept is a video game that rapidly renders changing 3D environments to produce an illusion of motion. Computers have been capable of generating 2D images such as simple lines, images and polygons in real time since their invention. However, quickly rendering detailed 3D objects is a daunting task for traditional Von Neumann architecture-based systems. An early workaround to this problem was the use of sprites, 2D images that could imitate 3D graphics. Different techniques for rendering now exist, such as ray-tracing and rasterization. Using these techniques and advanced hardware, computers can now render images quickly enough to create the illusion of motion while simultaneously accepting user input. This means that the user can respond to rendered images in real time, producing an interactive experience. == Principles of real-time 3D computer graphics == The goal of computer graphics is to generate computer-generated images, or frames, using certain desired metrics. One such metric is the number of frames generated in a given second. Real-time computer graphics systems differ from traditional (i.e., non-real-time) rendering systems in that non-real-time graphics typically rely on ray tracing. In this process, millions or billions of rays are traced from the camera to the world for detailed rendering—this expensive operation can take hours or days to render a single frame. Real-time graphics systems must render each image in less than 1/30th of a second. Ray tracing is far too slow for these systems; instead, they employ the technique of z-buffer triangle rasterization. In this technique, every object is decomposed into individual primitives, usually triangles. Each triangle gets positioned, rotated and scaled on the screen, and rasterizer hardware (or a software emulator) generates pixels inside each triangle. These triangles are then decomposed into atomic units called fragments that are suitable for displaying on a display screen. The fragments are drawn on the screen using a color that is computed in several steps. For example, a texture can be used to "paint" a triangle based on a stored image, and then shadow mapping can alter that triangle's colors based on line-of-sight to light sources. === Video game graphics === Real-time graphics optimizes image quality subject to time and hardware constraints. GPUs and other advances increased the image quality that real-time graphics can produce. GPUs are capable of handling millions of triangles per frame, and modern DirectX/OpenGL class hardware is capable of generating complex effects, such as shadow volumes, motion blurring, and triangle generation, in real-time. The advancement of real-time graphics is evidenced in the progressive improvements between actual gameplay graphics and the pre-rendered cutscenes traditionally found in video games. Cutscenes are typically rendered in real-time—and may be interactive. Although the gap in quality between real-time graphics and traditional off-line graphics is narrowing, offline rendering remains much more accurate. === Advantages === Real-time graphics are typically employed when interactivity (e.g., player feedback) is crucial. When real-time graphics are used in films, the director has complete control of what has to be drawn on each frame, which can sometimes involve lengthy decision-making. Teams of people are typically involved in the making of these decisions. In real-time computer graphics, the user typically operates an input device to influence what is about to be drawn on the display. For example, when the user wants to move a character on the screen, the system updates the character's position before drawing the next frame. Usually, the display's response-time is far slower than the input device—this is justified by the immense difference between the (fast) response time of a human being's motion and the (slow) perspective speed of the human visual system. This difference has other effects too: because input devices must be very fast to keep up with human motion response, advancements in input devices (e.g., the current Wii remote) typically take much longer to achieve than comparable advancements in display devices. Another important factor controlling real-time computer graphics is the combination of physics and animation. These techniques largely dictate what is to be drawn on the screen—especially where to draw objects in the scene. These techniques help realistically imitate real world behavior (the temporal dimension, not the spatial dimensions), adding to the computer graphics' degree of realism. Real-time previewing with graphics software, especially when adjusting lighting effects, can increase work speed. Some parameter adjustments in fractal generating software may be made while viewing changes to the image in real time. == Rendering pipeline == The graphics rendering pipeline ("rendering pipeline" or simply "pipeline") is the foundation of real-time graphics. Its main function is to render a two-dimensional image in relation to a virtual camera, three-dimensional objects (an object that has width, length, and depth), light sources, lighting models, textures and more. === Architecture === The architecture of the real-time rendering pipeline can be divided into conceptual stages: application, geometry and rasterization. === Application stage === The application stage is responsible for generating "scenes", or 3D settings that are drawn to a 2D display. This stage is implemented in software that developers optimize for performance. This stage may perform processing such as collision detection, speed-up techniques, animation and force feedback, in addition to handling user input. Collision detection is an example of an operation that would be performed in the application stage. Collision detection uses algorithms to detect and respond to collisions between (virtual) objects. For example, the application may calculate new positions for the colliding objects and provide feedback via a force feedback device such as a vibrating game controller. The application stage also prepares graphics data for the next stage. This includes texture animation, animation of 3D models, animation via transforms, and geometry morphing. Finally, it produces primitives (points, lines, and triangles) based on scene information and feeds those primitives into the geometry stage of the pipeline. === Geometry stage === The geometry stage manipulates polygons and vertices to compute what to draw, how to draw it and where to draw it. Usually, these operations are performed by specialized hardware or GPUs. Variations across graphics hardware mean that the "geometry stage" may actually be implemented as several consecutive stages. ==== Model and view transformation ==== Before the final model is shown on the output device, the model is transformed onto multiple spaces or coordinate systems. Transformations move and manipulate objects by altering their vertices. Transformation is the general term for the four specific ways that manipulate the shape or position of a point, line or shape. ==== Lighting ==== In order to give the model a more realistic appearance, one or more light sources are usually established during transformation. However, this stage cannot be reached without first transforming the 3D scene into view space. In view space, the observer (camera) is typically placed at the origin. If using a right-handed coordinate system (which is considered standard), the observer looks in the direction of the negative z-axis with the y-axis pointing upwards and the x-axis pointing to the right. ==== Projection ==== Projection is a transformation used to represent a 3D model in a 2D space. The two main types of projection are orthographic projection (also called parallel) and perspective projection. The main characteristic of an orthographic projection is that parallel lines remain parallel after the transformation. Perspective projection utilizes the concept that if the distance between the observer and model increases, the model appears smaller than before. Essentially, perspective projection mimics human sight. ==== Clipping ==== Clipping is the process of removing primitives that are outside of the view box in order to facilitate the rasterizer stage. Once those primitives are removed, the primitives that remain will be drawn into new triangles that reach the next stage. ==== Screen mapping ==== The purpose of screen mapping is to find out the coordinates of the primitives during the clipping stage. ==== Rasterizer stage ==== The rasterizer

    Read more →
  • Hallucination (artificial intelligence)

    Hallucination (artificial intelligence)

    In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting, confabulation, or delusion) is a response generated by AI that contains false or misleading information presented as fact. This term draws a loose analogy with human psychology, where a hallucination typically involves false percepts. For example, a chatbot powered by large language models (LLMs), like ChatGPT, may embed plausible-sounding random falsehoods within its generated content. Detecting and mitigating errors and hallucinations pose significant challenges for practical deployment and reliability of LLMs in high-stakes scenarios, such as chip design, supply chain logistics, and medical diagnostics. Some software engineers and statisticians have criticized the specific term "AI hallucination" for unreasonably anthropomorphizing computers. Symbolic artificial intelligence models generally do not produce hallucinations, unlike large language models. == Term == === Origin === Since the 1980s, the term "hallucination" has been used in computer vision with a positive connotation to describe the process of adding detail to an image. For example, the task of generating high-resolution face images from low-resolution inputs is called face hallucination. The first documented use of the term "hallucination" in this sense is in the PhD thesis of Eric Mjolsness in 1986. A notable work is the face hallucination algorithm by Simon Baker and Takeo Kanade published in 1999. In the 2000s, hallucinations were described in statistical machine translation as a failure mode. Since the 2010s, the term has undergone a semantic shift to signify the generation of factually incorrect or misleading outputs by AI systems in tasks like machine translation and object detection. In 2015, hallucinations were identified in visual semantic role labeling tasks by Saurabh Gupta and Jitendra Malik. In 2015, computer scientist Andrej Karpathy used the term "hallucinated" in a blog post to describe his recurrent neural network (RNN) language model generating an incorrect citation link. In 2017, Google researchers used the term to describe the responses generated by neural machine translation (NMT) models when they are not related to the source text, and in 2018, the term was used in computer vision to describe instances where non-existent objects are erroneously detected because of adversarial attacks. In July 2021, Meta warned during its release of BlenderBot 2 that the system is prone to "hallucinations", which Meta defined as "confident statements that are not true". Following OpenAI's ChatGPT release in beta version in November 2022, some users complained that such chatbots often seem to pointlessly embed plausible-sounding random falsehoods within their generated content. Many news outlets, including The New York Times, started to use the term "hallucinations" to describe these models' frequently incorrect or inconsistent responses. In 2023, the Cambridge dictionary updated its definition of hallucination to include this new sense specific to the field of AI. Some researchers have highlighted a lack of consistency in how the term is used, but also identified several alternative terms in the literature, such as confabulations, fabrications, and factual errors. === Definitions and alternatives === Uses, definitions and characterizations of the term "hallucination" in the context of LLMs include: "a tendency to invent facts in moments of uncertainty" (OpenAI, May 2023) "a model's logical mistakes" (OpenAI, May 2023) "fabricating information entirely, but behaving as if spouting facts" (CNBC, May 2023) "making up information" (The Verge, February 2023) "probability distributions" (in scientific contexts) Journalist Benj Edwards, in Ars Technica, writes that the term "hallucination" is controversial, but that some form of metaphor remains necessary; Edwards suggests "confabulation" as an analogy for processes that involve "creative gap-filling". In July 2024, a White House report on fostering public trust in AI research mentioned hallucinations only in the context of reducing them. Notably, when acknowledging David Baker's Nobel Prize-winning work with AI-generated proteins, the Nobel committee avoided the term entirely, instead referring to "imaginative protein creation". Hicks, Humphries, and Slater, in their article in Ethics and Information Technology, argue that the output of LLMs is "bullshit" under Harry Frankfurt's definition of the term, and that the models are "in an important way indifferent to the truth of their outputs", with true statements only accidentally true, and false ones accidentally false. Some researchers also use the derogatory term "botshit", often referring to uncritical use of AI. === Criticism === In the scientific community, some researchers avoid the term "hallucination", seeing it as potentially misleading. It has been criticized by Usama Fayyad, executive director of the Institute for Experimental Artificial Intelligence at Northeastern University, on the grounds that it misleadingly personifies large language models and is vague. Mary Shaw said, "The current fashion for calling generative AI's errors 'hallucinations' is appalling. It anthropomorphizes the software, and it spins actual errors as somehow being idiosyncratic quirks of the system even when they're objectively incorrect." In Salon, statistician Gary Smith argues that LLMs "do not understand what words mean" and consequently that the term "hallucination" unreasonably anthropomorphizes the machine. Murray Shanahan argues that anthropomorphic framing of LLM capabilities, including terms like "hallucination", encourages users and researchers to attribute cognitive processes to systems that operate through statistical pattern completion, and advocates for more careful linguistic practices when discussing LLM behavior. Kristina Šekrst argues that applying psychological vocabulary to LLM outputs obscures the difference between the appearance of mental properties and their genuine presence. Förster & Skop assert that tech companies use the hallucination metaphor to anthropomorphize models and deflect responsibility for non-factual outputs. Some see the AI outputs not as illusory but as prospective—that is, having some chance of being true, similar to early-stage scientific conjectures. The term has also been criticized for its association with psychedelic drug experiences. == In natural language generation == In natural language generation, there are several reasons why natural language models hallucinate: === Hallucination from data === Hallucinations can stem from incomplete, inaccurate or unrepresentative data sets. === Modeling-related causes === The pre-training of generative pretrained transformers (GPT) involves predicting the next word. It incentivizes GPT models to "give a guess" about what the next word is, even when they lack information. Some researchers take an anthropomorphic perspective and posit that hallucinations arise from a tension between novelty and usefulness. For instance, Amabile and Pratt define human creativity as the production of novel and useful ideas. By extension, a focus on novelty in machine creativity can lead to the production of original but inaccurate responses—that is, falsehoods—whereas a focus on usefulness may result in memorized content lacking originality. By 2022, newspapers such as The New York Times expressed concern that, as the adoption of bots based on large language models continued to grow, unwarranted user confidence in bot output could lead to problems. === Interpretability research === In 2025, interpretability research by Anthropic on the LLM Claude identified internal circuits that cause it to decline to answer questions unless it knows the answer. By default, the circuit is active and the LLM doesn't answer. When the LLM has sufficient information, these circuits are inhibited and the LLM answers the question. Hallucinations were found to occur when this inhibition happens incorrectly, such as when Claude recognizes a name but lacks sufficient information about that person, causing it to generate plausible but untrue responses. === Examples === On 15 November 2022, researchers from Meta AI published Galactica, designed to "store, combine and reason about scientific knowledge". Content generated by Galactica came with the warning: "Outputs may be unreliable! Language Models are prone to hallucinate text." In one case, when asked to draft a paper on creating avatars, Galactica cited a fictitious paper from a real author who works in the relevant area. Meta withdrew Galactica on 17 November due to offensiveness and inaccuracy. OpenAI's ChatGPT, released in beta version to the public on November 30, 2022, was based on the foundation model GPT-3.5 (a revision of GPT-3). Professor Ethan Mollick of Wharton called it an "omniscient, eager-to-please intern who sometimes lies to you". Data scientist Teresa Kuba

    Read more →
  • Groover

    Groover

    Groover is an online platform, record label and distributor, connecting artists and musicians with music professionals and media outlets. The service was founded in 2018 in France and operates from offices in Paris and New York. The platform has over 3,000 active contacts, including SPIN Magazine and Sofar Sounds. Groover uses a micro-payment model. Among the platform's over 500,000 regular users are record labels such as Ninja Tune, Ba Da Bing Records, Dance To The Radio, Roche Musique, Wagram Music, Secret City Records, and artists including Bonobo, Michael Bolton, Aloe Blacc, Haddaway, Passenger, La Femme and Chinese Man. == History == Groover was launched at the MaMA Music Convention in October 2018. It was co-founded by Dorian Perron, Romain Palmieri, and Rafaël Cohen while they were students at UC Berkeley. Initially growing in France, the company has expanded to the United States, Canada, the United Kingdom, Brazil, Italy, and elsewhere in Europe. In March 2019, Groover was part of the Business France delegation at the South by Southwest (SXSW) festival. In June 2019, Groover raised €1.3 million from various angel investors. In April 2021, Groover acquired the platform Soonvibes, which had 70,000 users at the time, in order to strengthen its community in the electronic music space. In November 2021, Groover announced a €6 million funding round from Bpifrance Creative Industries and Partech. Between 2023 and 2025, Groover entered strategic partnerships with major artist service providers, including CD Baby, TuneCore, SoundCloud, UnitedMasters, Symphonic Distribution, Audiomack and SACEM. In February 2024, Groover announced a Series A funding round of $8 million from OneRagTime, Trind, Techmind, and Mozza Angels. == Function == Using a micro-payment system, professionals listen to tracks and provide written feedback. These professionals retain full editorial independence and are under no obligation to share the track or contact the artist. == Awards == 2nd Prize for Music Innovation 2023 from the Centre national de la musique (France) "Future Creator" Award at the Petit Poucet Competition 2019 Jury's Special Mention at the MaMA Invent 2019 competition 1st Prize for Digital Initiative in Culture, Communication & Media 2019 awarded by Audiens "Start-up of the Year" at the Social Music Awards 2020 French American Entrepreneurship Award 2022 at the French Consulate in New York

    Read more →
  • Vulnerability Discovery Model

    Vulnerability Discovery Model

    A Vulnerability Discovery Model (VDM) uses discovery event data with software reliability models for predicting the same. A thorough presentation of VDM techniques is available in. Numerous model implementations are available in the MCMCBayes open source repository. Several VDM examples include: Alhazmi-Malaiya: Time based model (Alhazmi-Malaiya Logistic (AML) model) Alhazmi-Malaiya: Effort based model Rescorla: Quadratic Model and Exponential Model Anderson: Thermodynamic Model Kim: Weibull Model Linear Model Hump-Shaped Model Independent and Dependent Model Vulnerability Discovery Modeling using Bayesian model averaging Multivariate Vulnerability Discovery Models

    Read more →
  • Cowrie (honeypot)

    Cowrie (honeypot)

    Cowrie is a medium interaction SSH and Telnet honeypot designed to log brute force attacks and shell interaction performed by an attacker. Cowrie also functions as an SSH and telnet proxy to observe attacker behavior to another system. Cowrie was developed from Kippo. == Reception == Cowrie has been referenced in published papers. The Book "Hands-On Ethical Hacking and Network Defense" includes Cowrie in a list of 5 commercial honeypots. === Prior uses === Discussing a honeypot effort called the Project Heisenberg Cloud by Rapid7, Bob Rudis, the company's chief data scientist, told eWEEK, "There are custom Rapid7-developed low- and medium-interaction honeypots used within the framework, along with open-source ones, such as Cowrie." Doug Rickert has experimented with the open-source Cowrie SSH honeypot and wrote about it on Medium. Putting up a simple honeypot isn't difficult, and there are many open-source products besides Cowrie, including the original Honeyd to MongoDB and NoSQL honeypots, to ones that emulate web servers. Some appear to be SCADA or other more advanced applications. === Best practices === Researchers at the SysAdmin, Audit, Network and Security (SANS) institute urged administrators and security researchers to run the latest version of Cowrie on a honeypot to monitor shifts in the type of passwords being scanned for and pattern of attacks on IoT devices. === Discussion and further resources === Attack Detection and Forensics Using Honeypot in an IoT Environment calls Cowrie a "medium interaction honeypot" and describes results from using it for 40 days to capture "all communicated sessions in log files." The book Advances on Data Science also devotes chapter two to "Cowrie Honeypot Dataset and Logging." ICCWS 2018 13th International Conference on Cyber Warfare and Security describes using Cowrie. On the Move to Meaningful Internet Systems: OTM 2019 Conferences includes details of using Cowrie. Splunk, a security tool that can receive information from honeypots, outlines how to set up a honeypot using the open-source Cowrie package.

    Read more →
  • Compute (machine learning)

    Compute (machine learning)

    In machine learning and deep learning, compute is the amount of computing power or computational resources required to train machine learning models and large language models. More broadly, compute is the computational power or resources necessary for a computer or computer program to function. == Definition == Compute is commonly defined as the amount of computing power or computational resources required to train machine learning and large language models. The term "compute" has also been more broadly applied to cloud computing, referencing processing power, memory, networking, storage, and other resources required for the computation of any program. Compute is measured in petaflop/s-days and is used to document AI training. A petaflop/s-day (pfs-day) consists of performing 1015 neural net operations per second for one day, or a total of about 1020 operations. The compute-time product serves as a mental convenience, similar to kilowatt-hour for energy. An amount of compute is meant to give an idea of the number of actual operations performed. == History == In a 2018 analysis titled "AI and compute", artificial intelligence company OpenAI introduced the concept of compute. OpenAI identified two eras of training AI systems in terms of compute-usage. From 1959 to 2012, compute roughly followed Moore’s law. Between 2012 and 2018, the amount of compute used in the largest AI training runs increased exponentially, growing by more than 300,000 times — roughly doubling every 3.4 months. By comparison, Moore’s Law doubled every two years over the same period. One of the largest models, released in 2020, used 600,000 times more computing power than the 2012 model. After 2020, compute growth began to slow down, with the compute needed for the largest AI models continuing to slow down in 2023. The notion of compute has become increasingly used from the mid-2020s onwards. == Compute growth and AI progress == Larger AI models trained on more data and using more computational resources, tend to perform better. This happens even if the algorithms themselves remain unchanged. As early as 2018, OpenAI noted the exponential increase in compute to be have a key role in AI progress. OpenAI considers three factors drive the advance of AI: algorithmic innovation, data, and the amount of compute available for training. AI models with more compute not only improve in the tasks they were trained on but can develop emergent abilities. Incremental improvements can lead to more abrupt leaps in capabilities. AI provider SpaceXAI said in 2026 that their AI progress is driven by compute and used it a key metric in the AI training of its supercomputer Colossus, the which contains 1 million GPUs. Anthropic has a contract of $1.25 billion per month with SpaceXAI to buy all the compute capacity at Colossus 1 data center. === Criticism and policy === Increasing, promoting or constraining progress in artificial intelligence has often be done via controlling the amount of compute. Policymarkers have enacted policies and provided support to make compute resources more accessible to domestic AI researchers. In a January 2022 report, the Center for Security and Emerging Technology (CSET) suggested to institutions that increasingly powerful and generalizable AI (AGI) will likely require other strategies than maximizing compute. Some AI researchers are also concerned that government might exclusively focus on scaling compute instead of other strategies. The CSET has reported on the various bottlenecks which could explain why deep learning needs for compute have slow down: training is expensive and training extremely large models generates traffic jams across many processors that are difficult to manage. there is a limited supply of AI chips (see AI chip memory shortage). CSET advances that the main resource is human capital, specifically talented researchers — according to a 2023 published survey of more than 400 AI researchers, academic and private sector workers. The survey found that AI researchers are not primarily or exclusively constrained by compute access. However, both academic and industry AI researchers equally report concerns that insufficient compute could prevent them from contributing meaningfully to AI research in the future. High compute users are more concerned about compute access. When asked about which resource provided by the government would be the most useful to them, some AI researchers select compute, other prefer grant funding. For this goal, CSET advised policymakers to ensure that even researchers with smaller budgets could effectively contribute to AI research. Other proposed strategies include using contemporary AI algorithms, managing modern AI infrastructure or focusing on interdisciplinary work between the AI field and other fields of computer science. A 2024 study on compute access found that academic-only AI research teams often have less compute intensive research topics, especially foundation models, compared to industry AI labs. As a consequence, academia is likely to play a smaller role in advancing such techniques. The researchers suggest nationally-sponsored computing infrastructure as well as open science initiatives to boost academic compute access. === Data === A 2022 study found that current large language models are significantly under-trained, a consequence of focusing on scaling language models whilst keeping the amount of training data constant. By training over 400 language models of various parameter and token size, they found that "for compute-optimal training", the model size and the number of training tokens should ideally be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

    Read more →
  • Autocommit

    Autocommit

    In the context of data management, autocommit is a mode of operation of a database connection. Each individual database interaction (i.e., each SQL statement) submitted through the database connection in autocommit mode will be executed in its own transaction that is implicitly committed. A SQL statement executed in autocommit mode cannot be rolled back. Autocommit mode incurs per-statement transaction overhead and can often lead to undesirable performance or resource utilization impact on the database. Nonetheless, in systems such as Microsoft SQL Server, as well as connection technologies such as ODBC and Microsoft OLE DB, autocommit mode is the default for all statements that change data, in order to ensure that individual statements will conform to the ACID (atomicity-consistency-isolation-durability) properties of transactions. The alternative to autocommit mode (non-autocommit) means that the SQL client application itself is responsible for ending transactions explicitly via the commit or rollback SQL commands. Non-autocommit mode enables grouping of multiple data manipulation SQL commands into a single atomic transaction. Some DBMS (e.g. MariaDB) force autocommit for every DDL statement, even in non-autocommit mode. In this case, before each DDL statement, previous DML statements in transaction are autocommitted. Each DDL statement is executed in its own new autocommit transaction.

    Read more →
  • Telebirr

    Telebirr

    Telebirr (Amharic: ቴሌብር) is a mobile payment service developed and was launched by Ethio telecom, the state owned telecommunication and Internet service provider in Ethiopia. It took five months to develop the end-to-end service. It facilitates the delivery of cashless transactions. The platform deployed currently has the capacity of processing up to 100 transactions per second (TPS) and can be scaled up to 1000 TPS. The service is accessible via SMS, USSD, and smartphone applications. Telebirr works in five languages. == Services == Though the service is fully accessible for any customer of Ethio telecom, the users need to register through the mobile application called Telebirr or using an authorized agent or Ethio telecom shop or Unstructured Supplementary Service Data (USSD), 127# nationally. However, Telebirr also provides a “quick registration” by using any information that already exists in Ethio telecom's system.

    Read more →