AI Headshot Generator

AI Headshot Generator — hands-on reviews, top picks, pricing, pros and cons and a practical how-to guide on Aizhi.

  • Anomaly Detection at Multiple Scales

    Anomaly Detection at Multiple Scales

    Anomaly Detection at Multiple Scales, or ADAMS was a $35 million DARPA project designed to identify patterns and anomalies in very large data sets. It is under DARPA's Information Innovation office and began in 2011 and ended in August 2014 The project was intended to detect and prevent insider threats such as "a soldier in good mental health becoming homicidal or suicidal", an "innocent insider becoming malicious", or "a government employee [who] abuses access privileges to share classified information". Specific cases mentioned are Nadal Malik Hasan and WikiLeaks source Chelsea Manning. Commercial applications may include finance. The intended recipients of the system output are operators in the counterintelligence agencies. A final report was published on May 11, 2015, detailing a system known as Anomaly Detection Engine for Networks, or ADEN, developed by the University of Maryland, College Park, whose goal was to "identify malicious users within a network." Using multiple datasets from Wikipedia, Slashdot, and others, researchers were able to identify vandals and malicious users on a website using both conventional algorithms and artificial intelligence. The Proactive Discovery of Insider Threats Using Graph Analysis and Learning was part of the ADAMS project. The Georgia Tech team includes noted high-performance computing researcher David Bader (computer scientist).

    Read more →
  • Medoid

    Medoid

    Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and gene expression (where while the data is sparse the medoid need not be). These are also of interest while wanting to find a representative using some distance other than squared euclidean distance (for instance in movie-ratings). For some data sets there may be more than one medoid, as with medians. A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but works when a mean or centroid is not definable. This algorithm basically works as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, data are clustered according to the medoid they are most similar to. Fourth, the medoid set is optimized via an iterative process. Note that a medoid is not equivalent to a median, a geometric median, or centroid. A median is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a norm (such as the Manhattan distance or Euclidean distance). A geometric median is defined in any dimension, but unlike a medoid, it is not necessarily a point from within the original dataset. == Definition == Let X := { x 1 , x 2 , … , x n } {\textstyle {\mathcal {X}}:=\{x_{1},x_{2},\dots ,x_{n}\}} be a set of n {\textstyle n} points in a space with a distance function d. Medoid is defined as x medoid = arg ⁡ min y ∈ X ∑ i = 1 n d ( y , x i ) . {\displaystyle x_{\text{medoid}}=\arg \min _{y\in {\mathcal {X}}}\sum _{i=1}^{n}d(y,x_{i}).} == Clustering with medoids == Medoids are a popular replacement for the cluster mean when the distance function is not (squared) Euclidean distance, or not even a metric (as the medoid does not require the triangle inequality). When partitioning the data set into clusters, the medoid of each cluster can be used as a representative of each cluster. Clustering algorithms based on the idea of medoids include: Partitioning Around Medoids (PAM), the standard k-medoids algorithm Hierarchical Clustering Around Medoids (HACAM), which uses medoids in hierarchical clustering == Algorithms to compute the medoid of a set == From the definition above, it is clear that the medoid of a set X {\displaystyle {\mathcal {X}}} can be computed after computing all pairwise distances between points in the ensemble. This would take O ( n 2 ) {\textstyle O(n^{2})} distance evaluations (with n = | X | {\displaystyle n=|{\mathcal {X}}|} ). In the worst case, one can not compute the medoid with fewer distance evaluations. However, there are many approaches that allow us to compute medoids either exactly or approximately in sub-quadratic time under different statistical models. If the points lie on the real line, computing the medoid reduces to computing the median which can be done in O ( n ) {\textstyle O(n)} by Quick-select algorithm of Hoare. However, in higher dimensional real spaces, no linear-time algorithm is known. RAND is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of O ( n log ⁡ n ϵ 2 ) {\textstyle O\left({\frac {n\log n}{\epsilon ^{2}}}\right)} distance computations to approximate the medoid within a factor of ( 1 + ϵ Δ ) {\textstyle (1+\epsilon \Delta )} with high probability, where Δ {\textstyle \Delta } is the maximum distance between two points in the ensemble. Note that RAND is an approximation algorithm, and moreover Δ {\textstyle \Delta } may not be known apriori. RAND was leveraged by TOPRANK which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points exactly, and picks the minimum of those. TOPRANK needs O ( n 5 3 log 4 3 ⁡ n ) {\textstyle O(n^{\frac {5}{3}}\log ^{\frac {4}{3}}n)} distance computations to find the exact medoid with high probability under a distributional assumption on the average distances. trimed presents an algorithm to find the medoid with O ( n 3 2 2 Θ ( d ) ) {\textstyle O(n^{\frac {3}{2}}2^{\Theta (d)})} distance evaluations under a distributional assumption on the points. The algorithm uses the triangle inequality to cut down the search space. Meddit leverages a connection of the medoid computation with multi-armed bandits and uses an upper-Confidence-bound type of algorithm to get an algorithm which takes O ( n log ⁡ n ) {\textstyle O(n\log n)} distance evaluations under statistical assumptions on the points. Correlated Sequential Halving also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm is able to provably yield drastic improvement (usually around 1-2 orders of magnitude) in both number of distance computations needed and wall clock time. == Implementations == An implementation of RAND, TOPRANK, and trimed can be found here. An implementation of Meddit can be found here and here. An implementation of Correlated Sequential Halving can be found here. == Medoids in text and natural language processing (NLP) == Medoids can be applied to various text and NLP tasks to improve the efficiency and accuracy of analyses. By clustering text data based on similarity, medoids can help identify representative examples within the dataset, leading to better understanding and interpretation of the data. === Text clustering === Text clustering is the process of grouping similar text or documents together based on their content. Medoid-based clustering algorithms can be employed to partition large amounts of text into clusters, with each cluster represented by a medoid document. This technique helps in organizing, summarizing, and retrieving information from large collections of documents, such as in search engines, social media analytics and recommendation systems. === Text summarization === Text summarization aims to produce a concise and coherent summary of a larger text by extracting the most important and relevant information. Medoid-based clustering can be used to identify the most representative sentences in a document or a group of documents, which can then be combined to create a summary. This approach is especially useful for extractive summarization tasks, where the goal is to generate a summary by selecting the most relevant sentences from the original text. === Sentiment analysis === Sentiment analysis involves determining the sentiment or emotion expressed in a piece of text, such as positive, negative, or neutral. Medoid-based clustering can be applied to group text data based on similar sentiment patterns. By analyzing the medoid of each cluster, researchers can gain insights into the predominant sentiment of the cluster, helping in tasks such as opinion mining, customer feedback analysis, and social media monitoring. === Topic modeling === Topic modeling is a technique used to discover abstract topics that occur in a collection of documents. Medoid-based clustering can be applied to group documents with similar themes or topics. By analyzing the medoids of these clusters, researchers can gain an understanding of the underlying topics in the text corpus, facilitating tasks such as document categorization, trend analysis, and content recommendation. === Techniques for measuring text similarity in medoid-based clustering === When applying medoid-based clustering to text data, it is essential to choose an appropriate similarity measure to compare documents effectively. Each technique has its advantages and limitations, and the choice of the similarity measure should be based on the specific requirements and characteristics of the text data being analyzed. The following are common techniques for measuring text similarity in medoid-based clustering: ==== Cosine similarity ==== Cosine similarity is a widely used measure to compare the similarity between two pieces of text. It calculates the cosine of the angle between two document vectors in a high-dimensional space. Cosine similarity ranges between -1 and 1, where a value closer to 1 indicates higher similarity, and a value closer to -1 indicates lower similarity. By visualizing two lines originating from the origin and extending to the respective points of interest, and then measuring the angle between these lines, one can determine the similarity between the associated points. Cosine similarity is less affected by document length, so it may be better at producing medoids that are representative of the content of a cluster instead of the lengt

    Read more →
  • Analogical modeling

    Analogical modeling

    Analogical modeling (AM) is a formal theory of exemplar based analogical reasoning, proposed by Royal Skousen, professor of Linguistics and English language at Brigham Young University in Provo, Utah. It is applicable to language modeling and other categorization tasks. Analogical modeling is related to connectionism and nearest neighbor approaches, in that it is data-based rather than abstraction-based; but it is distinguished by its ability to cope with imperfect datasets (such as caused by simulated short term memory limits) and to base predictions on all relevant segments of the dataset, whether near or far. In language modeling, AM has successfully predicted empirically valid forms for which no theoretical explanation was known (see the discussion of Finnish morphology in Skousen et al. 2002). == Implementation == === Overview === An exemplar-based model consists of a general-purpose modeling engine and a problem-specific dataset. Within the dataset, each exemplar (a case to be reasoned from, or an informative past experience) appears as a feature vector: a row of values for the set of parameters that define the problem. For example, in a spelling-to-sound task, the feature vector might consist of the letters of a word. Each exemplar in the dataset is stored with an outcome, such as a phoneme or phone to be generated. When the model is presented with a novel situation (in the form of an outcome-less feature vector), the engine algorithmically sorts the dataset to find exemplars that helpfully resemble it, and selects one, whose outcome is the model's prediction. The particulars of the algorithm distinguish one exemplar-based modeling system from another. In AM, we think of the feature values as characterizing a context, and the outcome as a behavior that occurs within that context. Accordingly, the novel situation is known as the given context. Given the known features of the context, the AM engine systematically generates all contexts that include it (all of its supracontexts), and extracts from the dataset the exemplars that belong to each. The engine then discards those supracontexts whose outcomes are inconsistent (this measure of consistency will be discussed further below), leaving an analogical set of supracontexts, and probabilistically selects an exemplar from the analogical set with a bias toward those in large supracontexts. This multilevel search exponentially magnifies the likelihood of a behavior's being predicted as it occurs reliably in settings that specifically resemble the given context. === Analogical modeling in detail === AM performs the same process for each case it is asked to evaluate. The given context, consisting of n variables, is used as a template to generate 2 n {\displaystyle 2^{n}} supracontexts. Each supracontext is a set of exemplars in which one or more variables have the same values that they do in the given context, and the other variables are ignored. In effect, each is a view of the data, created by filtering for some criteria of similarity to the given context, and the total set of supracontexts exhausts all such views. Alternatively, each supracontext is a theory of the task or a proposed rule whose predictive power needs to be evaluated. It is important to note that the supracontexts are not equal peers one with another; they are arranged by their distance from the given context, forming a hierarchy. If a supracontext specifies all of the variables that another one does and more, it is a subcontext of that other one, and it lies closer to the given context. (The hierarchy is not strictly branching; each supracontext can itself be a subcontext of several others, and can have several subcontexts.) This hierarchy becomes significant in the next step of the algorithm. The engine now chooses the analogical set from among the supracontexts. A supracontext may contain exemplars that only exhibit one behavior; it is deterministically homogeneous and is included. It is a view of the data that displays regularity, or a relevant theory that has never yet been disproven. A supracontext may exhibit several behaviors, but contain no exemplars that occur in any more specific supracontext (that is, in any of its subcontexts); in this case it is non-deterministically homogeneous and is included. Here there is no great evidence that a systematic behavior occurs, but also no counterargument. Finally, a supracontext may be heterogeneous, meaning that it exhibits behaviors that are found in a subcontext (closer to the given context), and also behaviors that are not. Where the ambiguous behavior of the nondeterministically homogeneous supracontext was accepted, this is rejected because the intervening subcontext demonstrates that there is a better theory to be found. The heterogeneous supracontext is therefore excluded. This guarantees that we see an increase in meaningfully consistent behavior in the analogical set as we approach the given context. With the analogical set chosen, each appearance of an exemplar (for a given exemplar may appear in several of the analogical supracontexts) is given a pointer to every other appearance of an exemplar within its supracontexts. One of these pointers is then selected at random and followed, and the exemplar to which it points provides the outcome. This gives each supracontext an importance proportional to the square of its size, and makes each exemplar likely to be selected in direct proportion to the sum of the sizes of all analogically consistent supracontexts in which it appears. Then, of course, the probability of predicting a particular outcome is proportional to the summed probabilities of all the exemplars that support it. (Skousen 2002, in Skousen et al. 2002, pp. 11–25, and Skousen 2003, both passim) === Formulas === Given a context with n {\displaystyle n} elements: total number of pairings: n 2 {\displaystyle n^{2}} number of agreements for outcome i: n i 2 {\displaystyle n_{i}^{2}} number of disagreements for outcome i: n i ( n − n i ) {\displaystyle n_{i}(n-n_{i})} total number of agreements: ∑ n i 2 {\displaystyle \sum {n_{i}^{2}}} total number of disagreements: ∑ n i ( n − n i ) = n 2 − ∑ n i 2 {\displaystyle \sum {n_{i}(n-n_{i})}=n^{2}-\sum {n_{i}^{2}}} === Example === This terminology is best understood through an example. In the example used in the second chapter of Skousen (1989), each context consists of three variables with potential values 0-3 Variable 1: 0,1,2,3 Variable 2: 0,1,2,3 Variable 3: 0,1,2,3 The two outcomes for the dataset are e and r, and the exemplars are: 3 1 0 e 0 3 2 r 2 1 0 r 2 1 2 r 3 1 1 r We define a network of pointers like so: The solid lines represent pointers between exemplars with matching outcomes; the dotted lines represent pointers between exemplars with non-matching outcomes. The statistics for this example are as follows: n = 5 {\displaystyle n=5} n r = 4 {\displaystyle n_{r}=4} n e = 1 {\displaystyle n_{e}=1} total number of pairings: n 2 = 25 {\displaystyle n^{2}=25} number of agreements for outcome r: n r 2 = 16 {\displaystyle n_{r}^{2}=16} number of agreements for outcome e: n e 2 = 1 {\displaystyle n_{e}^{2}=1} number of disagreements for outcome r: n r ( n − n r ) = 4 {\displaystyle n_{r}(n-n_{r})=4} number of disagreements for outcome e: n e ( n − n e ) = 4 {\displaystyle n_{e}(n-n_{e})=4} total number of agreements: n r 2 + n e 2 = 17 {\displaystyle n_{r}^{2}+n_{e}^{2}=17} total number of disagreements: n r ( n − n r ) + n e ( n − n e ) = n 2 − ( n r 2 + n e 2 ) = 8 {\displaystyle n_{r}(n-n_{r})+n_{e}(n-n_{e})=n^{2}-(n_{r}^{2}+n_{e}^{2})=8} uncertainty or fraction of disagreement: 8 / 25 = .32 {\displaystyle 8/25=.32} Behavior can only be predicted for a given context; in this example, let us predict the outcome for the context "3 1 2". To do this, we first find all of the contexts containing the given context; these contexts are called supracontexts. We find the supracontexts by systematically eliminating the variables in the given context; with m variables, there will generally be 2 m {\displaystyle 2^{m}} supracontexts. The following table lists each of the sub- and supracontexts; x means "not x", and - means "anything". These contexts are shown in the venn diagram below: The next step is to determine which exemplars belong to which contexts in order to determine which of the contexts are homogeneous. The table below shows each of the subcontexts, their behavior in terms of the given exemplars, and the number of disagreements within the behavior: Analyzing the subcontexts in the table above, we see that there is only 1 subcontext with any disagreements: "3 1 2", which in the dataset consists of "3 1 0 e" and "3 1 1 r". There are 2 disagreements in this subcontext; 1 pointing from each of the exemplars to the other (see the pointer network pictured above). Therefore, only supracontexts containing this subcontext will contain any disagreements. We use a simple rule to identify the homogeneous supraco

    Read more →
  • Facial recognition system

    Facial recognition system

    A facial recognition system is a technology potentially capable of matching a human face from a digital image or a video frame against a database of faces. Such a system is typically employed to authenticate users through ID verification services, and works by pinpointing and measuring facial features from a given image. Development on similar systems began in the 1960s as a form of computer application. Since their inception, facial recognition systems have seen wider uses in recent times on smartphones and in other forms of technology, such as robotics. Because computerized facial recognition involves the measurement of a human's physiological characteristics, facial recognition systems are categorized as biometrics. Although the accuracy of facial recognition systems as a biometric technology is lower than iris recognition, fingerprint image acquisition, palm recognition or voice recognition, it is widely adopted due to its contactless process. Facial recognition systems have been deployed in advanced human–computer interaction, video surveillance, law enforcement, passenger screening, decisions on employment and housing, and automatic indexing of images. Facial recognition systems are employed throughout the world today by governments and private companies. Their effectiveness varies, and some systems have previously been scrapped because of their ineffectiveness. The use of facial recognition systems has also raised controversy, with claims that the systems violate citizens' privacy, commonly make incorrect identifications, encourage gender norms and racial profiling, and do not protect important biometric data. The appearance of synthetic media such as deepfakes has also raised concerns about its security. These claims have led to the ban of facial recognition systems in several cities in the United States. Growing societal concerns led social networking company Meta Platforms to shut down its Facebook facial recognition system in 2021, deleting the face-scan data of more than one billion users. The change represented one of the largest shifts in facial recognition usage in the technology's history. IBM also stopped offering facial recognition technology due to similar concerns. == History of facial recognition technology == Automated facial recognition was pioneered in the 1960s by Woody Bledsoe, Helen Chan Wolf, and Charles Bisson, whose work focused on teaching computers to recognize human faces. Their early facial recognition project was dubbed "man-machine" because a human first needed to establish the coordinates of facial features in a photograph before they could be used by a computer for recognition. Using a graphics tablet, a human would pinpoint facial features coordinates, such as the pupil centers, the inside and outside corners of eyes, and the widows peak in the hairline. The coordinates were used to calculate 20 individual distances, including the width of the mouth and of the eyes. A human could process about 40 pictures an hour, building a database of these computed distances. A computer would then automatically compare the distances for each photograph, calculate the difference between the distances, and return the closed records as a possible match. In 1970, Takeo Kanade publicly demonstrated a face-matching system that located anatomical features such as the chin and calculated the distance ratio between facial features without human intervention. Later tests revealed that the system could not always reliably identify facial features. Nonetheless, interest in the subject grew and in 1977 Kanade published the first detailed book on facial recognition technology. In 1993, the Defense Advanced Research Project Agency (DARPA) and the Army Research Laboratory (ARL) established the face recognition technology program FERET to develop "automatic face recognition capabilities" that could be employed in a productive real life environment "to assist security, intelligence, and law enforcement personnel in the performance of their duties." Face recognition systems that had been trialled in research labs were evaluated. The FERET tests found that while the performance of existing automated facial recognition systems varied, a handful of existing methods could viably be used to recognize faces in still images taken in a controlled environment. The FERET tests spawned three US companies that sold automated facial recognition systems. Vision Corporation and Miros Inc were founded in 1994, by researchers who used the results of the FERET tests as a selling point. Viisage Technology was established by an identification card defense contractor in 1996 to commercially exploit the rights to the facial recognition algorithm developed by Alex Pentland at MIT. Following the 1993 FERET face-recognition vendor test, the Department of Motor Vehicles (DMV) offices in West Virginia and New Mexico became the first DMV offices to use automated facial recognition systems to prevent people from obtaining multiple driving licenses using different names. Driver's licenses in the United States were at that point a commonly accepted form of photo identification. DMV offices across the United States were undergoing a technological upgrade and were in the process of establishing databases of digital ID photographs. This enabled DMV offices to deploy the facial recognition systems on the market to search photographs for new driving licenses against the existing DMV database. DMV offices became one of the first major markets for automated facial recognition technology and introduced US citizens to facial recognition as a standard method of identification. The increase of the US prison population in the 1990s prompted U.S. states to established connected and automated identification systems that incorporated digital biometric databases, in some instances this included facial recognition. In 1999, Minnesota incorporated the facial recognition system FaceIT by Visionics into a mug shot booking system that allowed police, judges and court officers to track criminals across the state. Until the 1990s, facial recognition systems were developed primarily by using photographic portraits of human faces. Research on face recognition to reliably locate a face in an image that contains other objects gained traction in the early 1990s with the principal component analysis (PCA). The PCA method of face detection is also known as Eigenface and was developed by Matthew Turk and Alex Pentland. Turk and Pentland combined the conceptual approach of the Karhunen–Loève theorem and factor analysis, to develop a linear model. Eigenfaces are determined based on global and orthogonal features in human faces. A human face is calculated as a weighted combination of a number of Eigenfaces. Because few Eigenfaces were used to encode human faces of a given population, Turk and Pentland's PCA face detection method greatly reduced the amount of data that had to be processed to detect a face. Pentland in 1994 defined Eigenface features, including eigen eyes, eigen mouths and eigen noses, to advance the use of PCA in facial recognition. In 1997, the PCA Eigenface method of face recognition was improved upon using linear discriminant analysis (LDA) to produce Fisherfaces. LDA Fisherfaces became dominantly used in PCA feature based face recognition. While Eigenfaces were also used for face reconstruction. In these approaches no global structure of the face is calculated which links the facial features or parts. Purely feature based approaches to facial recognition were overtaken in the late 1990s by the Bochum system, which used Gabor filter to record the face features and computed a grid of the face structure to link the features. Christoph von der Malsburg and his research team at the University of Bochum developed Elastic Bunch Graph Matching in the mid-1990s to extract a face out of an image using skin segmentation. By 1997, the face detection method developed by Malsburg outperformed most other facial detection systems on the market. The so-called "Bochum system" of face detection was sold commercially on the market as ZN-Face to operators of airports and other busy locations. The software was "robust enough to make identifications from less-than-perfect face views. It can also often see through such impediments to identification as mustaches, beards, changed hairstyles and glasses—even sunglasses". Real-time face detection in video footage became possible in 2001 with the Viola–Jones object detection framework for faces. Paul Viola and Michael Jones combined their face detection method with the Haar-like feature approach to object recognition in digital images to launch AdaBoost, the first real-time frontal-view face detector. By 2015, the Viola–Jones algorithm had been implemented using small low power detectors on handheld devices and embedded systems. Therefore, the Viola–Jones algorithm has not only broadened the practical application of face recognition systems but

    Read more →
  • Graphics processing unit

    Graphics processing unit

    A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a component on a discrete graphics card or embedded on motherboards, mobile phones, personal computers, workstations, and game consoles. GPUs are increasingly being used for artificial intelligence (AI) processing due to linear algebra acceleration, which is also used extensively in graphics processing. Although there is no single definition of the term, and it may be used to describe any video display system, in modern use a GPU includes the ability to internally perform the calculations needed for various graphics tasks, like rotating and scaling 3D images, and often the additional ability to run custom programs known as shaders. This contrasts with earlier graphics controllers known as video display controllers which had no internal calculation capabilities, or blitters, which performed only basic memory movement operations. The modern GPU emerged during the 1990s, adding the ability to perform operations like drawing lines and text without CPU help, and later adding 3D functionality. Graphics functions are generally independent and this lends these tasks to being implemented on separate calculation engines. Modern GPUs include hundreds, or thousands, of calculation units. This made them useful for non-graphic calculations involving embarrassingly parallel problems due to their parallel structure. The ability of GPUs to rapidly perform vast numbers of calculations has led to their adoption in diverse fields including artificial intelligence (AI) where they excel at handling data-intensive and computationally demanding tasks. Other non-graphical uses include the training of neural networks and cryptocurrency mining. == History == === 1960s === Dedicated 3D graphics hardware dates back to graphic terminals such as the Adage AGT-30 from 1967 with analog matrix processors. In 1969 Evans & Sutherland (E&S) introduced the Line Drawing System-1 (LDS-1), which was the first all-digital system to provide matrix multiplication. Also in 1969, the low-cost graphics terminal IMLAC PDS-1 was introduced. It later saw use as an early 3D gaming machine with the likes of Maze War. === 1970s === In professional hardware, in 1972 PLATO IV system becomes operational at the University of Illinois Urbana-Champaign. Between around 1973 and 1978, several networked multiplayer wireframe 3D games are implemented and popularized by users of the system. Also in 1972, the E&S Continuous Tone 1 (CT1) "Watkins box" system (consisting of an E&S LDS-2 and Shaded Picture System) is delivered to Case Western Reserve University. It offered the first real-time Gouraud shading. In 1975, a joint effort between Evans & Sutherland Computer Corporation and the University of Utah's computer graphics department results in the first ever MOSFET video framebuffer, capable of color and smooth shading. E&S Continuous Tone 3 (CT3) system was delivered in 1977 to Lufthansa for pilot training using computer simulation. It was the first graphics system capable of real-time texture mapping. Ikonas made graphics systems with 8- and 24-bit graphics and 3D acceleration in the late 70s. Arcade system boards have used specialized 2D graphics circuits since the 1970s. In early video game hardware, RAM for frame buffers was expensive, so video chips composited data together as the display was being scanned out on the monitor. A specialized barrel shifter circuit helped the CPU animate the framebuffer graphics for various 1970s arcade video games from Midway and Taito, such as Gun Fight (1975), Sea Wolf (1976), and Space Invaders (1978). The Namco Galaxian arcade system in 1979 used specialized graphics hardware that supported RGB color, multi-colored sprites, and tilemap backgrounds. The Galaxian hardware was widely used during the golden age of arcade video games, by game companies such as Namco, Centuri, Gremlin, Irem, Konami, Midway, Nichibutsu, Sega, and Taito. The Atari 2600 in 1977 used a video shifter called the Television Interface Adaptor. Atari 8-bit computers (1979) had ANTIC, a video processor which interpreted instructions describing a "display list"—the way the scan lines map to specific bitmapped or character modes and where the memory is stored (so there did not need to be a contiguous frame buffer). 6502 machine code subroutines could be triggered on scan lines by setting a bit on a display list instruction. ANTIC also supported smooth vertical and horizontal scrolling independent of the CPU. === 1980s === In the 1980s significant advancements were made in professional 3D graphics hardware. Perhaps most impactful was the 1981 development of the Geometry Engine, a VLSI vector processor ASIC designed by Jim Clark and Marc Hannah at Stanford University. This processor is the forerunner of modern tensor cores and other similar processors marketed for graphics and AI. The Geometry Engine went on to be used in Silicon Graphics workstations for many years. Silicon Graphics's first product, shipped in November 1983, was the IRIS 1000, a terminal with hardware-accelerated 3D graphics based on the Geometry Engine. The Geometry Engine was capable of approximately 6 million operations per second. The 1981 NEC μPD7220 was the first implementation of a personal computer graphics display processor as a single large-scale integration (LSI) integrated circuit chip. This enabled the design of low-cost, high-performance video graphics cards such as those from Number Nine Visual Technology. It became the best-known GPU until the mid-1980s. It was the first fully integrated VLSI (very large-scale integration) metal–oxide–semiconductor (NMOS) graphics display processor for PCs, supported up to 1024×1024 resolution, and laid the foundations for the PC graphics market. It was used in a number of graphics cards and was licensed for clones such as the Intel 82720, the first of Intel's graphics processing units. The Williams Electronics arcade games Robotron: 2084, Joust, Sinistar, and Bubbles, all released in 1982, contain custom blitter chips for operating on 16-color bitmaps. In 1984, Hitachi released the ARTC HD63484, the first major CMOS graphics processor for personal computers. The ARTC could display up to 4K resolution when in monochrome mode. It was used in a number of graphics cards and terminals during the late 1980s. In 1985, the Amiga was released with a custom graphics chip called Agnus including a blitter for bitmap manipulation, line drawing, and area fill. It also included a coprocessor with its own simple instruction set, that was capable of manipulating graphics hardware registers in sync with the video beam (e.g. for per-scanline palette switches, sprite multiplexing, and hardware windowing), or driving the blitter. Also in 1985, IBM released the Professional Graphics Controller, designed by later to be Nvidia co-founder Curtis Priem, which was a rudimentary 3D card with 640 × 480 256-color graphics which used a dedicated CPU to draw graphics independently of the main system. It was used as the basis of cards by a number of makers (including Matrox) and its analog RGB signaling led directly to the VGA video standard. Priem later in the 80s worked on the influential Sun Microsystems GX (also known as cgsix) accelerated 2D graphics card. In 1986, Texas Instruments released the TMS34010, the first fully programmable graphics processor. It could run general-purpose code but also had a graphics-oriented instruction set. During 1990–1992, this chip became the basis of the Texas Instruments Graphics Architecture ("TIGA") Windows accelerator cards. Following in 1987, the IBM 8514 graphics system was released. It was one of the first video cards for IBM PC compatibles that implemented fixed-function 2D primitives in electronic hardware. Sharp's X68000, released in 1987, used a custom graphics chipset with a 65,536 color palette and hardware support for sprites, scrolling, and multiple playfields. It served as a development machine for Capcom's CP System arcade board. Fujitsu's FM Towns computer, released in 1989, had support for a 16,777,216 color palette. For context, IBM also introduced its Video Graphics Array (VGA) display system in 1987, with a maximum resolution of 640 × 480 pixels. Unlike 8514/A, VGA had no hardware acceleration features. In November 1988, NEC Home Electronics announced its creation of the Video Electronics Standards Association (VESA) to develop and promote a Super VGA (SVGA) computer display standard as a successor to VGA. Super VGA enabled graphics display resolutions up to 800 × 600 pixels, a 56% increase. In 1988 SGI sold IRIS workstation graphics with 10-12 Geometry Engines and introduced the IrisVision add-in board for IBM MicroChannel bus (RS/6000) based on the Geometry Engine as well. In 1988 as well, the first dedicated polygonal 3D graphics boards in arcade machines were introduced wit

    Read more →
  • Jackknife variance estimates for random forest

    Jackknife variance estimates for random forest

    In statistics, jackknife variance estimates for random forest are a way to estimate the variance in random forest models, in order to eliminate the bootstrap effects. == Jackknife variance estimates == The sampling variance of bagged learners is: V ( x ) = V a r [ θ ^ ∞ ( x ) ] {\displaystyle V(x)=Var[{\hat {\theta }}^{\infty }(x)]} Jackknife estimates can be considered to eliminate the bootstrap effects. The jackknife variance estimator is defined as: V ^ j = n − 1 n ∑ i = 1 n ( θ ^ ( − i ) − θ ¯ ) 2 {\displaystyle {\hat {V}}_{j}={\frac {n-1}{n}}\sum _{i=1}^{n}({\hat {\theta }}_{(-i)}-{\overline {\theta }})^{2}} In some classification problems, when random forest is used to fit models, jackknife estimated variance is defined as: V ^ j = n − 1 n ∑ i = 1 n ( t ¯ ( − i ) ⋆ ( x ) − t ¯ ⋆ ( x ) ) 2 {\displaystyle {\hat {V}}_{j}={\frac {n-1}{n}}\sum _{i=1}^{n}({\overline {t}}_{(-i)}^{\star }(x)-{\overline {t}}^{\star }(x))^{2}} Here, t ⋆ {\displaystyle t^{\star }} denotes a decision tree after training, t ( − i ) ⋆ {\displaystyle t_{(-i)}^{\star }} denotes the result based on samples without i t h {\displaystyle ith} observation. == Examples == E-mail spam problem is a common classification problem, in this problem, 57 features are used to classify spam e-mail and non-spam e-mail. Applying IJ-U variance formula to evaluate the accuracy of models with m=15,19 and 57. The results shows in paper( Confidence Intervals for Random Forests: The jackknife and the Infinitesimal Jackknife ) that m = 57 random forest appears to be quite unstable, while predictions made by m=5 random forest appear to be quite stable, this results is corresponding to the evaluation made by error percentage, in which the accuracy of model with m=5 is high and m=57 is low. Here, accuracy is measured by error rate, which is defined as: E r r o r R a t e = 1 N ∑ i = 1 N ∑ j = 1 M y i j , {\displaystyle ErrorRate={\frac {1}{N}}\sum _{i=1}^{N}\sum _{j=1}^{M}y_{ij},} Here N is also the number of samples, M is the number of classes, y i j {\displaystyle y_{ij}} is the indicator function which equals 1 when i t h {\displaystyle ith} observation is in class j, equals 0 when in other classes. No probability is considered here. There is another method which is similar to error rate to measure accuracy: l o g l o s s = 1 N ∑ i = 1 N ∑ j = 1 M y i j l o g ( p i j ) {\displaystyle logloss={\frac {1}{N}}\sum _{i=1}^{N}\sum _{j=1}^{M}y_{ij}log(p_{ij})} Here N is the number of samples, M is the number of classes, y i j {\displaystyle y_{ij}} is the indicator function which equals 1 when i t h {\displaystyle ith} observation is in class j, equals 0 when in other classes. p i j {\displaystyle p_{ij}} is the predicted probability of i t h {\displaystyle ith} observation in class j {\displaystyle j} .This method is used in Kaggle These two methods are very similar. == Modification for bias == When using Monte Carlo MSEs for estimating V I J ∞ {\displaystyle V_{IJ}^{\infty }} and V J ∞ {\displaystyle V_{J}^{\infty }} , a problem about the Monte Carlo bias should be considered, especially when n is large, the bias is getting large: E [ V ^ I J B ] − V ^ I J ∞ ≈ n ∑ b = 1 B ( t b ⋆ − t ¯ ⋆ ) 2 B {\displaystyle E[{\hat {V}}_{IJ}^{B}]-{\hat {V}}_{IJ}^{\infty }\approx {\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B}}} To eliminate this influence, bias-corrected modifications are suggested: V ^ I J − U B = V ^ I J B − n ∑ b = 1 B ( t b ⋆ − t ¯ ⋆ ) 2 B {\displaystyle {\hat {V}}_{IJ-U}^{B}={\hat {V}}_{IJ}^{B}-{\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B}}} V ^ J − U B = V ^ J B − ( e − 1 ) n ∑ b = 1 B ( t b ⋆ − t ¯ ⋆ ) 2 B {\displaystyle {\hat {V}}_{J-U}^{B}={\hat {V}}_{J}^{B}-(e-1){\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B}}}

    Read more →
  • Quadratic unconstrained binary optimization

    Quadratic unconstrained binary optimization

    Quadratic unconstrained binary optimization (QUBO), also known as unconstrained binary quadratic programming (UBQP), is a combinatorial optimization problem with a wide range of applications from finance and economics to machine learning. QUBO is an NP hard problem, and for many classical problems from theoretical computer science, like maximum cut, graph coloring and the partition problem, embeddings into QUBO have been formulated. Embeddings for machine learning models include support-vector machines, clustering and probabilistic graphical models. Moreover, due to its close connection to Ising models, QUBO constitutes a central problem class for adiabatic quantum computation, where it is solved through a physical process called quantum annealing. == Definition == Let B = { 0 , 1 } {\displaystyle \mathbb {B} =\lbrace 0,1\rbrace } the set of binary digits (or bits), then B n {\displaystyle \mathbb {B} ^{n}} is the set of binary vectors of fixed length n ∈ N {\displaystyle n\in \mathbb {N} } . Given a symmetric or upper triangular matrix Q ∈ R n × n {\displaystyle {\boldsymbol {Q}}\in \mathbb {R} ^{n\times n}} , whose entries Q i j {\displaystyle Q_{ij}} define a weight for each pair of indices i , j ∈ { 1 , … , n } {\displaystyle i,j\in \lbrace 1,\dots ,n\rbrace } , we can define the function f Q : B n → R {\displaystyle f_{\boldsymbol {Q}}:\mathbb {B} ^{n}\rightarrow \mathbb {R} } that assigns a value to each binary vector x {\displaystyle {\boldsymbol {x}}} through f Q ( x ) = x ⊺ Q x = ∑ i = 1 n ∑ j = 1 n Q i j x i x j . {\displaystyle f_{\boldsymbol {Q}}({\boldsymbol {x}})={\boldsymbol {x}}^{\intercal }{\boldsymbol {Qx}}=\sum _{i=1}^{n}\sum _{j=1}^{n}Q_{ij}x_{i}x_{j}.} Alternatively, the linear and quadratic parts can be separated as f Q ′ , q ( x ) = x ⊺ Q ′ x + q ⊺ x , {\displaystyle f_{{\boldsymbol {Q}}',{\boldsymbol {q}}}({\boldsymbol {x}})={\boldsymbol {x}}^{\intercal }{\boldsymbol {Q}}'{\boldsymbol {x}}+{\boldsymbol {q}}^{\intercal }{\boldsymbol {x}},} where Q ′ ∈ R n × n {\displaystyle {\boldsymbol {Q}}'\in \mathbb {R} ^{n\times n}} and q ∈ R n {\displaystyle {\boldsymbol {q}}\in \mathbb {R} ^{n}} . This is equivalent to the previous definition through Q = Q ′ + diag ⁡ [ q ] {\displaystyle {\boldsymbol {Q}}={\boldsymbol {Q}}'+\operatorname {diag} [{\boldsymbol {q}}]} using the diag operator, exploiting that x = x ⋅ x {\displaystyle x=x\cdot x} for all binary values x {\displaystyle x} . Intuitively, the weight Q i j {\displaystyle Q_{ij}} is added if both x i = 1 {\displaystyle x_{i}=1} and x j = 1 {\displaystyle x_{j}=1} . The QUBO problem consists of finding a binary vector x ∗ {\displaystyle {\boldsymbol {x}}^{}} that minimizes f Q {\displaystyle f_{\boldsymbol {Q}}} , i.e., ∀ x ∈ B n : f Q ( x ∗ ) ≤ f Q ( x ) {\displaystyle \forall {\boldsymbol {x}}\in \mathbb {B} ^{n}:~f_{\boldsymbol {Q}}({\boldsymbol {x}}^{})\leq f_{\boldsymbol {Q}}({\boldsymbol {x}})} . In general, x ∗ {\displaystyle {\boldsymbol {x}}^{}} is not unique, meaning there may be a set of minimizing vectors with equal value w.r.t. f Q {\displaystyle f_{\boldsymbol {Q}}} . The complexity of QUBO arises from the number of candidate binary vectors to be evaluated, as | B n | = 2 n {\displaystyle \left|\mathbb {B} ^{n}\right|=2^{n}} grows exponentially in n {\displaystyle n} . Sometimes, QUBO is defined as the problem of maximizing f Q {\displaystyle f_{\boldsymbol {Q}}} , which is equivalent to minimizing f − Q = − f Q {\displaystyle f_{-{\boldsymbol {Q}}}=-f_{\boldsymbol {Q}}} . == Properties == QUBO is scale invariant for positive factors α > 0 {\displaystyle \alpha >0} , which leave the optimum x ∗ {\displaystyle {\boldsymbol {x}}^{}} unchanged: f α Q ( x ) = x ⊺ ( α Q ) x = α ( x ⊺ Q x ) = α f Q ( x ) {\displaystyle f_{\alpha {\boldsymbol {Q}}}({\boldsymbol {x}})={\boldsymbol {x}}^{\intercal }(\alpha {\boldsymbol {Q}}){\boldsymbol {x}}=\alpha ({\boldsymbol {x}}^{\intercal }{\boldsymbol {Qx}})=\alpha f_{\boldsymbol {Q}}({\boldsymbol {x}})} . In its general form, QUBO is NP-hard and cannot be solved efficiently by any known polynomial-time algorithm. However, there are polynomially-solvable special cases, where Q {\displaystyle {\boldsymbol {Q}}} has certain properties, for example: If all coefficients are positive, the optimum is trivially x ∗ = ( 0 , … , 0 ) ⊺ {\displaystyle {\boldsymbol {x}}^{}=(0,\dots ,0)^{\intercal }} . Similarly, if all coefficients are negative, the optimum is x ∗ = ( 1 , … , 1 ) ⊺ {\displaystyle {\boldsymbol {x}}^{}=(1,\dots ,1)^{\intercal }} . If Q {\displaystyle {\boldsymbol {Q}}} is diagonal, the bits can be optimized independently, and the problem is solvable in O ( n ) {\displaystyle {\mathcal {O}}(n)} . The optimal variable assignments are simply x i ∗ = 1 {\displaystyle x_{i}^{}=1} if Q i i < 0 {\displaystyle Q_{ii}<0} , and x i ∗ = 0 {\displaystyle x_{i}^{}=0} otherwise. If all off-diagonal elements of Q {\displaystyle {\boldsymbol {Q}}} are non-positive, the corresponding QUBO problem is solvable in polynomial time. QUBO can be solved using integer linear programming solvers like CPLEX or Gurobi Optimizer. This is possible since QUBO can be reformulated as a linear constrained binary optimization problem. To achieve this, substitute the product x i x j {\displaystyle x_{i}x_{j}} by an additional binary variable z i j ∈ B {\displaystyle z_{ij}\in \mathbb {B} } and add the constraints x i ≥ z i j {\displaystyle x_{i}\geq z_{ij}} , x j ≥ z i j {\displaystyle x_{j}\geq z_{ij}} and x i + x j − 1 ≤ z i j {\displaystyle x_{i}+x_{j}-1\leq z_{ij}} . Note that z i j {\displaystyle z_{ij}} can also be relaxed to continuous variables within the bounds zero and one. == Applications == QUBO is a structurally simple, yet computationally hard optimization problem. It can be used to encode a wide range of optimization problems from various scientific areas. === Maximum Cut === Given a graph G = ( V , E ) {\displaystyle G=(V,E)} with vertex set V = { 1 , … , n } {\displaystyle V=\lbrace 1,\dots ,n\rbrace } and edges E ⊆ V × V {\displaystyle E\subseteq V\times V} , the maximum cut (max-cut) problem consists of finding two subsets S , T ⊆ V {\displaystyle S,T\subseteq V} with T = V ∖ S {\displaystyle T=V\setminus S} , such that the number of edges between S {\displaystyle S} and T {\displaystyle T} is maximized. The more general weighted max-cut problem assumes edge weights w i j ≥ 0 ∀ i , j ∈ V {\displaystyle w_{ij}\geq 0~\forall i,j\in V} , with ( i , j ) ∉ E ⇒ w i j = 0 {\displaystyle (i,j)\notin E\Rightarrow w_{ij}=0} , and asks for a partition S , T ⊆ V {\displaystyle S,T\subseteq V} that maximizes the sum of edge weights between S {\displaystyle S} and T {\displaystyle T} , i.e., max S ⊆ V ∑ i ∈ S , j ∉ S w i j . {\displaystyle \max _{S\subseteq V}\sum _{i\in S,j\notin S}w_{ij}.} By setting w i j = 1 {\displaystyle w_{ij}=1} for all ( i , j ) ∈ E {\displaystyle (i,j)\in E} this becomes equivalent to the original max-cut problem above, which is why we focus on this more general form in the following. For every vertex in i ∈ V {\displaystyle i\in V} we introduce a binary variable x i {\displaystyle x_{i}} with the interpretation x i = 0 {\displaystyle x_{i}=0} if i ∈ S {\displaystyle i\in S} and x i = 1 {\displaystyle x_{i}=1} if i ∈ T {\displaystyle i\in T} . As T = V ∖ S {\displaystyle T=V\setminus S} , every i {\displaystyle i} is in exactly one set, meaning there is a 1:1 correspondence between binary vectors x ∈ B n {\displaystyle {\boldsymbol {x}}\in \mathbb {B} ^{n}} and partitions of V {\displaystyle V} into two subsets. We observe that, for any i , j ∈ V {\displaystyle i,j\in V} , the expression x i ( 1 − x j ) + ( 1 − x i ) x j {\displaystyle x_{i}(1-x_{j})+(1-x_{i})x_{j}} evaluates to 1 if and only if i {\displaystyle i} and j {\displaystyle j} are in different subsets, equivalent to logical XOR. Let W ∈ R + n × n {\displaystyle {\boldsymbol {W}}\in \mathbb {R} _{+}^{n\times n}} with W i j = w i j ∀ i , j ∈ V {\displaystyle W_{ij}=w_{ij}~\forall i,j\in V} . By extending above expression to matrix-vector form we find that x ⊺ W ( 1 − x ) + ( 1 − x ) ⊺ W x = − 2 x ⊺ W x + ( W 1 + W ⊺ 1 ) ⊺ x {\displaystyle {\boldsymbol {x}}^{\intercal }{\boldsymbol {W}}({\boldsymbol {1}}-{\boldsymbol {x}})+({\boldsymbol {1}}-{\boldsymbol {x}})^{\intercal }{\boldsymbol {Wx}}=-2{\boldsymbol {x}}^{\intercal }{\boldsymbol {Wx}}+({\boldsymbol {W1}}+{\boldsymbol {W}}^{\intercal }{\boldsymbol {1}})^{\intercal }{\boldsymbol {x}}} is the sum of weights of all edges between S {\displaystyle S} and T {\displaystyle T} , where 1 = ( 1 , 1 , … , 1 ) ⊺ ∈ R n {\displaystyle {\boldsymbol {1}}=(1,1,\dots ,1)^{\intercal }\in \mathbb {R} ^{n}} . As this is a quadratic function over x {\displaystyle {\boldsymbol {x}}} , it is a QUBO problem whose parameter matrix we can read from above expression as Q = 2 W − diag ⁡ [ W 1 + W ⊺ 1 ] , {\displaystyle {\boldsymbol {Q}}=2{\boldsymbol {W}}-\operatorname {diag} [{\boldsymbol {W1}}+{\boldsymbol {W}}^{\intercal }{\bol

    Read more →
  • Memetic algorithm

    Memetic algorithm

    In computer science and operations research, a memetic algorithm (MA) is an extension of an evolutionary algorithm (EA) that aims to accelerate the evolutionary search for the optimum. An EA is a metaheuristic that reproduces the basic principles of biological evolution as a computer algorithm in order to solve challenging optimization or planning tasks, at least approximately. An MA uses one or more suitable heuristics or local search techniques to improve the quality of solutions generated by the EA and to speed up the search. The effects on the reliability of finding the global optimum depend on both the use case and the design of the MA. Memetic algorithms represent one of the recent growing areas of research in evolutionary computation. The term MA is now widely used as a synergy of evolutionary or any population-based approach with separate individual learning or local improvement procedures for problem search. Quite often, MAs are also referred to in the literature as Baldwinian evolutionary algorithms, Lamarckian EAs, cultural algorithms, or genetic local search. == Introduction == Inspired by both Darwinian principles of natural evolution and Dawkins' notion of a meme, the term memetic algorithm (MA) was introduced by Pablo Moscato in his technical report in 1989 where he viewed MA as being close to a form of population-based hybrid genetic algorithm (GA) coupled with an individual learning procedure capable of performing local refinements. The metaphorical parallels, on the one hand, to Darwinian evolution and, on the other hand, between memes and domain specific (local search) heuristics are captured within memetic algorithms thus rendering a methodology that balances well between generality and problem specificity. This two-stage nature makes them a special case of dual-phase evolution. The basic idea behind an MA is to combine the advantages of a global search performed by an EA (or another global search method) with the local refinement provided by one or more local search techniques, while avoiding their drawbacks. The main disadvantage of EAs is that, when searching in the vicinity of an optimum, they perform poorly in determining the exact position of that optimum. The downside of local search methods lies simply in the locality of their search relative to the chosen starting point. The combination of these two classes of methods aims to merge global and local search so that the advantages of both approaches can be leveraged. The idea of this approach can be illustrated by the search for the highest mountain in the Alps. A local search method would climb one of the mountains near the starting point, ignoring Mont Blanc as long as the starting point is not in its vicinity. An EA, on the other hand, will likely only find Mont Blanc after examining many other mountains, valleys, and hills, and then it will have difficulty identifying the summit cross. From the perspective of an MA’s global search procedure, however, only the summits of hills and mountains are seen, and its search is limited to finding the best summit. The open question is whether the additional effort required for the local search is worthwhile. This depends not only on the design of the MA but also on the specific application and the local search methods used. In the context of complex optimization, many different instantiations of memetic algorithms have been reported across a wide range of application domains, in general, converging to high-quality solutions more efficiently than their conventional evolutionary counterparts. In general, using the ideas of memetics within a computational framework is called memetic computing or memetic computation (MC). With MC, the traits of universal Darwinism are more appropriately captured. Viewed in this perspective, MA is a more constrained notion of MC. More specifically, MA covers one area of MC, in particular dealing with areas of evolutionary algorithms that marry other deterministic refinement techniques for solving optimization problems. MC extends the notion of memes to cover conceptual entities of knowledge-enhanced procedures or representations. == Theoretical Background == The no-free-lunch theorems of optimization and search state that all optimization strategies are equally effective with respect to the set of all optimization problems. Conversely, this means that one can expect the following: The more efficiently an algorithm solves a problem or class of problems, the less general it is and the more problem-specific knowledge it builds on. This insight leads directly to the recommendation to complement generally applicable metaheuristics with application-specific methods or heuristics, which fits well with the concept of MAs. == The development of MAs == === 1st generation === Pablo Moscato characterized an MA as follows: "Memetic algorithms are a marriage between a population-based global search and the heuristic local search made by each of the individuals. ... The mechanisms to do local search can be to reach a local optimum or to improve (regarding the objective cost function) up to a predetermined level." And he emphasizes "I am not constraining an MA to a genetic representation.". This original definition of MA although encompasses characteristics of cultural evolution (in the form of local refinement) in the search cycle, it may not qualify as a true evolving system according to universal Darwinism, since all the core principles of inheritance/memetic transmission, variation, and selection are missing. This suggests why the term MA stirred up criticisms and controversies among researchers when first introduced. The following pseudo code would correspond to this general definition of an MA: Pseudo code Procedure Memetic Algorithm Initialize: Generate an initial population, evaluate the individuals and assign a quality value to them; while Stopping conditions are not satisfied do Evolve a new population using stochastic search operators. Evaluate all individuals in the population and assign a quality value to them. Select the subset of individuals, Ω i l {\displaystyle \Omega _{il}} , that should undergo the individual improvement procedure. for each individual in Ω i l {\displaystyle \Omega _{il}} do Perform individual learning using meme(s) with frequency or probability of f i l {\displaystyle f_{il}} , with an intensity of t i l {\displaystyle t_{il}} . Proceed with Lamarckian or Baldwinian learning. end for end while Lamarckian learning in this context means to update the chromosome according to the improved solution found by the individual learning step, while Baldwinian learning leaves the chromosome unchanged and uses only the improved fitness. This pseudo code leaves open which steps are based on the fitness of the individuals and which are not. In question are the evolving of the new population and the selection of Ω i l {\displaystyle \Omega _{il}} . Since most MA implementations are based on EAs, the pseudo code of a corresponding representative of the first generation is also given here, following Krasnogor: Pseudo code Procedure Memetic Algorithm Based on an EA Initialization: t = 0 {\displaystyle t=0} ; // Initialization of the generation counter Randomly generate an initial population P ( t ) {\displaystyle P(t)} ; Compute the fitness f ( p ) ∀ p ∈ P ( t ) {\displaystyle f(p)\ \ \forall p\in P(t)} ; while Stopping conditions are not satisfied do Selection: Accordingly to f ( p ) {\displaystyle f(p)} choose a subset of P ( t ) {\displaystyle P(t)} and store it in M ( t ) {\displaystyle M(t)} ; Offspring: Recombine and mutate individuals p ∈ M ( t ) {\displaystyle p\in M(t)} and store them in M ′ ( t ) {\displaystyle M'(t)} ; Learning: Improve p ′ {\displaystyle p'} by local search or heuristic ∀ p ′ ∈ M ′ ( t ) {\displaystyle \forall p'\in M'(t)} ; Evaluation: Compute the fitness f ( p ′ ) ∀ p ′ ∈ M ′ ( t ) {\displaystyle f(p')\ \ \forall p'\in M'(t)} ; if Lamarckian learning then Update chromosome of p ′ {\displaystyle p'} according to improvement ∀ p ′ ∈ M ′ ( t ) {\displaystyle \forall p'\in M'(t)} ; fi New generation: Generate P ( t + 1 ) {\displaystyle P(t+1)} by selecting some individuals from P ( t ) {\displaystyle P(t)} and M ′ ( t ) {\displaystyle M'(t)} ; t = t + 1 {\displaystyle t=t+1} ; // Increment the generation counter end while Return best individual p ∈ P ( t − 1 ) {\displaystyle p\in P(t-1)} as result; There are some alternatives for this MA scheme. For example: All or some of the initial individuals may be improved by the meme(s). The parents may be locally improved instead of the offspring. Instead of all offspring, only a randomly selected or fitness-dependent fraction may undergo local improvement. The latter requires the evaluation of the offspring in M ′ ( t ) {\displaystyle M'(t)} prior to the Learning step. === 2nd generation === Multi-meme, hyper-heuristic and meta-Lamarckian MA are referred to as second generation MA exhibiting the principles of me

    Read more →
  • FreshBooks

    FreshBooks

    FreshBooks is accounting software operated by 2ndSite Inc. primarily for small and medium-sized businesses. It is a web-based software as a service (SaaS) model, that can be accessed through a desktop or mobile device. The company was founded in 2003 and is based in Toronto, Canada. == History == FreshBooks was founded in 2004 by Mike McDerment, Levi Cooperman, and Joe Sawada in Toronto, Ontario. McDerment incorporated a second company, BillSpring in January 2015 to work on new product development. It was rolled back into FreshBooks as an updated interface in 2016. Initially FreshBooks functioned like an electronic invoicing program targeting IT professionals. After the release of the new interface, the initial release of FreshBooks was referred to as "FreshBooks Classic." FreshBooks Classic was discontinued in 2022 after migrating users to the new platform. FreshBooks Classic's front-end application was built in PHP, and the backend services were built in Python while the new FreshBooks uses the same backend services with a JavaScript single-page application. == Product == FreshBooks is a subscription-based accounting software platform that provides features such as invoicing, accounts payable, expense and time tracking, retainers, fixed asset depreciation, purchase orders, payroll integrations, mileage tracking, double-entry accounting, and standard business reporting. Financial data is stored in the cloud on a unified ledger, enabling access from desktop and mobile devices. The platform includes a free API for integration with external applications and supports multiple tax rates and currencies. It also offers project management and payroll functionalities. Pricing is based on a recurring monthly fee. FreshBooks supports country-specific tax calculations, including GST and HST in Canada, sales taxes in the United States, and MTD compliance in the UK. == Operations == FreshBooks has its headquarters in Toronto, Canada with operations in North America, Europe and Australia. Founder Mike McDerment was the chief executive officer of the company from 2003 until 2021, when he stepped down and was replaced by Don Epperson, but stayed as the executive chair. Don Epperson had previously joined FreshBooks as executive director in 2019. == Funding == FreshBooks was initially self-funded. In 2014, the company raised a Series A venture investment of $30 million led by the venture capital firm Oak Investment Partners, with participation by Georgian Partners and Atlas Venture. In 2017, FreshBooks announced that it raised another $43 million in funding from Accomplice, Georgian Partners and Oak Investment Partners. On August 10, 2021, FreshBooks announced that it had secured $80.75 million in Series E funding and $50 million in debt financing. FreshBooks also reached a valuation of more than $1 billion.

    Read more →
  • Synaptic weight

    Synaptic weight

    In neuroscience and computer science, synaptic weight refers to the strength or amplitude of a connection between two nodes, corresponding in biology to the amount of influence the firing of one neuron has on another. The term is typically used in artificial and biological neural network research. == Computation == In a computational neural network, a vector or set of inputs x {\displaystyle {\textbf {x}}} and outputs y {\displaystyle {\textbf {y}}} , or pre- and post-synaptic neurons respectively, are interconnected with synaptic weights represented by the matrix w {\displaystyle w} , where for a linear neuron y j = ∑ i w i j x i or y = w x {\displaystyle y_{j}=\sum _{i}w_{ij}x_{i}~~{\textrm {or}}~~{\textbf {y}}=w{\textbf {x}}} . where the rows of the synaptic matrix represent the vector of synaptic weights for the output indexed by j {\displaystyle j} . The synaptic weight is changed by using a learning rule, the most basic of which is Hebb's rule, which is usually stated in biological terms as Neurons that fire together, wire together. Computationally, this means that if a large signal from one of the input neurons results in a large signal from one of the output neurons, then the synaptic weight between those two neurons will increase. The rule is unstable, however, and is typically modified using such variations as Oja's rule, radial basis functions or the backpropagation algorithm. == Biology == For biological networks, the effect of synaptic weights is not as simple as for linear neurons or Hebbian learning. However, biophysical models such as BCM theory have seen some success in mathematically describing these networks. In the mammalian central nervous system, signal transmission is carried out by interconnected networks of nerve cells, or neurons. For the basic pyramidal neuron, the input signal is carried by the axon, which releases neurotransmitter chemicals into the synapse which is picked up by the dendrites of the next neuron, which can then generate an action potential which is analogous to the output signal in the computational case. The synaptic weight in this process is determined by several variable factors: How well the input signal propagates through the axon (see myelination), The amount of neurotransmitter released into the synapse and the amount that can be absorbed in the following cell (determined by the number of AMPA and NMDA receptors on the cell membrane and the amount of intracellular calcium and other ions), The number of such connections made by the axon to the dendrites, How well the signal propagates and integrates in the postsynaptic cell. The changes in synaptic weight that occur is known as synaptic plasticity, and the process behind long-term changes (long-term potentiation and depression) is still poorly understood. Hebb's original learning rule was originally applied to biological systems, but has had to undergo many modifications as a number of theoretical and experimental problems came to light.

    Read more →
  • Logic learning machine

    Logic learning machine

    Logic learning machine (LLM) is a machine learning method based on the generation of intelligible rules. LLM is an efficient implementation of the Switching Neural Network (SNN) paradigm, developed by Marco Muselli, Senior Researcher at the Italian National Research Council CNR-IEIIT in Genoa. LLM has been employed in many different sectors, including the field of medicine (orthopedic patient classification, DNA micro-array analysis and Clinical Decision Support Systems), financial services and supply chain management. == History == The Switching Neural Network approach was developed in the 1990s to overcome the drawbacks of the most commonly used machine learning methods. In particular, black box methods, such as multilayer perceptron and support vector machine, had good accuracy but could not provide deep insight into the studied phenomenon. On the other hand, decision trees were able to describe the phenomenon but often lacked accuracy. Switching Neural Networks made use of Boolean algebra to build sets of intelligible rules able to obtain very good performance. In 2014, an efficient version of Switching Neural Network was developed and implemented in the Rulex suite with the name Logic Learning Machine. Also, an LLM version devoted to regression problems was developed. == General == Like other machine learning methods, LLM uses data to build a model able to perform a good forecast about future behaviors. LLM starts from a table including a target variable (output) and some inputs and generates a set of rules that return the output value y {\displaystyle y} corresponding to a given configuration of inputs. A rule is written in the form: if premise then consequence where consequence contains the output value whereas premise includes one or more conditions on the inputs. According to the input type, conditions can have different forms: for categorical variables the input value must be in a given subset: x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} . for ordered variables the condition is written as an inequality or an interval: x 2 ≤ α {\displaystyle x_{2}\leq \alpha } or β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } A possible rule is therefore in the form if x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} AND x 2 ≤ α {\displaystyle x_{2}\leq \alpha } AND β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } then y = y ¯ {\displaystyle y={\bar {y}}} == Types == According to the output type, different versions of the Logic Learning Machine have been developed: Logic Learning Machine for classification, when the output is a categorical variable, which can assume values in a finite set Logic Learning Machine for regression, when the output is an integer or real number.

    Read more →
  • Online machine learning

    Online machine learning

    In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., prediction of prices in the financial international markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches. Online machine learning algorithms find applications in a wide variety of fields such as sponsored search to maximize ad revenue, portfolio optimization, shortest path prediction (with stochastic weights, e.g. traffic on roads for a maps application), spam filtering, real-time fraud detection, dynamic pricing for e-commerce, etc. There is also growing interest in usage of online learning paradigms for LLMs to enable continuous, real-time adaptation after the initial training. == Introduction == In the setting of supervised learning, a function of f : X → Y {\displaystyle f:X\to Y} is to be learned, where X {\displaystyle X} is thought of as a space of inputs and Y {\displaystyle Y} as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution p ( x , y ) {\displaystyle p(x,y)} on X × Y {\displaystyle X\times Y} . In reality, the learner never knows the true distribution p ( x , y ) {\displaystyle p(x,y)} over instances. Instead, the learner usually has access to a training set of examples ( x 1 , y 1 ) , … , ( x n , y n ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{n},y_{n})} . In this setting, the loss function is given as V : Y × Y → R {\displaystyle V:Y\times Y\to \mathbb {R} } , such that V ( f ( x ) , y ) {\displaystyle V(f(x),y)} measures the difference between the predicted value f ( x ) {\displaystyle f(x)} and the true value y {\displaystyle y} . The ideal goal is to select a function f ∈ H {\displaystyle f\in {\mathcal {H}}} , where H {\displaystyle {\mathcal {H}}} is a space of functions called a hypothesis space, so that some notion of total loss is minimized. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms. == Statistical view of online learning == In statistical learning models, the training sample ( x i , y i ) {\displaystyle (x_{i},y_{i})} are assumed to have been drawn from the true distribution p ( x , y ) {\displaystyle p(x,y)} and the objective is to minimize the expected "risk" I [ f ] = E [ V ( f ( x ) , y ) ] = ∫ V ( f ( x ) , y ) d p ( x , y ) . {\displaystyle I[f]=\mathbb {E} [V(f(x),y)]=\int V(f(x),y)\,dp(x,y)\ .} A common paradigm in this situation is to estimate a function f ^ {\displaystyle {\hat {f}}} through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input ( x t + 1 , y t + 1 ) {\displaystyle (x_{t+1},y_{t+1})} , the current best predictor f t {\displaystyle f_{t}} and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used where f t + 1 {\displaystyle f_{t+1}} is permitted to depend on f t {\displaystyle f_{t}} and all previous data points ( x 1 , y 1 ) , … , ( x t , y t ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{t},y_{t})} . In this case, the space requirements are no longer guaranteed to be constant since it requires storing all previous data points, but the solution may take less time to compute with the addition of a new data point, as compared to batch learning techniques. A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of b ≥ 1 {\displaystyle b\geq 1} data points at a time, this can be considered as pseudo-online learning for b {\displaystyle b} much smaller than the total number of training points. Mini-batch techniques are used with repeated passing over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example, stochastic gradient descent. When combined with backpropagation, this is currently the de facto training method for training artificial neural networks. === Example: linear least squares === The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions. === Batch learning === Consider the setting of supervised learning with f {\displaystyle f} being a linear function to be learned: f ( x j ) = ⟨ w , x j ⟩ = w ⋅ x j {\displaystyle f(x_{j})=\langle w,x_{j}\rangle =w\cdot x_{j}} where x j ∈ R d {\displaystyle x_{j}\in \mathbb {R} ^{d}} is a vector of inputs (data points) and w ∈ R d {\displaystyle w\in \mathbb {R} ^{d}} is a linear filter vector. The goal is to compute the filter vector w {\displaystyle w} . To this end, a square loss function V ( f ( x j ) , y j ) = ( f ( x j ) − y j ) 2 = ( ⟨ w , x j ⟩ − y j ) 2 {\displaystyle V(f(x_{j}),y_{j})=(f(x_{j})-y_{j})^{2}=(\langle w,x_{j}\rangle -y_{j})^{2}} is used to compute the vector w {\displaystyle w} that minimizes the empirical loss I n [ w ] = ∑ j = 1 n V ( ⟨ w , x j ⟩ , y j ) = ∑ j = 1 n ( x j T w − y j ) 2 {\displaystyle I_{n}[w]=\sum _{j=1}^{n}V(\langle w,x_{j}\rangle ,y_{j})=\sum _{j=1}^{n}(x_{j}^{\mathsf {T}}w-y_{j})^{2}} where y j ∈ R . {\displaystyle y_{j}\in \mathbb {R} .} Let X {\displaystyle X} be the i × d {\displaystyle i\times d} data matrix and y ∈ R i {\displaystyle y\in \mathbb {R} ^{i}} is the column vector of target values after the arrival of the first i {\displaystyle i} data points. Assuming that the covariance matrix Σ i = X T X {\displaystyle \Sigma _{i}=X^{\mathsf {T}}X} is invertible (otherwise it is preferential to proceed in a similar fashion with Tikhonov regularization), the best solution f ∗ ( x ) = ⟨ w ∗ , x ⟩ {\displaystyle f^{}(x)=\langle w^{},x\rangle } to the linear least squares problem is given by w ∗ = ( X T X ) − 1 X T y = Σ i − 1 ∑ j = 1 i x j y j . {\displaystyle w^{}=(X^{\mathsf {T}}X)^{-1}X^{\mathsf {T}}y=\Sigma _{i}^{-1}\sum _{j=1}^{i}x_{j}y_{j}.} Now, calculating the covariance matrix Σ i = ∑ j = 1 i x j x j T {\displaystyle \Sigma _{i}=\sum _{j=1}^{i}x_{j}x_{j}^{\mathsf {T}}} takes time O ( i d 2 ) {\displaystyle O(id^{2})} , inverting the d × d {\displaystyle d\times d} matrix takes time O ( d 3 ) {\displaystyle O(d^{3})} , while the rest of the multiplication takes time O ( d 2 ) {\displaystyle O(d^{2})} , giving a total time of O ( i d 2 + d 3 ) {\displaystyle O(id^{2}+d^{3})} . When there are n {\displaystyle n} total points in the dataset, to recompute the solution after the arrival of every datapoint i = 1 , … , n {\displaystyle i=1,\ldots ,n} , the naive approach will have a total complexity O ( n 2 d 2 + n d 3 ) {\displaystyle O(n^{2}d^{2}+nd^{3})} . Note that when storing the matrix Σ i {\displaystyle \Sigma _{i}} , then updating it at each step needs only adding x i + 1 x i + 1 T {\displaystyle x_{i+1}x_{i+1}^{\mathsf {T}}} , which takes O ( d 2 ) {\displaystyle O(d^{2})} time, reducing the total time to O ( n d 2 + n d 3 ) = O ( n d 3 ) {\displaystyle O(nd^{2}+nd^{3})=O(nd^{3})} , but with an additional storage space of O ( d 2 ) {\displaystyle O(d^{2})} to store Σ i {\displaystyle \Sigma _{i}} . === Online learning: recursive least squares === The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising w 0 = 0 ∈ R d {\displaystyle \textstyle w_{0}=0\in \mathbb {R} ^{d}} and Γ 0 = I ∈ R d × d {\displaystyle \textstyle \Gamma _{0}=I\in \mathbb {R} ^{d\times d}} , the solution of the linear least squares problem given in the previous section can be computed by the following iteration: Γ i = Γ i − 1 − Γ i − 1 x i x i T Γ i − 1 1 + x i T Γ i − 1 x i {\displaystyle \Gamma _{i}=\Gamma _{i-1}-{\frac {\Gamma _{i-1}x_{i}x_{i}^{\mathsf {T}}\Gamma _{i-1}}{1+x_{i}^{\mathsf {T}}\Gamma _{i-1}x_{i}}}} w i = w i − 1 − Γ i x i ( x i T w i − 1 − y i ) {\displaystyle w_{i}=w_{i-1}-\Gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{

    Read more →
  • Mojito (framework)

    Mojito (framework)

    Mojito is an environment agnostic, Model-View-Controller (MVC) web application framework. It was designed by Yahoo. == Features == Mojito supports agile development of web applications. Mojito has built-in support for unit testing, Internationalization, syntax and coding convention checks. Both server and client components are written in JavaScript. Mojito allows developers designing web applications to leverage the utilities of both configuration and MVC framework. Mojito is capable of running on both JavaScript-enabled web browsers and servers using Node.js because they both utilize JavaScript. Mojito applications mainly consist of two components: JSON Configuration files: these define relationships between code components, assets, routing paths, and framework defaults and are available at the application and mojit level. Directories: these reflect MVC architecture and are used to separate resources such as assets, libraries, middleware, etc. == Architecture == In Mojito, both server and "client" side scripting is done in JavaScript, allowing it to run on both client and server thereby breaking the "front-end back-end barrier." It has both client and server runtimes. === Server runtime === This block houses operations needed by server side components. Services include: Routing rules, HTTP Server, config loader and disk-based loader. === Client runtime === This block houses operations called upon while running client sides components. Services include local storage/cache access and JSON based /URL based loader === Core === Core function can be accessed on client or server. Services include Registry, Dispatcher, Front controller, Resource store. === Container === mojit object comes into the picture. This container also include the services used by mojits. API and Mojito services are the blocks which caters to services needed for execution of mojits. === API (Action Context) === Mojito services are a customizable service block. It offers mojits a range of services which might be needed by mojit to carry out certain actions. These services can be availed at both client and server side. Reusable services can be created and aggregated to the core here. == Mojits == Mojits are the modules of a Mojito application. An application consists of one or more mojits. A mojit encompasses a Model, Views and a Controller defined by JSON configuration files. It includes a View factory where views are created according to the model and a View cache that holds frequently requested views to aid performance. === Application Architecture === A Mojito application is a set of mojits facilitated by configurable JSON files which define the code for model, view and controller. This MVC structure works with API block and Mojito services, and can be deployed at both client and server side. While the application is deployed at client side, it can call server-side modules using binders. Binders are mojit codes that let mojits request services from each other. Mojit Proxy acts as an intermediary between binders and mojit's API (application context) block and other mojits. Controllers are command-issuing units of mojits. Models mirror the core logic and hold data. Applications can have multiple models. They can be centrally accessed from controllers. View files are created in accordance with controllers and models, and are marked-up before they are sent to users as output. === Application Directory Structure === Directory structure of a Mojito application with one mojit: [mojito_app]/ |-- application.json |-- assets/ | `-- favicon.icon |-- yui_modules/ | `-- .{affinity}.js |-- index.js |-- mojits/ | `-- [mojit_name | |-- assets/ | |-- yui_modules/ | | `-- .{affinity}.js | |-- binders/ | | `-- {view_name}.js | |-- controller.{affinity}.js | |-- defaults.json | |-- definition.json | |-- lang/ | | `-- {mojit_name}_{lang}.js | |-- models/ | | `-- {model_name}.{affinity}.js | |-- tests/ | | |-- yui_modules/ | | | `-- {module_name}.{affinity}-tests.js | | |-- controller.{affinity}-tests.js | | `-- models/ | | `-- {model_name}.{affinity}-tests.js | `-- views/ | |-- {view_name}.{view_engine}.html | `-- {view_name}.{device}.{view_engine}.html |-- package.json |-- routes.json (deprecated) |-- server.js == Model, View and Controller == The Model hosts data, which is accessed by the Controller and presented to the View. Controller also handles any client requests for data, in which case controller fetches data from the model and passes the data to the client. All three components are clustered in the mojit. Mojits are physically illustrated by directory structures and an application can have multiple mojits. Every mojit can have one controller, one or more views and zero or more models. === Model === The model it represents the application data and is independent of view or controller. Model contains code to manipulate the data. They are found in the models directory of each mojit. Functions include: Storing information for access by controller. Validation and error handling. Metadata required by the view === Controller === The controller acts like a connecting agent between model and view. It supplies input to Model and after fetching data from model, passes it to View. Functions include Redirection Monitors authentication Web safety Encoding === View === The view acts as a presentation filter by highlighting some model attributes and suppressing others. A view can be understood as a visual permutation of the model. The view renders data received from controller and displays it to the end user.

    Read more →
  • Autoencoder

    Autoencoder

    An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms. Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive autoencoders), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which can be used as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data. == Mathematical principles == === Definition === An autoencoder is defined by the following components: Two sets: the space of encoded messages Z {\displaystyle {\mathcal {Z}}} ; the space of decoded messages X {\displaystyle {\mathcal {X}}} . Typically X {\displaystyle {\mathcal {X}}} and Z {\displaystyle {\mathcal {Z}}} are Euclidean spaces, that is, X = R m , Z = R n {\displaystyle {\mathcal {X}}=\mathbb {R} ^{m},{\mathcal {Z}}=\mathbb {R} ^{n}} with m > n . {\displaystyle m>n.} Two parametrized families of functions: the encoder family E ϕ : X → Z {\displaystyle E_{\phi }:{\mathcal {X}}\rightarrow {\mathcal {Z}}} , parametrized by ϕ {\displaystyle \phi } ; the decoder family D θ : Z → X {\displaystyle D_{\theta }:{\mathcal {Z}}\rightarrow {\mathcal {X}}} , parametrized by θ {\displaystyle \theta } .For any x ∈ X {\displaystyle x\in {\mathcal {X}}} , we usually write z = E ϕ ( x ) {\displaystyle z=E_{\phi }(x)} , and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any z ∈ Z {\displaystyle z\in {\mathcal {Z}}} , we usually write x ′ = D θ ( z ) {\displaystyle x'=D_{\theta }(z)} , and refer to it as the (decoded) message. Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder E ϕ {\displaystyle E_{\phi }} is: E ϕ ( x ) = σ ( W x + b ) {\displaystyle E_{\phi }(\mathbf {x} )=\sigma (Wx+b)} where σ {\displaystyle \sigma } is an element-wise activation function, W {\displaystyle W} is a "weight" matrix, and b {\displaystyle b} is a "bias" vector. === Training an autoencoder === An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution μ r e f {\displaystyle \mu _{ref}} over X {\displaystyle {\mathcal {X}}} , and a "reconstruction quality" function d : X × X → [ 0 , ∞ ] {\displaystyle d:{\mathcal {X}}\times {\mathcal {X}}\to [0,\infty ]} , such that d ( x , x ′ ) {\displaystyle d(x,x')} measures how much x ′ {\displaystyle x'} differs from x {\displaystyle x} . With those, we can define the loss function for the autoencoder as L ( θ , ϕ ) := E x ∼ μ r e f [ d ( x , D θ ( E ϕ ( x ) ) ) ] {\displaystyle L(\theta ,\phi ):=\mathbb {\mathbb {E} } _{x\sim \mu _{ref}}[d(x,D_{\theta }(E_{\phi }(x)))]} The optimal autoencoder for the given task ( μ r e f , d ) {\displaystyle (\mu _{ref},d)} is then arg ⁡ min θ , ϕ L ( θ , ϕ ) {\displaystyle \arg \min _{\theta ,\phi }L(\theta ,\phi )} . The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder". In most situations, the reference distribution is just the empirical distribution given by a dataset { x 1 , . . . , x N } ⊂ X {\displaystyle \{x_{1},...,x_{N}\}\subset {\mathcal {X}}} , so that μ r e f = 1 N ∑ i = 1 N δ x i {\displaystyle \mu _{ref}={\frac {1}{N}}\sum _{i=1}^{N}\delta _{x_{i}}} where δ x i {\displaystyle \delta _{x_{i}}} is the Dirac measure, the quality function is just L 2 {\displaystyle L^{2}} loss: d ( x , x ′ ) = ‖ x − x ′ ‖ 2 2 {\displaystyle d(x,x')=\|x-x'\|_{2}^{2}} , and ‖ ⋅ ‖ 2 {\displaystyle \|\cdot \|_{2}} is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization: min θ , ϕ L ( θ , ϕ ) , where L ( θ , ϕ ) = 1 N ∑ i = 1 N ‖ x i − D θ ( E ϕ ( x i ) ) ‖ 2 2 {\displaystyle \min _{\theta ,\phi }L(\theta ,\phi ),\qquad {\text{where }}L(\theta ,\phi )={\frac {1}{N}}\sum _{i=1}^{N}\|x_{i}-D_{\theta }(E_{\phi }(x_{i}))\|_{2}^{2}} === Interpretation === An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function d {\displaystyle d} . The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space Z {\displaystyle {\mathcal {Z}}} usually has fewer dimensions than the message space X {\displaystyle {\mathcal {X}}} . Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality. At the limit of an ideal undercomplete autoencoder, every possible code z {\displaystyle z} in the code space is used to encode a message x {\displaystyle x} that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} , and the decoder is also perfect: D θ ( E ϕ ( x ) ) = x {\displaystyle D_{\theta }(E_{\phi }(x))=x} . This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code z {\displaystyle z} and obtaining D θ ( z ) {\displaystyle D_{\theta }(z)} , which is a message that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} . If the code space Z {\displaystyle {\mathcal {Z}}} has dimension larger than (overcomplete), or equal to, the message space X {\displaystyle {\mathcal {X}}} , or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below. == Variations == === Variational autoencoder (VAE) === Variational autoencoders (VAEs) belong to the families of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors. Given an input dataset x {\displaystyle x} characterized by an unknown probability function P ( x ) {\displaystyle P(x)} and a multivariate latent encoding vector z {\displaystyle z} , the objective is to model the data as a distribution p θ ( x ) {\displaystyle p_{\theta }(x)} , with θ {\displaystyle \theta } defined as the set of the network parameters so that p θ ( x ) = ∫ z p θ ( x , z ) d z {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }(x,z)dz} . === Sparse autoencoder (SAE) === Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders (SAE) are variants of autoencoders, such that the codes E ϕ ( x ) {\displaystyle E_{\phi }(x)} for messages tend to be sparse codes, that is, E ϕ ( x ) {\displaystyle E_{\phi }(x)} is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks. There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder. The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder: f k ( x 1 , . . . , x n ) = ( x 1 b 1 , . . . , x n b n ) {\displaystyle f_{k}(x_{1},...,x_{n})=(x_{1}b_{1},...,x_{n}b_{n})} where b i = 1 {\displaystyle b_{i}=1} if | x i | {\displaystyle |x_{i}|} ranks in the top k, and 0 otherwise. Backpropagating through f k {\displaystyle f_{k}} is simple: set gradient to 0 for b i = 0 {\displaystyle b_{i}=0} entries, and keep gradient for b i = 1 {\displaystyle b_{i}=1} entries. This is essentially a generalized ReLU function. The other way is a relaxed version of the k-

    Read more →
  • Evolutionary attractor

    Evolutionary attractor

    An evolutionary attractor is a point in an evolutionary space where a selection process will always drive trait values towards that point from the region around it. Because of the importance of evolution through natural selection, often such an evolutionary space will be defined by genetic or phenotypic traits, or possibly both. In this case the selection process will be a form of natural selection. The existence of an evolutionary attractor in a biological evolutionary space does not always imply that it can be reached from all points in that evolutionary space, nor does it identify what will happen when the evolutionary attractor is reached. While an evolutionary attractor may represent a point in evolutionary space that is resistant to further selection, such as an evolutionarily stable strategy, other possibilities are available. Because identification of an evolutionary attractor on its own does not describe everything about the evolutionary space in which it lies, this has led to interest in the evolutionary dynamics surrounding evolutionary attractors and in evolutionary spaces in general. (Theoretical biologists and mathematicians working in the area may prefer the terms adaptive dynamics or evolutionary invasion analysis to evolutionary dynamics.) These fields use differential equations which allows a more complete understanding of the dynamics in evolutionary spaces including the existence or otherwise of evolutionary attractors. Advances in the study of molecular evolution have also led to the identification of evolutionary attractors at a molecular level. Because biological evolutionary processes have been studied using evolutionary game theory, a technique inspired by game theory originally derived to address economic problems, not only can evolutionary attractors be found in biology but economists studying evolutionary economic models have also identified evolutionary attractors. Evolution in biology has also inspired evolutionary computation in computer science. Many algorithms in this field use a form of selection inspired by natural selection to generate results through evolutionary algorithms. This is therefore another area in which evolutionary attractors have been identified. == Evolutionary attractors in biology == It is not probably not surprising that biology is the field where most examples of evolutionary attractors have been identified, given the importance of evolution through natural selection. === Evolutionary attractors in adaptive landscapes === An evolutionary attractor is a point in genetic and/or phenotypic trait space, that evolution will always drive trait values towards via a selection process. The concept of an evolutionary attractor arose in population genetics following the origin of the adaptive landscape originally proposed by Sewall Wright in 1932. The height of a point in an adaptive landscape is a measure of evolutionary fitness. If a point in an adaptive landscape is a peak, then selection will always drive traits towards it and it will be an evolutionary attractor. While population genetics deals with discrete genetic traits, quantitative genetics extended such concepts to deal with continuous genetic traits, where the concept of evolutionary attractor is also valid. === Evolutionary attractors in evolutionary game models === Evolutionary game theory introduced into evolutionary biology concepts originally used in economics, with the advantage that evolution could be studied in relation to strategic choices made in animal conflicts. This is of particular interest because of the concept of the evolutionarily stable strategy or ESS, a strategy that once established is resistant to invasion by other strategies. ESSs will not always be evolutionary attractors, but if they are they will persist over evolutionary time. === Dynamics around evolutionary attractors in biology === Evolutionary attractors in biology do not exist in isolation. By definition they must exist in an evolutionary trait space where selection drives all traits towards them from a region immediately around them. That is, they must be convergence stable. Eshel (1983) modified the definition of an ESS by considering individually advantageous reduction from a majority deviation: he created the term continuous stability. A continuously stable ESS can be shown to be convergence stable, therefore it will act as an evolutionary attractor. But the nature of evolutionary trait spaces in biology means that it is not possible to guarantee that the region of convergence to the evolutionary attractor covers the whole of the trait space, nor that there is only one evolutionary attractor in a particular trait space. These issues have led to the emergence of the related fields of evolutionary dynamics, adaptive dynamics and evolutionary invasion analysis, all of which use differential equations to understand the dynamics in evolutionary trait spaces. Hence, if one or more evolutionary attractor exists in an evolutionary trait space, they provide techniques to understand the dynamics in that trait space around the evolutionary attractor. === Evolutionary attractors in an ecological context === Evolution in biology does not take place in single species in isolation. Ecological interaction of species leads to coevolution. Important examples of this are host-parasite or host-pathogen interaction, which can make both the dynamics around evolutionary attractors more complex, and the occurrence and number of evolutionary attractors more diverse. Evolutionary attractors have been identified in the analysis of evolutionary epidemiology of plant pathogens. In the above study working on plant populations the authors were able to identify evolutionary attractors using methods from adaptive dynamics. A model applied to the analysis of a maize (Zea mays L.) virus identified convergence stable equilibria through simulation modelling. A related model identified evolutionary attractors in the interaction of plants with fungal pathogens. === Evolutionary attractors in molecular genetics === As mentioned above much of the consideration of evolutionary attractors in biology has been through investigation of selection at a genetic or phenotypic level or both, in a single species or in coevolving species. Advances in the study of molecular genetics now allow the study of evolutionary attractors to be taken to a molecular genetic level. Wilson et. al (2019) studied the evolution of gene regulatory networks and identified the emergence of evolutionary attractors. == Evolutionary attractors in economics == Evolutionary game theory as applied in biology was inspired by game theory originally devised for applications in economics. Game theory remains an active field of research outside of biology, and thus it is not surprising that researchers in evolutionary economics use evolutionary game theory. Evolutionary attractors have been demonstrated by economists studying the evolutionary dynamics of market entry with market dynamics based on the replicator dynamics of biological evolutionary games. == Evolutionary attractors in computing == Evolutionary computation is a branch of computer science inspired by biological evolution. Many algorithms in evolutionary computation use a form of selection. Thus evolutionary attractors have been identified in computer science as well as in biology and economics. Evolutionary algorithms have generated evolutionary attractors, probably because of the similarity between adaptive hill-climbing in evolutionary heuristics and the adaptive landscape originated to explain evolution through natural selection.

    Read more →