The following outline is provided as an overview of, and topical guide to, machine learning: Machine learning (ML) is a subfield of artificial intelligence within computer science that evolved from the study of pattern recognition and computational learning theory. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". ML involves the study and construction of algorithms that can learn from and make predictions on data. These algorithms operate by building a model from a training set of example observations to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions. == How can machine learning be categorized? == An academic discipline A branch of science An applied science A subfield of computer science A branch of artificial intelligence A subfield of soft computing Application of statistics === Paradigms of machine learning === Supervised learning, where the model is trained on labeled data Unsupervised learning, where the model tries to identify patterns in unlabeled data Reinforcement learning, where the model learns to make decisions by receiving rewards or penalties. == Applications of machine learning == Applications of machine learning Bioinformatics Biomedical informatics Computer vision Customer relationship management Data mining Earth sciences Email filtering Inverted pendulum (balance and equilibrium system) Natural language processing Named Entity Recognition Automatic summarization Automatic taxonomy construction Dialog system Grammar checker Language recognition Handwriting recognition Optical character recognition Speech recognition Text to Speech Synthesis Speech Emotion Recognition Machine translation Question answering Speech synthesis Text mining Term frequency–inverse document frequency Text simplification Pattern recognition Facial recognition system Handwriting recognition Image recognition Optical character recognition Speech recognition Recommendation system Collaborative filtering Content-based filtering Hybrid recommender systems Search engine Search engine optimization Social engineering == Machine learning hardware == Graphics processing unit Tensor processing unit Vision processing unit == Machine learning tools == Comparison of machine learning software Comparison of deep learning software === Machine learning frameworks === ==== Proprietary machine learning frameworks ==== Amazon Machine Learning Microsoft Azure Machine Learning Studio DistBelief (replaced by TensorFlow) ==== Open source machine learning frameworks ==== Apache Singa Apache MXNet Caffe PyTorch mlpack TensorFlow Torch CNTK Accord.Net Jax MLJ.jl – A machine learning framework for Julia === Machine learning libraries === Deeplearning4j Theano scikit-learn Keras === Machine learning algorithms === == Machine learning methods == === Instance-based algorithm === K-nearest neighbors algorithm (KNN) Learning vector quantization (LVQ) Self-organizing map (SOM) === Regression analysis === Logistic regression Ordinary least squares regression (OLSR) Linear regression Stepwise regression Multivariate adaptive regression splines (MARS) Regularization algorithm Ridge regression Least Absolute Shrinkage and Selection Operator (LASSO) Elastic net Least-angle regression (LARS) Classifiers Probabilistic classifier Naive Bayes classifier Binary classifier Linear classifier Hierarchical classifier === Dimensionality reduction === Dimensionality reduction Canonical correlation analysis (CCA) Factor analysis Feature extraction Feature selection Independent component analysis (ICA) Linear discriminant analysis (LDA) Multidimensional scaling (MDS) Non-negative matrix factorization (NMF) Partial least squares regression (PLSR) Principal component analysis (PCA) Principal component regression (PCR) Projection pursuit Sammon mapping t-distributed stochastic neighbor embedding (t-SNE) === Ensemble learning === Ensemble learning AdaBoost Boosting Bootstrap aggregating (also "bagging" or "bootstrapping") Ensemble averaging Gradient boosted decision tree (GBDT) Gradient boosting Random Forest Stacked Generalization === Meta-learning === Meta-learning Inductive bias Metadata === Reinforcement learning === Reinforcement learning Q-learning State–action–reward–state–action (SARSA) Temporal difference learning (TD) Learning Automata === Supervised learning === Supervised learning Averaged one-dependence estimators (AODE) Artificial neural network Case-based reasoning Gaussian process regression Gene expression programming Group method of data handling (GMDH) Inductive logic programming Instance-based learning Lazy learning Learning Automata Learning Vector Quantization Logistic Model Tree Minimum message length (decision trees, decision graphs, etc.) Nearest Neighbor Algorithm Analogical modeling Probably approximately correct learning (PAC) learning Ripple down rules, a knowledge acquisition methodology Symbolic machine learning algorithms Support vector machines Random Forests Ensembles of classifiers Bootstrap aggregating (bagging) Boosting (meta-algorithm) Ordinal classification Conditional Random Field ANOVA Quadratic classifiers k-nearest neighbor Boosting SPRINT Bayesian networks Naive Bayes Hidden Markov models Hierarchical hidden Markov model ==== Bayesian ==== Bayesian statistics Bayesian knowledge base Naive Bayes Gaussian Naive Bayes Multinomial Naive Bayes Averaged One-Dependence Estimators (AODE) Bayesian Belief Network (BBN) Bayesian Network (BN) ==== Decision tree algorithms ==== Decision tree algorithm Decision tree Classification and regression tree (CART) Iterative Dichotomiser 3 (ID3) C4.5 algorithm C5.0 algorithm Chi-squared Automatic Interaction Detection (CHAID) Decision stump Conditional decision tree ID3 algorithm Random forest SLIQ ==== Linear classifier ==== Linear classifier Fisher's linear discriminant Linear regression Logistic regression Multinomial logistic regression Naive Bayes classifier Perceptron Support vector machine === Unsupervised learning === Unsupervised learning Expectation-maximization algorithm Vector Quantization Generative topographic map Information bottleneck method Association rule learning algorithms Apriori algorithm Eclat algorithm ==== Artificial neural networks ==== Artificial neural network Feedforward neural network Extreme learning machine Convolutional neural network Recurrent neural network Long short-term memory (LSTM) Logic learning machine Self-organizing map ==== Association rule learning ==== Association rule learning Apriori algorithm Eclat algorithm FP-growth algorithm ==== Hierarchical clustering ==== Hierarchical clustering Single-linkage clustering Conceptual clustering ==== Cluster analysis ==== Cluster analysis BIRCH DBSCAN Expectation–maximization (EM) Fuzzy clustering Hierarchical clustering k-means clustering k-medians Mean-shift OPTICS algorithm ==== Anomaly detection ==== Anomaly detection k-nearest neighbors algorithm (k-NN) Local outlier factor === Semi-supervised learning === Semi-supervised learning Active learning Generative models Low-density separation Graph-based methods Co-training Transduction === Deep learning === Deep learning Deep belief networks Deep Boltzmann machines Deep Convolutional neural networks Deep Recurrent neural networks Hierarchical temporal memory Generative Adversarial Network Style transfer Transformer Stacked Auto-Encoders === Other machine learning methods and problems === Anomaly detection Association rules Bias-variance dilemma Classification Multi-label classification Clustering Data Pre-processing Empirical risk minimization Feature engineering Feature learning Learning to rank Occam learning Online machine learning PAC learning Regression Reinforcement Learning Semi-supervised learning Statistical learning Structured prediction Graphical models Bayesian network Conditional random field (CRF) Hidden Markov model (HMM) Unsupervised learning VC theory == Machine learning research == List of artificial intelligence projects List of datasets for machine learning research == History of machine learning == History of machine learning Timeline of machine learning == Machine learning projects == Machine learning projects: DeepMind Google Brain OpenAI Meta AI Hugging Face == Machine learning organizations == === Machine learning conferences and workshops === Artificial Intelligence and Security (AISec) (co-located workshop with CCS) Conference on Neural Information Processing Systems (NIPS) ECML PKDD International Conference on Machine Learning (ICML) ML4ALL (Machine Learning For All) == Machine learning publications == === Books on machine learning === Mathematics for Machine Learning Hands-On Machine Learning Scikit-Learn, Keras, and TensorFlow The Hundred-Page Machine Learning Book === Machine learning journals === Machine Learning Journal of Machine Learning Research (JMLR) Neural Computation == Pe
Pixel
In digital imaging, a pixel (abbreviated px), pel, or picture element is the smallest addressable physical element of a raster image or the smallest controllable element of a display device or dot matrix printer. Pixels are arranged in a regular, two-dimensional grid, and each pixel serves as a sample of an original image, with a greater number of samples typically providing more accurate representations. Each pixel possesses a specific intensity or color, often composed of three or four component intensities, such as red, green, and blue (RGB), or cyan, magenta, yellow, and black (CMYK). The intensity of each pixel is variable, and in color imaging systems, these components are combined to produce a wide spectrum of colors. The concept of a picture element has existed since the early days of television, appearing as "Bildpunkt" in a 1888 German patent, and the term "pixel" has been used in various U.S. patents since 1911. In most digital display devices, pixels are the smallest element that can be manipulated through software. Each pixel is a sample of an original image; more samples typically provide more accurate representations of the original. The intensity of each pixel is variable. In color imaging systems, a color is typically represented by three or four component intensities such as red, green, and blue, or cyan, magenta, yellow, and black. In some contexts (such as descriptions of camera sensors), pixel refers to a single scalar element of a multi-component representation (called a photosite in the camera sensor context, although sensel 'sensor element' is sometimes used), while in yet other contexts (like MRI) it may refer to a set of component intensities for a spatial position. Software on early consumer computers was necessarily rendered at a low resolution, with large pixels visible to the naked eye; graphics made under these limitations may be called pixel art, especially in reference to video games. Modern computers and displays, however, can easily render orders of magnitude more pixels than was previously possible, necessitating the use of large measurements like the megapixel (one million pixels). == Etymology == The word pixel is a combination of pix (from "pictures", shortened to "pics") and el (for "element"); similar formations with 'el' include the words voxel 'volume pixel', and texel 'texture pixel'. The word pix appeared in Variety magazine headlines in 1932, as an abbreviation for the word pictures, in reference to movies. By 1938, "pix" was being used in reference to still pictures by photojournalists. The word "pixel" was first published in 1965 by Frederic C. Billingsley of JPL, to describe the picture elements of scanned images from space probes to the Moon and Mars. Billingsley had learned the word from Keith E. McFarland, at the Link Division of General Precision in Palo Alto, who in turn said he did not know where it originated. McFarland said simply it was "in use at the time" (c. 1963). The concept of a "picture element" dates to the earliest days of television, for example as "Bildpunkt" (the German word for pixel, literally 'picture point') in the 1888 German patent of Paul Nipkow. According to various etymologies, the earliest publication of the term picture element itself was in Wireless World magazine in 1927, though it had been used earlier in various U.S. patents filed as early as 1911. Some authors explain pixel as picture cell, as early as 1972. In graphics and in image and video processing, pel is often used instead of pixel. For example, IBM used it in their Technical Reference for the original PC. Pixilation, spelled with a second i, is an unrelated filmmaking technique that dates to the beginnings of cinema, in which live actors are posed frame by frame and photographed to create stop-motion animation. An archaic British word meaning "possession by spirits (pixies)", the term has been used to describe the animation process since the early 1950s; various animators, including Norman McLaren and Grant Munro, are credited with popularizing it. == Technical == A pixel is generally thought of as the smallest single component of a digital image. However, the definition is highly context-sensitive. For example, there can be "printed pixels" in a page, or pixels carried by electronic signals, or represented by digital values, or pixels on a display device, or pixels in a digital camera (photosensor elements). This list is not exhaustive and, depending on context, synonyms include pel, sample, byte, bit, dot, and spot. Pixels can be used as a unit of measure such as: 2400 pixels per inch, 640 pixels per line, or spaced 10 pixels apart. The measures "dots per inch" (dpi) and "pixels per inch" (ppi) are sometimes used interchangeably, but have distinct meanings, especially for printer devices, where dpi is a measure of the printer's density of dot (e.g. ink droplet) placement. For example, a high-quality photographic image may be printed with 600 ppi on a 1200 dpi inkjet printer. Even higher dpi numbers, such as the 4800 dpi quoted by printer manufacturers since 2002, do not mean much in terms of achievable resolution. The more pixels used to represent an image, the closer the result can resemble the original. The number of pixels in an image is sometimes called the resolution, though resolution has a more specific definition. Pixel counts can be expressed as a single number, as in a "three-megapixel" digital camera, which has a nominal three million pixels, or as a pair of numbers, as in a "640 by 480 display", which has 640 pixels from side to side and 480 from top to bottom (as in a VGA display) and therefore has a total number of 640 × 480 = 307,200 pixels, or 0.3 megapixels. The pixels, or color samples, that form a digitized image (such as a JPEG file used on a web page) may or may not be in one-to-one correspondence with screen pixels, depending on how a computer displays an image. In computing, an image composed of pixels is known as a bitmapped image or a raster image. The word raster originates from television scanning patterns, and has been widely used to describe similar halftone printing and storage techniques. === Sampling patterns === For convenience, pixels are normally arranged in a regular two-dimensional grid. By using this arrangement, many common operations can be implemented by uniformly applying the same operation to each pixel independently. Other arrangements of pixels are possible, with some sampling patterns even changing the shape (or kernel) of each pixel across the image. For this reason, care must be taken when acquiring an image on one device and displaying it on another, or when converting image data from one pixel format to another. For example: Liquid-crystal displays (LCDs) typically use a staggered grid, where the red, green, and blue components are sampled at slightly different locations. Subpixel rendering is a technology which takes advantage of these differences to improve the rendering of text on LCD screens. The vast majority of color digital cameras use a Bayer filter, resulting in a regular grid of pixels where the color of each pixel depends on its position on the grid. A clipmap uses a hierarchical sampling pattern, where the size of the support of each pixel depends on its location within the hierarchy. Warped grids are used when the underlying geometry is non-planar, such as images of the earth from space. The use of non-uniform grids is an active research area, attempting to bypass the traditional Nyquist limit. Pixels on computer monitors are normally "square" (that is, have equal horizontal and vertical sampling pitch); pixels in other systems are often "rectangular" (that is, have unequal horizontal and vertical sampling pitch – oblong in shape), as are digital video formats with diverse aspect ratios, such as the anamorphic widescreen formats of the Rec. 601 digital video standard. === Resolution of computer monitors === Computer monitors (and TV sets) generally have a fixed native resolution. What it is depends on the monitor, and size. See below for historical exceptions. Computers can use pixels to display an image, often an abstract image that represents a GUI. The resolution of this image is called the display resolution and is determined by the video card of the computer. Flat-panel monitors (and TV sets), e.g. OLED or LCD monitors, or E-ink, also use pixels to display an image, and have a native resolution, and it should (ideally) be matched to the video card resolution. Each pixel is made up of triads, with the number of these triads determining the native resolution. On older, historically available, CRT monitors the resolution was possibly adjustable (still lower than what modern monitor achieve), while on some such monitors (or TV sets) the beam sweep rate was fixed, resulting in a fixed native resolution. Most CRT monitors do not have a fixed beam sweep rate, meaning they do not have a native resolution at all – instead they
Joseph Nechvatal
Joseph Nechvatal (born January 15, 1951) is an American post-conceptual digital artist and art theoretician who creates computer-assisted paintings and computer animations, often using custom computer viruses. == Life and work == Joseph Nechvatal was born in Chicago. He studied fine art and philosophy at Southern Illinois University Carbondale, Cornell University, and Columbia University. He earned a Doctor of Philosophy in Philosophy of Art and Technology at the Planetary Collegium at University of Wales, Newport and has taught art theory and art history at the School of Visual Arts. He has had many solo exhibitions and is one of five artists that art historian Patrick Frank examines in his 2024 book Art of the 1980s: As If the Digital Mattered. His work in the late 1970s and early 1980s chiefly consisted of postminimal gray palimpsest-like drawings that were often photo-mechanically enlarged. Beginning in 1979 he became associated with the artist group Colab, organized the Public Arts International/Free Speech series, and helped established the non-profit group ABC No Rio. In 1983 he co-founded the avant-garde electronic art music audio project Tellus Audio Cassette Magazine. In 1984, Nechvatal began work on an opera called XS: The Opera Opus (1984-6) with the no wave musical composer Rhys Chatham. He began using computers and robotics to make post-conceptual paintings in 1986 and later, in his signature work, began to employ self-created computer viruses. From 1991 to 1993, he was artist-in-residence at the Louis Pasteur Atelier in Arbois, France and at the Saline Royale/Ledoux Foundation's computer lab. There he worked on The Computer Virus Project, his first artistic experiment with computer viruses and computer virus animation. He exhibited computer-robotic paintings at Documenta 8 in 1987. In 2002 he extended his experimentation into viral artificial life through a collaboration with the programmer Stephane Sikora of music2eye in a work called the Computer Virus Project II. Nechvatal has also created a noise music work called viral symphOny, a collaborative sound symphony created by using his computer virus software at the Institute for Electronic Arts at Alfred University. In 2021 Pentiments released Nechvatal's retrospective audio cassette called Selected Sound Works (1981-2021) and in 2022 his The Viral Tempest, a double vinyl LP of new audio work. In 2025, he joined the roster of artists/musicians at Table of the Elements with two CD/book releases: Selected Sound Works (1981-2021) and The Marriage of Orlando and Artaud, Even. From 1999 to 2013, Nechvatal taught art theories of immersive virtual reality and the viractual at the School of Visual Arts in New York City (SVA). A book of his collected essays entitled Towards an Immersive Intelligence: Essays on the Work of Art in the Age of Computer Technology and Virtual Reality (1993–2006) was published by Edgewise Press in 2009. Also in 2009, his virtual reality art theory and art history book Immersive Ideals / Critical Distances was published. In 2011, his book Immersion Into Noise was published by Open Humanities Press in conjunction with the University of Michigan Library's Scholarly Publishing Office. Nechvatal has also published three books with Punctum Books: Minóy (noise music—ed.—2014), Destroyer of Naivetés (poetry—2015), and Styling Sagaciousness (poetry—2022). In 2023 his art theory cybersex farce novella venus©~Ñ~vibrator, even was published by Orbis Tertius Press The Joseph Nechvatal archive is housed at The Fales Library Downtown Collection at the NYU Special Collections Library in New York City. === Viractualism === Viractualism is an art theory concept developed by Nechvatal in 1999 from Ph.D. research Nechvatal conducted at the Planetary Collegium at University of Wales, Newport. There he developed his concept of the viractual, which strives to create an interface between the actual and the virtual.
Junction tree algorithm
The junction tree algorithm (also known as 'Clique Tree') is a method used in machine learning to extract marginalization in general graphs. In essence, it entails performing belief propagation on a modified graph called a junction tree. The graph is called a tree because it branches into different sections of data; nodes of variables are the branches. The basic premise is to eliminate cycles by clustering them into single nodes. Multiple extensive classes of queries can be compiled at the same time into larger structures of data. There are different algorithms to meet specific needs and for what needs to be calculated. Inference algorithms gather new developments in the data and calculate it based on the new information provided. == Junction tree algorithm == === Hugin algorithm === If the graph is directed then moralize it to make it un-directed. Introduce the evidence. Triangulate the graph to make it chordal. Construct a junction tree from the triangulated graph (we will call the vertices of the junction tree "supernodes"). Propagate the probabilities along the junction tree (via belief propagation) Note that this last step is inefficient for graphs of large treewidth. Computing the messages to pass between supernodes involves doing exact marginalization over the variables in both supernodes. Performing this algorithm for a graph with treewidth k will thus have at least one computation which takes time exponential in k. It is a message passing algorithm. The Hugin algorithm takes fewer computations to find a solution compared to Shafer-Shenoy. === Shafer-Shenoy algorithm === Computed recursively Multiple recursions of the Shafer-Shenoy algorithm results in Hugin algorithm Found by the message passing equation Separator potentials are not stored The Shafer-Shenoy algorithm is the sum product of a junction tree. It is used because it runs programs and queries more efficiently than the Hugin algorithm. The algorithm makes calculations for conditionals for belief functions possible. Joint distributions are needed to make local computations happen. === Underlying theory === The first step concerns only Bayesian networks, and is a procedure to turn a directed graph into an undirected one. We do this because it allows for the universal applicability of the algorithm, regardless of direction. The second step is setting variables to their observed value. This is usually needed when we want to calculate conditional probabilities, so we fix the value of the random variables we condition on. Those variables are also said to be clamped to their particular value. The third step is to ensure that graphs are made chordal if they aren't already chordal. This is the first essential step of the algorithm. It makes use of the following theorem: Theorem: For an undirected graph, G, the following properties are equivalent: Graph G is triangulated. The clique graph of G has a junction tree. There is an elimination ordering for G that does not lead to any added edges. Thus, by triangulating a graph, we make sure that the corresponding junction tree exists. A usual way to do this, is to decide an elimination order for its nodes, and then run the Variable elimination algorithm. The variable elimination algorithm states that the algorithm must be run each time there is a different query. This will result to adding more edges to the initial graph, in such a way that the output will be a chordal graph. All chordal graphs have a junction tree. The next step is to construct the junction tree. To do so, we use the graph from the previous step, and form its corresponding clique graph. Now the next theorem gives us a way to find a junction tree: Theorem: Given a triangulated graph, weight the edges of the clique graph by their cardinality, |A∩B|, of the intersection of the adjacent cliques A and B. Then any maximum-weight spanning tree of the clique graph is a junction tree. So, to construct a junction tree we just have to extract a maximum weight spanning tree out of the clique graph. This can be efficiently done by, for example, modifying Kruskal's algorithm. The last step is to apply belief propagation to the obtained junction tree. Usage: A junction tree graph is used to visualize the probabilities of the problem. The tree can become a binary tree to form the actual building of the tree. A specific use could be found in auto encoders, which combine the graph and a passing network on a large scale automatically. === Inference Algorithms === Loopy belief propagation: A different method of interpreting complex graphs. The loopy belief propagation is used when an approximate solution is needed instead of the exact solution. It is an approximate inference. Cutset conditioning: Used with smaller sets of variables. Cutset conditioning allows for simpler graphs that are easier to read but are not exact.
Linear classifier
In machine learning, a linear classifier makes a classification decision for each object based on a linear combination of its features. A simpler definition is to say that a linear classifier is one whose decision boundaries are linear. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use. == Definition == If the input feature vector to the classifier is a real vector x → {\displaystyle {\vec {x}}} , then the output score is y = f ( w → ⋅ x → ) = f ( ∑ j w j x j ) , {\displaystyle y=f({\vec {w}}\cdot {\vec {x}})=f\left(\sum _{j}w_{j}x_{j}\right),} where w → {\displaystyle {\vec {w}}} is a real vector of weights and f is a function that converts the dot product of the two vectors into the desired output. (In other words, w → {\displaystyle {\vec {w}}} is a one-form or linear functional mapping x → {\displaystyle {\vec {x}}} onto R.) The weight vector w → {\displaystyle {\vec {w}}} is learned from a set of labeled training samples. Often f is a threshold function, which maps all values of w → ⋅ x → {\displaystyle {\vec {w}}\cdot {\vec {x}}} above a certain threshold to the first class and all other values to the second class; e.g., f ( x ) = { 1 if w T ⋅ x > θ , 0 otherwise {\displaystyle f(\mathbf {x} )={\begin{cases}1&{\text{if }}\ \mathbf {w} ^{T}\cdot \mathbf {x} >\theta ,\\0&{\text{otherwise}}\end{cases}}} The superscript T indicates the transpose and θ {\displaystyle \theta } is a scalar threshold. A more complex f might give the probability that an item belongs to a certain class. For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no". A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when x → {\displaystyle {\vec {x}}} is sparse. Also, linear classifiers often work very well when the number of dimensions in x → {\displaystyle {\vec {x}}} is large, as in document classification, where each element in x → {\displaystyle {\vec {x}}} is typically the number of occurrences of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized. == Generative models vs. discriminative models == There are two broad classes of methods for determining the parameters of a linear classifier w → {\displaystyle {\vec {w}}} . They can be generative and discriminative models. Methods of the former model joint probability distribution, whereas methods of the latter model conditional density functions P ( c l a s s | x → ) {\displaystyle P({\rm {class}}|{\vec {x}})} . Examples of such algorithms include: Linear Discriminant Analysis (LDA)—assumes Gaussian conditional density models Naive Bayes classifier with multinomial or multivariate Bernoulli event models. The second set of methods includes discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include: Logistic regression—maximum likelihood estimation of w → {\displaystyle {\vec {w}}} assuming that the observed training set was generated by a binomial model that depends on the output of the classifier. Perceptron—an algorithm that attempts to fix all errors encountered in the training set Fisher's Linear Discriminant Analysis—an algorithm (different than "LDA") that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification. Support vector machine—an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set. Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm: principal components analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact. Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models. All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space φ ( x → ) {\displaystyle \varphi ({\vec {x}})} , using the kernel trick. === Discriminative training === Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form arg min w R ( w ) + C ∑ i = 1 N L ( y i , w T x i ) {\displaystyle {\underset {\mathbf {w} }{\arg \min }}\;R(\mathbf {w} )+C\sum _{i=1}^{N}L(y_{i},\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i})} where w is a vector of classifier parameters, L(yi, wTxi) is a loss function that measures the discrepancy between the classifier's prediction and the true output yi for the i'th training example, R(w) is a regularization function that prevents the parameters from getting too large (causing overfitting), and C is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization and the loss function. Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function R is convex, then the above is a convex problem. Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.
Ogle app
Ogle is a free smartphone based social media application. It is available for iOS and Android. Ogle acts like a school wide forum that lets users and users' classmates share and interact. Users can share photos, videos, questions, even thoughts and watch submissions grow in popularity as other users vote and comment on them. == App Features == Campus Feed: Interact by watching and posting videos or pictures to your campus story. Photos and Videos: share what you want with many different timing options. Interact: Chat with friends and groups, or share a moment for all to see. Real-name system: choose to register an account with username and profile picture. Custom Stickers: Create stickers to add creativity and zest to your pictures. Flash Interaction: All private chat and group chat history will be deleted after 24 hours on Ogle Chat. == Controversies == Users can post anything on Ogle using text, photos, and videos. As a result, some Ogle user's sense of anonymity, posts have targeted specific schools and students with abusive and hurtful content. The Ogle app's user anonymity makes it difficult for school officials to quickly investigate issues that occur within the Ogle app. On March 28, 2016, three people were arrested after violent threats were made against an Anaheim high school. 18-year-old Miguel Meza was arrested Sunday afternoon during a traffic stop, along with his passenger, 23-year-old Johnny Aguilar. Police said both men had loaded handguns. Aguilar was also accused of violating his probation. "It is concerning the fact that they did have firearms, but we don't have a crystal ball. We can't determine if they possessed those firearms to engage in some kind of school violence or if they had it for another reason," Sgt. Daron Wyatt with the Anaheim Police Department said. Officials said Meza and Aguilar have known gang ties and detectives began investigating Meza after threats were made against the school on Ogle. On February 29, 2016, Santa Cruz County sheriff's deputies arrested a 16-year-old Aptos High School student Friday, accused of making an online threat of gun violence at Aptos High and Monte Vista Christian."He basically told detectives that it was all a joke. It's not a joke. You have multiple resources being spent to investigate these cases," said Santa Cruz County Sheriff's Sgt. Roy Morales. The schools remained open throughout the week, with a huge police presence on campus. In an anonymous emailed statement to the Daily Pilot on Thursday, the "Ogle team" said: "We are aware of the concern, and cyberbullying is absolutely NOT our intention for the app. Our goal for this app is to create a free and safe community space for students, for a better communication. We are currently working around the clock to improve the app. As a matter of fact, we are also in contact with local police departments, anti-bullying organizations and local high schools to try to help the students." In response to these incidents, Ogle expressed that they takes the safety of its users seriously and does not condone any type of behavior that is illegal or in violation of its content policies. The company also said it has instituted a content moderation team to increase review and identify and remove inappropriate content, and take action against “those who violate our community guidelines.”
Sharpness aware minimization
Sharpness Aware Minimization (SAM) is an optimization algorithm used in machine learning that aims to improve model generalization. The method seeks to find model parameters that are located in regions of the loss landscape with uniformly low loss values, rather than parameters that only achieve a minimal loss value at a single point. This approach is described as finding "flat" minima instead of "sharp" ones. The rationale is that models trained this way are less sensitive to variations between training and test data, which can lead to better performance on unseen data. The algorithm was introduced in a 2020 paper by a team of researchers including Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. == Underlying Principle == SAM modifies the standard training objective by minimizing a "sharpness-aware" loss. This is formulated as a minimax problem where the inner objective seeks to find the highest loss value in the immediate neighborhood of the current model weights, and the outer objective minimizes this value: min w max ‖ ϵ ‖ p ≤ ρ L train ( w + ϵ ) + λ ‖ w ‖ 2 2 {\displaystyle \min _{w}\max _{\|\epsilon \|_{p}\leq \rho }L_{\text{train}}(w+\epsilon )+\lambda \|w\|_{2}^{2}} In this formulation: w {\displaystyle w} represents the model's parameters (weights). L train {\displaystyle L_{\text{train}}} is the loss calculated on the training data. ϵ {\displaystyle \epsilon } is a perturbation applied to the weights. ρ {\displaystyle \rho } is a hyperparameter that defines the radius of the neighborhood (an L p {\displaystyle L_{p}} ball) to search for the highest loss. An optional L2 regularization term, scaled by λ {\displaystyle \lambda } , can be included. A direct solution to the inner maximization problem is computationally expensive. SAM approximates it by taking a single gradient ascent step to find the perturbation ϵ {\displaystyle \epsilon } . This is calculated as: ϵ ( w ) = ρ ∇ L train ( w ) ‖ ∇ L train ( w ) ‖ 2 {\displaystyle \epsilon (w)=\rho {\frac {\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_{2}}}} The optimization process for each training step involves two stages. First, an "ascent step" computes a perturbed set of weights, w adv = w + ϵ ( w ) {\displaystyle w_{\text{adv}}=w+\epsilon (w)} , by moving towards the direction of the highest local loss. Second, a "descent step" updates the original weights w {\displaystyle w} using the gradient calculated at these perturbed weights, ∇ L train ( w adv ) {\displaystyle \nabla L_{\text{train}}(w_{\text{adv}})} . This update is typically performed using a standard optimizer like SGD or Adam. == Application and Performance == SAM has been applied in various machine learning contexts, primarily in computer vision. Research has shown it can improve generalization performance in models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on image datasets including ImageNet, CIFAR-10, and CIFAR-100. The algorithm has also been found to be effective in training models with noisy labels, where it performs comparably to methods designed specifically for this problem. Some studies indicate that SAM and its variants can improve out-of-distribution (OOD) generalization, which is a model's ability to perform well on data from distributions not seen during training. Other areas where it has been applied include gradual domain adaptation and mitigating overfitting in scenarios with repeated exposure to training examples. == Limitations == A primary limitation of SAM is its computational cost. By requiring two gradient computations (one for the ascent and one for the descent) per optimization step, it approximately doubles the training time compared to standard optimizers. The theoretical convergence properties of SAM are still under investigation. Some research suggests that with a constant step size, SAM may not converge to a stationary point. The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process. The effectiveness of SAM can also be domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as GPT-style language models where each training example is seen only once, has been reported as limited in some studies. Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization. The algorithm also introduces the neighborhood size ρ {\displaystyle \rho } as a new hyperparameter, which requires tuning. == Research, Variants, and Enhancements == Active research on SAM focuses on reducing its computational overhead and improving its performance. Several variants have been proposed to make the algorithm more efficient. These include methods that attempt to parallelize the two gradient computations, apply the perturbation to only a subset of parameters, or reduce the number of computation steps required. Other approaches use historical gradient information or apply SAM steps intermittently to lower the computational burden. To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM) or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM). Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing. Theoretical work continues to analyze the algorithm's behavior, including its implicit bias towards flatter minima and the development of broader frameworks for sharpness-aware optimization that use different measures of sharpness.