Self-supervised learning

Self-supervised learning

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples, where one sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects. During SSL, the model learns in two steps. First, the task is solved based on an auxiliary or pretext classification task using pseudo-labels, which help to initialize the model parameters. Next, the actual task is performed with supervised or unsupervised learning. Self-supervised learning has produced promising results in recent years, and has found practical application in fields such as audio processing, and is being used by Facebook and others for speech recognition. == Pseudo-labels == Pseudo-labels are automatically generated labels that a model assigns to unlabeled data based on its own predictions. They are widely used in self-supervised and semi-supervised learning, where ground-truth annotations are limited or unavailable. By treating predicted labels as surrogate ground truth, learning algorithms can make use of large quantities of unlabeled data in the training process. Pseudo-labeling also plays an important role in systems that must adapt to concept drift, where the statistical properties of the data change over time. In these scenarios, the model may detect that an incoming instance deviates from previously learned behavior. The system then generates a classification result for that instance, and this predicted class is used as a pseudo-label for updating or retraining model components that are becoming outdated. This approach enables continuous adaptation in dynamic environments without requiring manual annotation. In many adaptive learning pipelines, pseudo-labels are chosen when the classifier produces sufficiently confident predictions, reducing the risk of propagating errors. These pseudo-labeled instances are then incorporated into training to refresh or evolve the model's understanding of emerging data patterns, particularly when existing components show signs of “aging” due to drift or distributional shifts. This strategy reduces reliance on manual labeling while helping maintain long-term model performance. == Types == === Autoassociative self-supervised learning === Autoassociative self-supervised learning is a specific category of self-supervised learning where a neural network is trained to reproduce or reconstruct its own input data. In other words, the model is tasked with learning a representation of the data that captures its essential features or structure, allowing it to regenerate the original input. The term "autoassociative" comes from the fact that the model is essentially associating the input data with itself. This is often achieved using autoencoders, which are a type of neural network architecture used for representation learning. Autoencoders consist of an encoder network that maps the input data to a lower-dimensional representation (latent space), and a decoder network that reconstructs the input from this representation. The training process involves presenting the model with input data and requiring it to reconstruct the same data as closely as possible. The loss function used during training typically penalizes the difference between the original input and the reconstructed output (e.g. mean squared error). By minimizing this reconstruction error, the autoencoder learns a meaningful representation of the data in its latent space. === Contrastive self-supervised learning === For a binary classification task, training data can be divided into positive examples and negative examples. Positive examples are those that match the target. For example, if training a classifier to identify birds, the positive training data would include images that contain birds. Negative examples would be images that do not. Contrastive self-supervised learning uses both positive and negative examples. The loss function in contrastive learning is used to minimize the distance between positive sample pairs, while maximizing the distance between negative sample pairs. An early example uses a pair of 1-dimensional convolutional neural networks to process a pair of images and maximize their agreement. Contrastive Language-Image Pre-training (CLIP) allows joint pretraining of a text encoder and an image encoder, such that a matching image-text pair have image encoding vector and text encoding vector that span a small angle (having a large cosine similarity). InfoNCE (Noise-Contrastive Estimation) is a method to optimize two models jointly, based on Noise Contrastive Estimation (NCE). Given a set X = { x 1 , … x N } {\displaystyle X=\left\{x_{1},\ldots x_{N}\right\}} of N {\displaystyle N} random samples containing one positive sample from p ( x t + k ∣ c t ) {\displaystyle p\left(x_{t+k}\mid c_{t}\right)} and N − 1 {\displaystyle N-1} negative samples from the 'proposal' distribution p ( x t + k ) {\displaystyle p\left(x_{t+k}\right)} , it minimizes the following loss function: L N = − E X [ log ⁡ f k ( x t + k , c t ) ∑ x j ∈ X f k ( x j , c t ) ] {\displaystyle {\mathcal {L}}_{\mathrm {N} }=-\mathbb {E} _{X}\left[\log {\frac {f_{k}\left(x_{t+k},c_{t}\right)}{\sum _{x_{j}\in X}f_{k}\left(x_{j},c_{t}\right)}}\right]} === Non-contrastive self-supervised learning === Non-contrastive self-supervised learning (NCSSL) uses only positive examples. Counterintuitively, NCSSL converges on a useful local minimum rather than reaching a trivial solution, with zero loss. For the example of binary classification, it would trivially learn to classify each example as positive. Effective NCSSL requires an extra predictor on the online side that does not back-propagate on the target side. === Joint-Embedding and Predictive Architectures === A major class of self-supervised learning moves beyond contrastive pairs, instead maximizing the agreement between views while preventing collapse through statistical constraints. Rooted in Deep Canonical Correlation Analysis (Deep CCA), this approach includes Joint-Embedding Architectures (JEA) like Barlow Twins and VICReg, which enforce covariance constraints to learn invariant representations without negative sampling. Deep Latent Variable Path Modelling (DLVPM) generalizes this to multimodal systems, using path models to enforce correlation and orthogonality across diverse data types. In 2022 Yann LeCun introduced Joint-Embedding Predictive Architectures (JEPA) as a step towards decision making, reasoning, and autonomous human intelligence in machines, including self-improvement through autonomous learning. Founded in representation learning, LeCun included the concept of a “world model” in JEPA which aims to enable machines to replicate human intellect by providing machines with a concept for the world in which they exist. Unlike autoencoders, JEPAs operate entirely in latent space, avoiding pixel-level noise to focus on semantic structure. Rather than just learning invariance, JEPAs learn by predicting masked latent representations from visible context. JEPA has been applied to domains such as image analysis, audio processing, and motion in images and video. == Comparison with other forms of machine learning == SSL belongs to supervised learning methods insofar as the goal is to generate a classified output from the input. At the same time, however, it does not require the explicit use of labeled input-output pairs. Instead, correlations, metadata embedded in the data, or domain knowledge present in the input are implicitly and autonomously extracted from the data. These supervisory signals, extracted from the data, can then be used for training. SSL is similar to unsupervised learning in that it does not require labels in the sample data. Unlike unsupervised learning, however, learning is not done using inherent data structures. Semi-supervised learning combines supervised and unsupervised learning, requiring only a small portion of the learning data be labeled. In transfer learning, a model designed for one task is reused on a different task. Training an autoencoder intrinsically constitutes a self-supervised process, because the output pattern needs to become an optimal reconstruction of the input pattern itself. However, in current jargon, the term 'self-supervised' often refers to tasks based on a pretext-task training setup

Learning automaton

A learning automaton is one type of machine learning algorithm studied since 1970s. Learning automata select their current action based on past experiences from the environment. It will fall into the range of reinforcement learning if the environment is stochastic and a Markov decision process (MDP) is used. == History == Research in learning automata can be traced back to the work of Michael Lvovitch Tsetlin in the early 1960s in the Soviet Union. Together with some colleagues, he published a collection of papers on how to use matrices to describe automata functions. Additionally, Tsetlin worked on reasonable and collective automata behaviour, and on automata games. Learning automata were also investigated by researches in the United States in the 1960s. However, the term learning automaton was not used until Narendra and Thathachar introduced it in a survey paper in 1974. == Definition == A learning automaton is an adaptive decision-making unit situated in a random environment that learns the optimal action through repeated interactions with its environment. The actions are chosen according to a specific probability distribution which is updated based on the environment response the automaton obtains by performing a particular action. With respect to the field of reinforcement learning, learning automata are characterized as policy iterators. In contrast to other reinforcement learners, policy iterators directly manipulate the policy π. Another example for policy iterators are evolutionary algorithms. Formally, Narendra and Thathachar define a stochastic automaton to consist of: a set X of possible inputs, a set Φ = { Φ1, ..., Φs } of possible internal states, a set α = { α1, ..., αr } of possible outputs, or actions, with r ≤ s, an initial state probability vector p(0) = ≪ p1(0), ..., ps(0) ≫, a computable function A which after each time step t generates p(t+1) from p(t), the current input, and the current state, and a function G: Φ → α which generates the output at each time step. In their paper, they investigate only stochastic automata with r = s and G being bijective, allowing them to confuse actions and states. The states of such an automaton correspond to the states of a "discrete-state discrete-parameter Markov process". At each time step t=0,1,2,3,..., the automaton reads an input from its environment, updates p(t) to p(t+1) by A, randomly chooses a successor state according to the probabilities p(t+1) and outputs the corresponding action. The automaton's environment, in turn, reads the action and sends the next input to the automaton. Frequently, the input set X = { 0,1 } is used, with 0 and 1 corresponding to a nonpenalty and a penalty response of the environment, respectively; in this case, the automaton should learn to minimize the number of penalty responses, and the feedback loop of automaton and environment is called a "P-model". More generally, a "Q-model" allows an arbitrary finite input set X, and an "S-model" uses the interval [0,1] of real numbers as X. A visualised demo/ Art Work of a single Learning Automaton had been developed by μSystems (microSystems) Research Group at Newcastle University. == Finite action-set learning automata == Finite action-set learning automata (FALA) are a class of learning automata for which the number of possible actions is finite or, in more mathematical terms, for which the size of the action-set is finite.

Objective vision

Objective Vision (Object Oriented Visionary) is a project mainly aimed at real-time computer vision and simulation vision of living creatures. it has three sections containing an open-source library of programming functions for using inside the projects, Virtual laboratory for scholars to check the application of functions directly and by command-line code for external and instant access, and the research section consists of paperwork and libraries to expand the scientific prove of works. == Background == The process has been used in the OVC libraries is as same as what's happening when living see a picture, and it's designed to give the researchers to experience the brain's visual cortex most close simulation for picture perception. The OVC was designed to work as a simulated visual cortex that has a critical job in processing and classify the objects to make it easier to work with pictures and graphical perception and processing. The human brain is much more aware of how it solves complex problems such as playing chess or solving algebra equations, which is why computer programmers have had so much success building machines that emulate this type of activity. but when the whole process is still a riddle that how the entities visionary system works. The project was simulated the visionary system by how it starts to convert the signals to image(actually the edges and colors) and then recognizing the shapes to find a relation between brain's information and image. The Objective Visionary system actually is concentrating on the separable sections, this separation gives the application visionary system the excellence processing result, because with this method the system do not waste much time on processing non significant sections and signals. this operation in the Objective Vision project called objective processing and because the O.V. mission is focused on human visionary simulation, so the developer refers with Objective Vision. == History == Objective-Vision is a Human (Natural) Visionary simulation Project developed by Michael Bidollahkhany. Following an explosion of interest during the 21st century were characterized by the maturing of the field and the significant growth of active applications; simulation of visionary systems, visionary based autonomous vehicle guidance, medical imaging (2D and 3D) and automatic surveillance are the most rapidly developing areas. This progress can be seen in an increasing number of software and hardware products on the market, as well as in a number of digital image processing software and APIs and also machine vision courses offered at universities worldwide. Therefore, the OVC project has been released as a research software project in 2016. One of important parts of this project was O.V.C. (Objective Vision Class library), that was designed to able companies and scientists to use the brain's most likely functionalities as visionary libraries to simplify and accelerate the image processing algorithms developments. The project started under MIT copyright license, but since 2018 the project continued as classified based on sponsors opinion. == The Algorithm == As developers claimed the algorithm used in the class library and developer's kit of project has been developed based on natural visionary system, and the functionalities containing image processing, optimization and labeling etc. are mostly upgraded and near techniques. Suppose that we've a picture of a jungle, or somewhere else, with this library developer will be able to manipulate not only the pixel of images for data extraction, but automatically based on which algorithm is used and image quality, he can manipulate directly a list of objects, same pixels and every data project needs to have, said the developer in his lecture answering how the algorithm works. === Viewpoint === For long times digital image processing and storing, was actually by processing just pixels; this Project tries to present a new kind of image processing and even storing, "objective vision" or "object-oriented visionary" is called. This project officially launched in May 2016, with the aim of making more adaptation between Computer Vision (Include Visionary, Digital image processing, discernment and even Perception) and Human Visual System; about development of the project: "...so we decided to research on Human Vision System, besides we worked on Artificial Retinal image processing and new visionary optimization unit(Presented at Istanbul Technical University Conference(Turkey 2015-2016)) and grew our research to Visionary CORTEX of Brain", Michael Bidollahkhany said. == Applications == The OVC application areas include: 2D and 3D feature toolkits Egomotion estimation Human–computer interaction (HCI) Mobile robotics Motion understanding Object identification Segmentation and recognition Stereopsis stereo vision: depth perception from two cameras Structure from motion (SFM) Motion tracking == Programming language == In first initial release of Objective Visionary Project the algorithm has been written in C++ and C#, and the virtual laboratory has been developed in C# and Delphi. Based on developers last lecture since the second release the complete algorithm has been re-written in C# based on .Net Core 1.0 to make it easier to work on different operating systems.

Irwin Sobel

Irwin Sobel (born September 12, 1940) is a scientist and researcher in digital image processing. == Biography == Irwin Sobel was born in New York City. He graduated from MIT in 1961 and completed his Ph.D. research at the Stanford Artificial Intelligence Project (SAIL) with thesis Camera Models and Machine Perception. His Ph.D. advisor was Jerome A. Feldman. Starting in 1973, he spent nine years doing postdoctoral research at Columbia University. After 1982, he worked as a Senior Researcher at HP Labs. == Sobel operator == In 1968, Sobel gave a talk entitled "An Isotropic 3x3 Image Gradient Operator" at SAIL; this method became known as the Sobel operator. It was developed jointly with a colleague, Gary Feldman, also at SAIL.

Evolutionary robotics

Evolutionary robotics is an embodied approach to Artificial Intelligence (AI) in which robots are automatically designed using Darwinian principles of natural selection. The design of a robot, or a subsystem of a robot such as a neural controller, is optimized against a behavioral goal (e.g. run as fast as possible). Usually, designs are evaluated in simulations as fabricating thousands or millions of designs and testing them in the real world is prohibitively expensive in terms of time, money, and safety. An evolutionary robotics experiment starts with a population of randomly generated robot designs. The worst performing designs are discarded and replaced with mutations and/or combinations of the better designs. This evolutionary algorithm continues until a prespecified amount of time elapses or some target performance metric is surpassed. Evolutionary robotics methods are particularly useful for engineering machines that must operate in environments in which humans have limited intuition (nanoscale, space, etc.). Evolved simulated robots can also be used as scientific tools to generate new hypotheses in biology and cognitive science, and to test old hypothesis that require experiments that have proven difficult or impossible to carry out in reality. == History == In the early 1990s, two separate European groups demonstrated different approaches to the evolution of robot control systems. Dario Floreano and Francesco Mondada at EPFL evolved controllers for the Khepera robot. Adrian Thompson, Nick Jakobi, Dave Cliff, Inman Harvey, and Phil Husbands evolved controllers for a Gantry robot at the University of Sussex. However the body of these robots was presupposed before evolution. The first simulations of evolved robots were reported by Karl Sims and Jeffrey Ventrella of the MIT Media Lab, also in the early 1990s. However these so-called virtual creatures never left their simulated worlds. The first evolved robots to be built in reality were 3D-printed by Hod Lipson and Jordan Pollack at Brandeis University at the turn of the 21st century.

Human–robot interaction

Human–robot interaction (HRI) is the study of interactions between humans and robots. Human–robot interaction is a multidisciplinary field with contributions from human–computer interaction, artificial intelligence, robotics, natural language processing, design, psychology and philosophy. A subfield known as physical human–robot interaction (pHRI) has tended to focus on device design to enable people to safely interact with robotic systems. == Origins == Human–robot interaction has been a topic of both science fiction and academic speculation even before any robots existed. Because much of active HRI development depends on natural language processing, many aspects of HRI are continuations of human communications, a field of research which is much older than robotics. The origin of HRI as a discrete problem was stated by 20th-century author Isaac Asimov in 1941, in his novel I, Robot. Asimov coined Three Laws of Robotics, namely: A robot may not injure a human being or, through inaction, allow a human being to come to harm. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law. A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws. These three laws provide an overview of the goals engineers and researchers hold for safety in the HRI field, although the fields of robot ethics and machine ethics are more complex than these three principles. However, generally human–robot interaction prioritizes the safety of humans that interact with potentially dangerous robotics equipment. Solutions to this problem range from the philosophical approach of treating robots as ethical agents (individuals with moral agency), to the practical approach of creating safety zones. These safety zones use technologies such as lidar to detect human presence or physical barriers to protect humans by preventing any contact between machine and operator. Although initially robots in the human–robot interaction field required some human intervention to function, research has expanded this to the extent that fully autonomous systems are now far more common than in the early 2000s. Autonomous systems include from simultaneous localization and mapping systems which provide intelligent robot movement to natural-language processing and natural-language generation systems which allow for natural, human-esque interaction which meet well-defined psychological benchmarks. Anthropomorphic robots (machines which imitate human body structure) are better described by the biomimetics field, but overlap with HRI in many research applications. Examples of robots which demonstrate this trend include Willow Garage's PR2 robot, the NASA Robonaut, and Honda ASIMO. However, robots in the human–robot interaction field are not limited to human-like robots: Paro and Kismet are both robots designed to elicit emotional response from humans, and so fall into the category of human–robot interaction. Goals in HRI range from industrial manufacturing through Cobots, medical technology through rehabilitation, autism intervention, and elder care devices, entertainment, human augmentation, and human convenience. Future research therefore covers a wide range of fields, much of which focuses on assistive robotics, robot-assisted search-and-rescue, and space exploration. == The goal of friendly human–robot interactions == Robots are artificial agents with capacities of perception and action in the physical world often referred by researchers as workspace. Their use has been generalized in factories but nowadays they tend to be found in the most technologically advanced societies in such critical domains as search and rescue, military battle, mine and bomb detection, scientific exploration, law enforcement, entertainment and hospital care. These new domains of applications imply a closer interaction with the user, sharing the workspace but also goals in terms of task achievement. The subfield of physical human–robot interaction (pHRI) has largely focused on device design to enable people to safely interact with robotic systems but is increasingly developing algorithmic approaches in an attempt to support fluent and expressive interactions between humans and robotic systems. With the advance in AI, the research is focusing on one part towards the safest physical interaction but also on a socially correct interaction, dependent on cultural criteria. The goal is to build an intuitive, and easy communication with the robot through speech, gestures, and facial expressions. Kerstin Dautenhahn refers to friendly Human–robot interaction as "Robotiquette" defining it as the "social rules for robot behaviour (a 'robotiquette') that is comfortable and acceptable to humans" The robot has to adapt itself to our way of expressing desires and orders and not the contrary. But every day environments such as homes have much more complex social rules than those implied by factories or even military environments. Thus, the robot needs perceiving and understanding capacities to build dynamic models of its surroundings. It needs to categorize objects, recognize and locate humans and further recognize their emotions. The need for dynamic capacities pushes forward every sub-field of robotics. Furthermore, by understanding and perceiving social cues, robots can enable collaborative scenarios with humans. For example, with the rapid rise of personal fabrication machines such as desktop 3D printers, laser cutters, etc., entering our homes, scenarios may arise where robots can collaboratively share control, co-ordinate and achieve tasks together. Industrial robots have already been integrated into industrial assembly lines and are collaboratively working with humans. The social impact of such robots have been studied and has indicated that workers still treat robots and social entities, rely on social cues to understand and work together. On the other end of HRI research the cognitive modelling of the "relationship" between human and the robots benefits the psychologists and robotic researchers the user study are often of interests on both sides. This research endeavours part of human society. For effective human – humanoid robot interaction numerous communication skills and related features should be implemented in the design of such artificial agents/systems. == General HRI research == HRI research spans a wide range of fields, some general to the nature of HRI. === Methods for perceiving humans === Methods for perceiving humans in the environment are based on sensor information. Research on sensing components and software led by Microsoft provide useful results for extracting the human kinematics (see Kinect). An example of older technique is to use colour information for example the fact that for light skinned people the hands are lighter than the clothes worn. In any case a human modelled a priori can then be fitted to the sensor data. The robot builds or has (depending on the level of autonomy the robot has) a 3D mapping of its surroundings to which is assigned the humans locations. Most methods intend to build a 3D model through vision of the environment. The proprioception sensors permit the robot to have information over its own state. This information is relative to a reference. Theories of proxemics may be used to perceive and plan around a person's personal space. A speech recognition system is used to interpret human desires or commands. By combining the information inferred by proprioception, sensor and speech the human position and state (standing, seated). In this matter, natural-language processing is concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural-language data. For instance, neural-network architectures and learning algorithms that can be applied to various natural-language processing tasks including part-of-speech tagging, chunking, named-entity recognition, and semantic role labeling. === Methods for motion planning === Motion planning in dynamic environments is a challenge that can at the moment only be achieved for robots with 3 to 10 degrees of freedom. Humanoid robots or even 2 armed robots, which can have up to 40 degrees of freedom, are unsuited for dynamic environments with today's technology. However lower-dimensional robots can use the potential field method to compute trajectories which avoid collisions with humans. === Cognitive models and theory of mind === Humans exhibit negative social and emotional responses as well as decreased trust toward some robots that closely, but imperfectly, resemble humans; this phenomenon has been termed the "Uncanny Valley". However recent research in telepresence robots has established that mimicking human body postures and expressive gestures has made the robots likeable and engaging in a remote setting. Further, the presence o

The Future of Work and Death

The Future of Work and Death is a 2016 documentary by Sean Blacknell and Wayne Walsh about the exponential growth of technology. The film showed at several film festivals including Raindance Film Festival, International Film Festival Rotterdam, Academia Film Olomouc and CPH:DOX. In May 2017 it received an official screening at the European Commission. It was distributed by First Run Features and Journeyman Pictures and was released on iTunes, Amazon Prime and On-demand on 9 May 2017. The film was made available on Sundance Now on 27 November 2017. A companion piece to the film, The Cost of Living, a documentary concerning universal basic income in Britain, was released on Amazon Prime on 8 October 2020. == Synopsis == World experts in the fields of futurology, anthropology, neuroscience, and philosophy consider the impact of technological advances on the two 'certainties' of human life; work and death. Charting human developments from Homo habilis, past the Industrial Revolution, to the digital age and beyond, the film looks at the shocking exponential rate at which mankind has managed to create technologies to ease the process of living. As we embark on the next phase of our adaptation, with automation and artificial intelligence signifying the complete move from man to machine, the film asks what the implications are for human fulfilment in an approaching era of job obsolescence and extreme longevity. == Cast == Dudley Sutton – Narrator Aubrey de Grey – Biomedical gerontologist and CSO of the SENS Research Foundation Will Self – Writer, journalist, political commentator and Professor of Contemporary Thought at Brunel University Rudolph E. Tanzi – Professor of Neurology at Harvard University and Director of the Genetics and Aging Research Unit at Massachusetts General Hospital (MGH) Martin Ford – Futurist and author Steve Fuller – Auguste Comte Chair in Social Epistemology at the Department of sociology at University of Warwick Murray Shanahan – Professor of Cognitive Robotics at Imperial College London Gray Scott – Futurist, executive producer of this production Vivek Wadhwa – Entrepreneur, academic and Director of Research at the Center for Entrepreneurship and Research Commercialization at the Pratt School of Engineering, Duke University Zoltan Istvan – Transhumanist and journalist Joanna Cook – Anthropologist, University College London Nicholas Kamara – Physician, Kable Hospital David Pearce – Transhumanist philosopher and co-founder of Humanity+ Peter Cochrane – Futurist and entrepreneur John Harris – Bioethicist, philosopher and Director of the Institute for Science, Ethics and Innovation at the University of Manchester Riva Melissa-Tez – Entrepreneur and transhumanist Ian Pearson – Futurologist Stuart Armstrong – Artificial intelligence researcher at Future of Humanity Institute