Léon Bottou

Léon Bottou

Léon-Yves Bottou (French pronunciation: [leɔ̃ bɔtu]; born 1965) is a researcher best known for his work in machine learning and data compression. His work presents stochastic gradient descent as a fundamental learning algorithm. He is also one of the main creators of the DjVu image compression technology (together with Yann LeCun and Patrick Haffner), and the maintainer of DjVuLibre, the open source implementation of DjVu. He is the original developer of the Lush programming language. == Life == Léon Bottou was born in France in 1965. He obtained the Diplôme d'Ingénieur from École Polytechnique in 1987, a Magistère de Mathématiques Fondamentales et Appliquées et d’Informatique from École Normale Supérieure in 1988, a Diplôme d'Études Approndies in Computer Science in 1988, in 1988, and a PhD from Université Paris-Sud in 1991. In 1988, in collaboration with Yann LeCun, he published SN, a software package for simulating artificial neural networks. His master's thesis concerned using Time Delay Neural Networks for speech recognition. He then joined the Adaptive Systems Research Department at AT&T Bell Laboratories in Holmdel, New Jersey, where he collaborated with Vladimir Vapnik on local learning algorithms. in 1992, he returned to France and founded Neuristique S.A., a company that produced machine learning tools and one of the first data mining software packages, including Lush, an object-oriented programming language based on C and Lisp designed for training and using large-scale neural networks. In 1995, he returned to Bell Laboratories, where he developed a number of new machine learning methods, such as Graph Transformer Networks (similar to conditional random field), and applied them to handwriting recognition and OCR. The bank check recognition system that he helped develop was widely deployed by NCR and other companies, reading over 10% of all the checks in the US in the late 1990s and early 2000s. In 1996, he joined AT&T Labs and worked primarily on the DjVu image compression technology, that is used by some websites, notably the Internet Archive, to distribute scanned documents. Between 2002 and 2010, he was a research scientist at NEC Laboratories in Princeton, New Jersey, where he focused on the theory and practice of machine learning with large-scale datasets, on-line learning, and stochastic optimization methods. He developed the open source software LaSVM for fast large-scale support vector machine, and stochastic gradient descent software for training linear SVM and Conditional Random Fields. In 2010 he joined the Microsoft adCenter in Redmond, Washington, and in 2012 became a Principal Researcher at Microsoft Research in New York City. In March 2015 he joined Facebook Artificial Intelligence Research, also in New York City, as a research lead. His work in gradient descent argued that both stochastic gradient descent and batch gradient descent reach similar levels of loss with the same number of training samples, but SGD is faster when running on large datasets. He also argued that second-order gradient descent methods, such as quasi-Newton methods, can be beneficial compared to plain SGD. See (Bottou et al 2018) for a review. He was program chair of the 2013 Conference on Neural Information Processing Systems and the 2009 International Conference on Machine Learning. In 2007, he was received one of the first Blavatnik Awards for Young Scientists from the Blavatnik Family Foundation and the New York Academy of Sciences.

FastTrack Automation Studio

FastTrack Automation Studio (formerly known as FastTrack Scripting Host), often referred to as just FastTrack, is a scripting language for Windows IT System Administrators. The product’s goal is to handle any kind of scripting that might be required to automate processes with Microsoft Windows networks. == Manufacturer == FastTrack is produced by FastTrack Software, which is headquartered in Aalborg, Denmark. The product is promoted by the manufacturer as a one-stop shop for Windows script writers and its development paradigm is “one operation = one script line”. Script writers use a purpose-built editor to create scripts, inserting script lines via menus, drag’n drop, or simply typing them in. Scripts may be used out of the box, created from scratch, imported from forums or other users, or customized from product documentation. == Types of scripts == Simple scripts include: Outlook Signatures Login scripts Backup and replication scripts Inventory and asset management Automated Windows OS installation and deployment Automated application software deployment Active Directory scripts More advanced scripts include: SCCM task sequences Citrix ICA and RDP Clients built-in Deploying applications to server farms Deploying GPO MSI files SQL Server scripts == Basic structure == Under the hood, scripts comprise commands, functions, collections, and conditions. When a script is executed these components are converted into many lines of C# code, sometimes hundreds of lines, depending on the particular script operation. Scripts can be compiled into EXE files or MSI packages and treated as standalone Windows applications. == History == FastTrack Scripting Host (FastTrack) was first developed around 2006 to ease the administration burden of IT System Administrators on Windows networks. === Product idea === The idea for the product came from founder and President of FastTrack Software, Lars Pedersen, who has a background in systems administration. Previously with Telenor, Denmark’s major telephone company, Pedersen performed various roles in systems administration, programming and web development. He also worked as a consultant and developer on several major projects at various companies in Europe. Dissatisfied from his own experiences and frustrations administering Windows networks, Pederson looked for a way to make life easier for system administrators. In particular, he wanted something that could minimize the amount of time needed each day to perform routine and mundane tasks, which was a waste of time and expertise that should have been committed to other projects. === Development === Leading a small team of developers, Pedersen developed FastTrack Scripting Host to simplify and automate the routine tasks of system administrators. The resulting product is definitely a scripting language, but it can be used intuitively like a programming language, without requiring users to learn syntax or other concepts typically associated with programming languages. === Marketing === In April 2010, FastTrack Software entered into an agreement with Binary Research International Archived 2008-10-15 at the Wayback Machine, based in the city of Milwaukee, United States to market and sell the product globally. === Awards === FSH received a Windows IT Pro Community Choice award in 2012. == Versions == The first version was produced in June 2006 and contained 51 components, which are the commands, functions, conditions and collections making up FastTrack. The following table summarizes dates and components for major releases. Companies and organizations such as NOAA, Kawasaki, and Goodyear have used and implemented the FastTrack Scripting Host. == Comparison with other scripting software == FastTrack Scripting Host Kixtart PowerShell ScriptLogic VBScript

Perusall

Perusall is a social web annotation tool intended for use by students at schools and universities. It allows users to annotate the margins of a text in a virtual group setting that is similar to social media—with upvoting, emojis, chat functionality, and notification. It also includes automatic AI grading. == History == Perusall began as a research project at Harvard University. It later became an educational product for students and teachers. As of 2024, Perusall states more than 5 million students have used the tool at over 5,000 educational institutions in 112 countries." == Functionality == Perusall integrates with learning management systems such as Moodle, Canvas and Blackboard to aid with collaborative annotation. The tool supports annotation of a range of media including text, images, equations, videos, PDFs and snapshots of webpages.

Time series

In mathematics, a time series is a sequence of data points indexed, listed, or graphed in chronological order. Most commonly, a time series consists of observations recorded at successive equally spaced points in time. Thus, it represents a form of discrete-time data. A time series may describe measurements collected over seconds, days, years, or even centuries. Common examples include heights of ocean tides, counts of sunspots, daily temperature readings, and the closing values of stock market indices such as the Dow Jones Industrial Average. A time series is often visualized using a run chart (a type of temporal line chart), which helps identify patterns such as trends, seasonal effects, and irregular fluctuations. Time series are widely used in statistics, actuarial science, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and many other areas of applied science and engineering that involve temporal measurements. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. Generally, time series data is modeled as a stochastic process. While regression analysis is often employed in such a way as to test relationships between one or more different time series, this type of analysis is not usually called "time series analysis", which refers in particular to relationships between different points in time within a single series. Time series data have a natural temporal ordering. This makes time series analysis distinct from cross-sectional studies, in which there is no natural ordering of the observations (e.g. explaining people's wages by reference to their respective education levels, where the individuals' data could be entered in any order). Time series analysis is also distinct from spatial data analysis where the observations typically relate to geographical locations (e.g. accounting for house prices by the location as well as the intrinsic characteristics of the houses). A stochastic model for a time series will generally reflect the fact that observations close together in time will be more closely related than observations further apart. In addition, time series models will often make use of the natural one-way ordering of time so that values for a given period will be expressed as deriving in some way from past values, rather than from future values (see time reversibility). Time series analysis can be applied to real-valued, continuous data, discrete numeric data, or discrete symbolic data (i.e. sequences of characters, such as letters and words in the English language). == Methods for analysis == Methods for time series analysis may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral analysis and wavelet analysis; the latter include auto-correlation and cross-correlation analysis. In the time domain, correlation and analysis can be made in a filter-like manner using scaled correlation, thereby mitigating the need to operate in the frequency domain. Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using an autoregressive or moving-average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic process. By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure. Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate. == Panel data == A time series is one type of panel data. Panel data is the general class, a multidimensional data set, whereas a time series data set is a one-dimensional panel (as is a cross-sectional dataset). A data set may exhibit characteristics of both panel data and time series data. One way to tell is to ask what makes one data record unique from the other records. If the answer is the time data field, then this is a time series data set candidate. If determining a unique record requires a time data field and an additional identifier which is unrelated to time (e.g. student ID, stock symbol, country code), then it is panel data candidate. If the differentiation lies on the non-time identifier, then the data set is a cross-sectional data set candidate. == Analysis == There are several types of motivation and data analysis available for time series which are appropriate for different purposes. === Motivation === In the context of statistics, econometrics, quantitative finance, seismology, meteorology, and geophysics the primary goal of time series analysis is forecasting. In the context of signal processing, control engineering and communication engineering it is used for signal detection. Other applications are in data mining, pattern recognition and machine learning, where time series analysis can be used for clustering, classification, query by content, anomaly detection as well as forecasting. === Exploratory analysis === A simple way to examine a regular time series is manually with a line chart. The datagraphic shows tuberculosis deaths in the United States, along with the yearly change and the percentage change from year to year. The total number of deaths declined in every year until the mid-1980s, after which there were occasional increases, often proportionately - but not absolutely - quite large. A study of corporate data analysts found two challenges to exploratory time series analysis: discovering the shape of interesting patterns, and finding an explanation for these patterns. Visual tools that represent time series data as heat map matrices can help overcome these challenges. === Estimation, filtering, and smoothing === This approach may be based on harmonic analysis and filtering of signals in the frequency domain using the Fourier transform, and spectral density estimation. Its development was significantly accelerated during World War II by mathematician Norbert Wiener, electrical engineers Rudolf E. Kálmán, Dennis Gabor and others for filtering signals from noise and predicting signal values at a certain point in time. An equivalent effect may be achieved in the time domain, as in a Kalman filter; see filtering and smoothing for more techniques. Other related techniques include: Autocorrelation analysis to examine serial dependence Spectral analysis to examine cyclic behavior which need not be related to seasonality. For example, sunspot activity varies over 11 year cycles. Other common examples include celestial phenomena, weather patterns, neural activity, commodity prices, and economic activity. Separation into components representing trend, seasonality, slow and fast variation, and cyclical irregularity: see trend estimation and decomposition of time series === Curve fitting === Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data, and is subject to a degree of uncertainty since it may reflect the method used to construct the curve as much as it reflects the observed data. For processes that are expected to generally grow in magnitude one of the curves in the graphic (and many others) can be fitted by estimating their parameters. The construction of economic time series involves the estimation of some components for some dates by interpolation between values ("benchmarks") for earlier and later dates. Interpolation is estimation of an unknown quantity between two known quantities (historical data), or drawing conclusions about missing information from the available information ("reading between the lines"). Interpolation is useful where the data surrounding the missing data is available and its trend, seasonality, and longer-term cycles are known. This is often done by using a relat

Energy-based model

An energy-based model (EBM), also called Canonical Ensemble Learning (CEL) or Learning via Canonical Ensemble (LCE), is an application of canonical ensemble formulation from statistical physics for learning from data. The approach prominently appears in generative artificial intelligence. EBMs provide a unified framework for many probabilistic and non-probabilistic approaches to such learning, particularly for training graphical and other structured models. An EBM learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Energy-based generative neural networks is a class of generative models, which aim to learn explicit probability distributions of data in the form of energy-based models, the energy functions of which are parameterized by modern deep neural networks. Boltzmann machines are a special form of energy-based models with a specific parametrization of the energy. == Description == For a given input x {\displaystyle x} , the model describes an energy E θ ( x ) {\displaystyle E_{\theta }(x)} such that the Boltzmann distribution P θ ( x ) = e − β E θ ( x ) Z ( θ ) {\displaystyle P_{\theta }(x)={e^{-\beta E_{\theta }(x)} \over Z(\theta )}} is a probability (density), and typically β = 1 {\displaystyle \beta =1} . Since the normalization constant: Z ( θ ) := ∫ x ∈ X e − β E θ ( x ) d x {\displaystyle Z(\theta ):=\int _{x\in X}e^{-\beta E_{\theta }(x)}dx} (also known as the partition function) depends on all the Boltzmann factors of all possible inputs x {\displaystyle x} , it cannot be easily computed or reliably estimated during training simply using standard maximum likelihood estimation. However, for maximizing the likelihood during training, the gradient of the log-likelihood of a single training example x {\displaystyle x} is given by using the chain rule: ∂ θ log ⁡ ( P θ ( x ) ) = E x ′ ∼ P θ [ ∂ θ E θ ( x ′ ) ] − ∂ θ E θ ( x ) ( ∗ ) {\displaystyle \partial _{\theta }\log \left(P_{\theta }(x)\right)=\mathbb {E} _{x'\sim P_{\theta }}[\partial _{\theta }E_{\theta }(x')]-\partial _{\theta }E_{\theta }(x)\,()} The expectation in the above formula for the gradient can be approximately estimated by drawing samples x ′ {\displaystyle x'} from the distribution P θ {\displaystyle P_{\theta }} using Markov chain Monte Carlo (MCMC). Early energy-based models, such as the 2003 Boltzmann machine by Hinton, estimated this expectation via blocked Gibbs sampling. Newer approaches make use of more efficient Stochastic Gradient Langevin Dynamics (LD), drawing samples using: x 0 ′ ∼ P 0 , x i + 1 ′ = x i ′ − α 2 ∂ E θ ( x i ′ ) ∂ x i ′ + ϵ {\displaystyle x_{0}'\sim P_{0},x_{i+1}'=x_{i}'-{\frac {\alpha }{2}}{\frac {\partial E_{\theta }(x_{i}')}{\partial x_{i}'}}+\epsilon } , where ϵ ∼ N ( 0 , α ) {\displaystyle \epsilon \sim {\mathcal {N}}(0,\alpha )} . A replay buffer of past values x i ′ {\displaystyle x_{i}'} is used with LD to initialize the optimization module. The parameters θ {\displaystyle \theta } of the neural network are therefore trained in a generative manner via MCMC-based maximum likelihood estimation: the learning process follows an "analysis by synthesis" scheme, where within each learning iteration, the algorithm samples the synthesized examples from the current model by a gradient-based MCMC method (e.g., Langevin dynamics or Hybrid Monte Carlo), and then updates the parameters θ {\displaystyle \theta } based on the difference between the training examples and the synthesized ones – see equation ( ∗ ) {\displaystyle ()} . This process can be interpreted as an alternating mode seeking and mode shifting process, and also has an adversarial interpretation. Essentially, the model learns a function E θ {\displaystyle E_{\theta }} that associates low energies to correct values, and higher energies to incorrect values. After training, given a converged energy model E θ {\displaystyle E_{\theta }} , the Metropolis–Hastings algorithm can be used to draw new samples. The acceptance probability is given by: P a c c ( x i → x ∗ ) = min ( 1 , P θ ( x ∗ ) P θ ( x i ) ) . {\displaystyle P_{acc}(x_{i}\to x^{})=\min \left(1,{\frac {P_{\theta }(x^{})}{P_{\theta }(x_{i})}}\right).} == History == The term "energy-based models" was first coined in a 2003 JMLR paper where the authors defined a generalisation of independent components analysis to the overcomplete setting using EBMs. Other early work on EBMs proposed models that represented energy as a composition of latent and observable variables. == Characteristics == EBMs demonstrate useful properties: Simplicity and stability. The EBM is the only object that needs to be designed and trained. Separate networks need not be trained to ensure balance. Adaptive computation time. An EBM can generate sharp, diverse samples or (more quickly) coarse, less diverse samples. Given infinite time, this procedure produces true samples. Flexibility. In Variational Autoencoders (VAE) and flow-based models, the generator learns a map from a continuous space to a (possibly) discontinuous space containing different data modes. EBMs can learn to assign low energies to disjoint regions (multiple modes). Adaptive generation. EBM generators are implicitly defined by the probability distribution, and automatically adapt as the distribution changes (without training), allowing EBMs to address domains where generator training is impractical, as well as minimizing mode collapse and avoiding spurious modes from out-of-distribution samples. Compositionality. Individual models are unnormalized probability distributions, allowing models to be combined through product of experts or other hierarchical techniques. == Experimental results == On image datasets such as CIFAR-10 and ImageNet 32x32, an EBM model generated high-quality images relatively quickly. It supported combining features learned from one type of image for generating other types of images. It was able to generalize using out-of-distribution datasets, outperforming flow-based and autoregressive models. EBM was relatively resistant to adversarial perturbations, behaving better than models explicitly trained against them with training for classification. == Applications == Target applications include natural language processing, robotics and computer vision. The first energy-based generative neural network is the generative ConvNet proposed in 2016 for image patterns, where the neural network is a convolutional neural network. The model has been generalized to various domains to learn distributions of videos, and 3D voxels. They are made more effective in their variants. They have proven useful for data generation (e.g., image synthesis, video synthesis, 3D shape synthesis, etc.), data recovery (e.g., recovering videos with missing pixels or image frames, 3D super-resolution, etc), data reconstruction (e.g., image reconstruction and linear interpolation ). == Alternatives == EBMs compete with techniques such as variational autoencoders (VAEs), generative adversarial networks (GANs) or normalizing flows. == Extensions == === Joint energy-based models === Joint energy-based models (JEM), proposed in 2020 by Grathwohl et al., allow any classifier with softmax output to be interpreted as energy-based model. The key observation is that such a classifier is trained to predict the conditional probability p θ ( y | x ) = e f → θ ( x ) [ y ] ∑ j = 1 K e f → θ ( x ) [ j ] for y = 1 , … , K and f → θ = ( f 1 , … , f K ) ∈ R K , {\displaystyle p_{\theta }(y|x)={\frac {e^{{\vec {f}}_{\theta }(x)[y]}}{\sum _{j=1}^{K}e^{{\vec {f}}_{\theta }(x)[j]}}}\ \ {\text{ for }}y=1,\dotsc ,K{\text{ and }}{\vec {f}}_{\theta }=(f_{1},\dotsc ,f_{K})\in \mathbb {R} ^{K},} where f → θ ( x ) [ y ] {\displaystyle {\vec {f}}_{\theta }(x)[y]} is the y-th index of the logits f → {\displaystyle {\vec {f}}} corresponding to class y. Without any change to the logits it was proposed to reinterpret the logits to describe a joint probability density: p θ ( y , x ) = e f → θ ( x ) [ y ] Z ( θ ) , {\displaystyle p_{\theta }(y,x)={\frac {e^{{\vec {f}}_{\theta }(x)[y]}}{Z(\theta )}},} with unknown partition function Z ( θ ) {\displaystyle Z(\theta )} and energy E θ ( x , y ) = − f θ ( x ) [ y ] {\displaystyle E_{\theta }(x,y)=-f_{\theta }(x)[y]} . By marginalization, we obtain the unnormalized density p θ ( x ) = ∑ y p θ ( y , x ) = ∑ y e f → θ ( x ) [ y ] Z ( θ ) =: e − E θ ( x ) , {\displaystyle p_{\theta }(x)=\sum _{y}p_{\theta }(y,x)=\sum _{y}{\frac {e^{{\vec {f}}_{\theta }(x)[y]}}{Z(\theta )}}=:e^{-E_{\theta }(x)},} therefore, E θ ( x ) = − log ⁡ ( ∑ y e f → θ ( x ) [ y ] Z ( θ ) ) , {\displaystyle E_{\theta }(x)=-\log \left(\sum _{y}{\frac {e^{{\vec {f}}_{\theta }(x)[y]}}{Z(\theta )}}\right),} so that any classifier can be used to define an energy function E θ ( x ) {\displaystyle E_{\theta }(x)} .

Instance-based learning

In machine learning, instance-based learning (sometimes called memory-based learning) is a family of learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training, which have been stored in memory. Because computation is postponed until a new instance is observed, these algorithms are sometimes referred to as "lazy." It is called instance-based because it constructs hypotheses directly from the training instances themselves. This means that the hypothesis complexity can grow with the data: in the worst case, a hypothesis is a list of n training items and the computational complexity of classifying a single new instance is O(n). One advantage that instance-based learning has over other methods of machine learning is its ability to adapt its model to previously unseen data. Instance-based learners may simply store a new instance or throw an old instance away. Examples of instance-based learning algorithms are the k-nearest neighbors algorithm, kernel machines and RBF networks. These store (a subset of) their training set; when predicting a value/class for a new instance, they compute distances or similarities between this instance and the training instances to make a decision. To battle the memory complexity of storing all training instances, as well as the risk of overfitting to noise in the training set, instance reduction algorithms have been proposed.

Grokking (machine learning)

In machine learning, grokking, or delayed generalization, is a phenomenon observed in some settings where a model abruptly transitions from overfitting (performing well only on training data) to generalizing (performing well on both training and test data), after many training iterations with little or no improvement on the held-out data. This contrasts with what is typically observed in machine learning, where generalization occurs gradually alongside improved performance on training data. == Origin == Grokking was introduced by OpenAI researcher Alethea Power and colleagues in the January 2022 paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets". It is derived from the word grok coined by Robert Heinlein in his novel Stranger in a Strange Land. In ML research, "grokking" is not used as a synonym for "generalization"; rather, it names a sometimes-observed delayed‑generalization training phenomenon in which training and held‑out performance do not improve in tandem, and in which held‑out performance rises abruptly later. Authors also analyze the "grokking time", the epoch or step at which this transition occurs in those scenarios. == Interpretations == Grokking can be understood as a phase transition during the training process. In particular, recent work has shown that grokking may be due to a complexity phase transition in the model during training. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research. One potential explanation is that the weight decay (a component of the loss function that penalizes higher values of the neural network parameters, also called regularization) slightly favors the general solution that involves lower weight values, but that is also harder to find. According to Neel Nanda, the process of learning the general solution may be gradual, even though the transition to the general solution occurs more suddenly later. Recent theories have hypothesized that grokking occurs when neural networks transition from a "lazy training" regime where the weights do not deviate far from initialization, to a "rich" regime where weights abruptly begin to move in task-relevant directions. Follow-up empirical and theoretical work has accumulated evidence in support of this perspective, and it offers a unifying view of earlier work as the transition from lazy to rich training dynamics is known to arise from properties of adaptive optimizers, weight decay, initial parameter weight norm, and more. This perspective is complementary to a unifying "pattern learning speeds" framework that links grokking and double descent; within this view, delayed generalization can arise across training time ("epoch‑wise") or across model size ("model‑wise"), and the authors report "model‑wise grokking".