Wolfgang Ketter (born Traben-Trarbach, Germany, 1972) is Chaired Professor of Information Systems for a Sustainable Society at the University of Cologne. and a prominent scientist in the application of artificial intelligence, machine learning and intelligent agents in the design of smart markets, including demand response mechanisms and in particular automated auctions. He is a co-founder of the open energy system platform Power TAC, an automated retail electricity trading platform that simulates the performance of retail markets in an increasingly prosumer- and renewable-energy-influenced electricity landscape. == Career == === Advisory roles === Ketter is an advisor on the energy transition to the German government, in particular, the energy-intensive German state of North Rhine-Westphalia. He is also a fellow of the World Economic Forum and member of the WEF Global Council on Future Mobility and the Global New Mobility Coalition, contributing on the use of AI and machine learning to address issues arising from growth in electrification of energy such as the use of batteries as virtual power plants, the management of electric vehicle charging to prevent grid congestion, or the potential for peer-to-peer electricity trading. Ketter has also been an advisor for over a decade to the Port of Rotterdam on the design of energy cooperatives and energy trading platforms as well as one of the largest auction companies in the world, Royal FloraHolland, where his initial research led to a redesign of auction mechanisms and decision support systems. The cumulative research project team received the Association for Information Systems Impact Award in 2020 === Research === Ketter’s research is multidisciplinary, addressing the overlap of AI and ML in the economics of retail energy and mobility markets. The industry and policy applications of his research interconnect in large-scale projects such as the EU Smart city development project Ruggedised, for which the Erasmus University-based team's publication on the optimization of the City of Rotterdam's electric transit bus network was recognized with the Institute for Operations Research and the Management Sciences Daniel H. Wagner runner-up award. His research focuses on the use of competitive benchmarking and intelligent agents in virtual world simulations of retail energy markets as part of a smart grid. A small-scale version of the Power TAC project led to a publication on demand side management, 'A simulation of household behavior under variable prices' that has several hundred citations in publications representing a variety of scientific disciplines. Two of his publications in the Management Information Systems Quarterly journal and one in Energy Economics form the foundation for the current Power TAC platform. In 2016 and 2019 he was Chair of the Workshop on Information Technologies and Systems. Ketter is Coordinator of the Key Research Initiative Sustainable Smart Energy & Mobility at the University of Cologne, where he is a chaired Professor of Information Systems for a Sustainable Society. At the Rotterdam School of Management, Erasmus University, he is Professor of Next Generation Information Systems as well as Director of the Erasmus Centre for Future Energy Business and Academic Director of Smart Cities and Smart Energy at the Erasmus Centre of Data Analytics. He has been a visiting professor at the Haas School of Business and Berkeley Institute of Data Science, University of California at Berkeley in 2016 to 2017.
Proximal gradient methods for learning
Proximal gradient (forward backward splitting) methods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable. One such example is ℓ 1 {\displaystyle \ell _{1}} regularization (also known as Lasso) of the form min w ∈ R d 1 n ∑ i = 1 n ( y i − ⟨ w , x i ⟩ ) 2 + λ ‖ w ‖ 1 , where x i ∈ R d and y i ∈ R . {\displaystyle \min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\lambda \|w\|_{1},\quad {\text{ where }}x_{i}\in \mathbb {R} ^{d}{\text{ and }}y_{i}\in \mathbb {R} .} Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application. Such customized penalties can help to induce certain structure in problem solutions, such as sparsity (in the case of lasso) or group structure (in the case of group lasso). == Relevant background == Proximal gradient methods are applicable in a wide variety of scenarios for solving convex optimization problems of the form min x ∈ H F ( x ) + R ( x ) , {\displaystyle \min _{x\in {\mathcal {H}}}F(x)+R(x),} where F {\displaystyle F} is convex and differentiable with Lipschitz continuous gradient, R {\displaystyle R} is a convex, lower semicontinuous function which is possibly nondifferentiable, and H {\displaystyle {\mathcal {H}}} is some set, typically a Hilbert space. The usual criterion of x {\displaystyle x} minimizes F ( x ) + R ( x ) {\displaystyle F(x)+R(x)} if and only if ∇ ( F + R ) ( x ) = 0 {\displaystyle \nabla (F+R)(x)=0} in the convex, differentiable setting is now replaced by 0 ∈ ∂ ( F + R ) ( x ) , {\displaystyle 0\in \partial (F+R)(x),} where ∂ φ {\displaystyle \partial \varphi } denotes the subdifferential of a real-valued, convex function φ {\displaystyle \varphi } . Given a convex function φ : H → R {\displaystyle \varphi :{\mathcal {H}}\to \mathbb {R} } an important operator to consider is its proximal operator prox φ : H → H {\displaystyle \operatorname {prox} _{\varphi }:{\mathcal {H}}\to {\mathcal {H}}} defined by prox φ ( u ) = arg min x ∈ H φ ( x ) + 1 2 ‖ u − x ‖ 2 2 , {\displaystyle \operatorname {prox} _{\varphi }(u)=\operatorname {arg} \min _{x\in {\mathcal {H}}}\varphi (x)+{\frac {1}{2}}\|u-x\|_{2}^{2},} which is well-defined because of the strict convexity of the ℓ 2 {\displaystyle \ell _{2}} norm. The proximal operator can be seen as a generalization of a projection. We see that the proximity operator is important because x ∗ {\displaystyle x^{}} is a minimizer to the problem min x ∈ H F ( x ) + R ( x ) {\displaystyle \min _{x\in {\mathcal {H}}}F(x)+R(x)} if and only if x ∗ = prox γ R ( x ∗ − γ ∇ F ( x ∗ ) ) , {\displaystyle x^{}=\operatorname {prox} _{\gamma R}\left(x^{}-\gamma \nabla F(x^{})\right),} where γ > 0 {\displaystyle \gamma >0} is any positive real number. === Moreau decomposition === One important technique related to proximal gradient methods is the Moreau decomposition, which decomposes the identity operator as the sum of two proximity operators. Namely, let φ : X → R {\displaystyle \varphi :{\mathcal {X}}\to \mathbb {R} } be a lower semicontinuous, convex function on a vector space X {\displaystyle {\mathcal {X}}} . We define its Fenchel conjugate φ ∗ : X → R {\displaystyle \varphi ^{}:{\mathcal {X}}\to \mathbb {R} } to be the function φ ∗ ( u ) := sup x ∈ X ⟨ x , u ⟩ − φ ( x ) . {\displaystyle \varphi ^{}(u):=\sup _{x\in {\mathcal {X}}}\langle x,u\rangle -\varphi (x).} The general form of Moreau's decomposition states that for any x ∈ X {\displaystyle x\in {\mathcal {X}}} and any γ > 0 {\displaystyle \gamma >0} that x = prox γ φ ( x ) + γ prox φ ∗ / γ ( x / γ ) , {\displaystyle x=\operatorname {prox} _{\gamma \varphi }(x)+\gamma \operatorname {prox} _{\varphi ^{}/\gamma }(x/\gamma ),} which for γ = 1 {\displaystyle \gamma =1} implies that x = prox φ ( x ) + prox φ ∗ ( x ) {\displaystyle x=\operatorname {prox} _{\varphi }(x)+\operatorname {prox} _{\varphi ^{}}(x)} . The Moreau decomposition can be seen to be a generalization of the usual orthogonal decomposition of a vector space, analogous with the fact that proximity operators are generalizations of projections. In certain situations it may be easier to compute the proximity operator for the conjugate φ ∗ {\displaystyle \varphi ^{}} instead of the function φ {\displaystyle \varphi } , and therefore the Moreau decomposition can be applied. This is the case for group lasso. == Lasso regularization == Consider the regularized empirical risk minimization problem with square loss and with the ℓ 1 {\displaystyle \ell _{1}} norm as the regularization penalty: min w ∈ R d 1 n ∑ i = 1 n ( y i − ⟨ w , x i ⟩ ) 2 + λ ‖ w ‖ 1 , {\displaystyle \min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\lambda \|w\|_{1},} where x i ∈ R d and y i ∈ R . {\displaystyle x_{i}\in \mathbb {R} ^{d}{\text{ and }}y_{i}\in \mathbb {R} .} The ℓ 1 {\displaystyle \ell _{1}} regularization problem is sometimes referred to as lasso (least absolute shrinkage and selection operator). Such ℓ 1 {\displaystyle \ell _{1}} regularization problems are interesting because they induce sparse solutions, that is, solutions w {\displaystyle w} to the minimization problem have relatively few nonzero components. Lasso can be seen to be a convex relaxation of the non-convex problem min w ∈ R d 1 n ∑ i = 1 n ( y i − ⟨ w , x i ⟩ ) 2 + λ ‖ w ‖ 0 , {\displaystyle \min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\lambda \|w\|_{0},} where ‖ w ‖ 0 {\displaystyle \|w\|_{0}} denotes the ℓ 0 {\displaystyle \ell _{0}} "norm", which is the number of nonzero entries of the vector w {\displaystyle w} . Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors. === Solving for L1 proximity operator === For simplicity we restrict our attention to the problem where λ = 1 {\displaystyle \lambda =1} . To solve the problem min w ∈ R d 1 n ∑ i = 1 n ( y i − ⟨ w , x i ⟩ ) 2 + ‖ w ‖ 1 , {\displaystyle \min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\|w\|_{1},} we consider our objective function in two parts: a convex, differentiable term F ( w ) = 1 n ∑ i = 1 n ( y i − ⟨ w , x i ⟩ ) 2 {\displaystyle F(w)={\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}} and a convex function R ( w ) = ‖ w ‖ 1 {\displaystyle R(w)=\|w\|_{1}} . Note that R {\displaystyle R} is not strictly convex. Let us compute the proximity operator for R ( w ) {\displaystyle R(w)} . First we find an alternative characterization of the proximity operator prox R ( x ) {\displaystyle \operatorname {prox} _{R}(x)} as follows: u = prox R ( x ) ⟺ 0 ∈ ∂ ( R ( u ) + 1 2 ‖ u − x ‖ 2 2 ) ⟺ 0 ∈ ∂ R ( u ) + u − x ⟺ x − u ∈ ∂ R ( u ) . {\displaystyle {\begin{aligned}u=\operatorname {prox} _{R}(x)\iff &0\in \partial \left(R(u)+{\frac {1}{2}}\|u-x\|_{2}^{2}\right)\\\iff &0\in \partial R(u)+u-x\\\iff &x-u\in \partial R(u).\end{aligned}}} For R ( w ) = ‖ w ‖ 1 {\displaystyle R(w)=\|w\|_{1}} it is easy to compute ∂ R ( w ) {\displaystyle \partial R(w)} : the i {\displaystyle i} th entry of ∂ R ( w ) {\displaystyle \partial R(w)} is precisely ∂ | w i | = { 1 , w i > 0 − 1 , w i < 0 [ − 1 , 1 ] , w i = 0. {\displaystyle \partial |w_{i}|={\begin{cases}1,&w_{i}>0\\-1,&w_{i}<0\\\left[-1,1\right],&w_{i}=0.\end{cases}}} Using the recharacterization of the proximity operator given above, for the choice of R ( w ) = ‖ w ‖ 1 {\displaystyle R(w)=\|w\|_{1}} and γ > 0 {\displaystyle \gamma >0} we have that prox γ R ( x ) {\displaystyle \operatorname {prox} _{\gamma R}(x)} is defined entrywise by ( prox γ R ( x ) ) i = { x i − γ , x i > γ 0 , | x i | ≤ γ x i + γ , x i < − γ , {\displaystyle \left(\operatorname {prox} _{\gamma R}(x)\right)_{i}={\begin{cases}x_{i}-\gamma ,&x_{i}>\gamma \\0,&|x_{i}|\leq \gamma \\x_{i}+\gamma ,&x_{i}<-\gamma ,\end{cases}}} which is known as the soft thresholding operator S γ ( x ) = prox γ ‖ ⋅ ‖ 1 ( x ) {\displaystyle S_{\gamma }(x)=\operatorname {prox} _{\gamma \|\cdot \|_{1}}(x)} . === Fixed point iterative schemes === To finally solve the lasso problem we consider the fixed point equation shown earlier: x ∗ = prox γ R ( x ∗ − γ ∇ F ( x ∗ ) ) . {\displaystyle x^{}=\operatorname {prox} _{\gamma R}\left(x^{}-\gamma \nabla F(x^{})\right).} Given that we have computed the form of the proximity operator explicitly, then we can define a standard fixed point iteration procedure. Namely, fix some initial w 0 ∈ R d {\displaystyle w^{0}\in \mathbb {R} ^{d}} , and for k = 1 , 2 , … {\displaystyle k=1,2,\ldots } define w k + 1 = S γ ( w k − γ ∇ F ( w k ) ) . {\displaystyle w^{k+1}=S_{\gamma }\left(w^{k}-\gamma \nabla F\l
Stochastic gradient descent
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning. == Background == Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum: Q ( w ) = 1 n ∑ i = 1 n Q i ( w ) , {\displaystyle Q(w)={\frac {1}{n}}\sum _{i=1}^{n}Q_{i}(w),} where the parameter w {\displaystyle w} that minimizes Q ( w ) {\displaystyle Q(w)} is to be estimated. Each summand function Q i {\displaystyle Q_{i}} is typically associated with the i {\displaystyle i} -th observation in the data set (used for training). In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations). The sum-minimization problem also arises for empirical risk minimization. There, Q i ( w ) {\displaystyle Q_{i}(w)} is the value of the loss function at i {\displaystyle i} -th example, and Q ( w ) {\displaystyle Q(w)} is the empirical risk. When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations: w := w − η ∇ Q ( w ) = w − η n ∑ i = 1 n ∇ Q i ( w ) . {\displaystyle w:=w-\eta \,\nabla Q(w)=w-{\frac {\eta }{n}}\sum _{i=1}^{n}\nabla Q_{i}(w).} The step size is denoted by η {\displaystyle \eta } (sometimes called the learning rate in machine learning) and here " := {\displaystyle :=} " denotes the update of a variable in the algorithm. In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations. However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems. == Iterative method == In stochastic (or "on-line") gradient descent, the true gradient of Q ( w ) {\displaystyle Q(w)} is approximated by a gradient at a single sample: w := w − η ∇ Q i ( w ) . {\displaystyle w:=w-\eta \,\nabla Q_{i}(w).} As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges. In pseudocode, stochastic gradient descent can be presented as : A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than "true" stochastic gradient descent described, because the code can make use of vectorization libraries rather than computing each step separately as was first shown in where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples. The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates η {\displaystyle \eta } decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem. == Linear regression == Suppose we want to fit a straight line y ^ = w 1 + w 2 x {\displaystyle {\hat {y}}=w_{1}+w_{2}x} to a training set with observations ( ( x 1 , y 1 ) , ( x 2 , y 2 ) … , ( x n , y n ) ) {\displaystyle ((x_{1},y_{1}),(x_{2},y_{2})\ldots ,(x_{n},y_{n}))} and corresponding estimated responses ( y ^ 1 , y ^ 2 , … , y ^ n ) {\displaystyle ({\hat {y}}_{1},{\hat {y}}_{2},\ldots ,{\hat {y}}_{n})} using least squares. The objective function to be minimized is Q ( w ) = ∑ i = 1 n Q i ( w ) = ∑ i = 1 n ( y ^ i − y i ) 2 = ∑ i = 1 n ( w 1 + w 2 x i − y i ) 2 . {\displaystyle Q(w)=\sum _{i=1}^{n}Q_{i}(w)=\sum _{i=1}^{n}\left({\hat {y}}_{i}-y_{i}\right)^{2}=\sum _{i=1}^{n}\left(w_{1}+w_{2}x_{i}-y_{i}\right)^{2}.} The last line in the above pseudocode for this specific problem will become: [ w 1 w 2 ] ← [ w 1 w 2 ] − η [ ∂ ∂ w 1 ( w 1 + w 2 x i − y i ) 2 ∂ ∂ w 2 ( w 1 + w 2 x i − y i ) 2 ] = [ w 1 w 2 ] − η [ 2 ( w 1 + w 2 x i − y i ) 2 x i ( w 1 + w 2 x i − y i ) ] . {\displaystyle {\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}\leftarrow {\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}{\frac {\partial }{\partial w_{1}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\\{\frac {\partial }{\partial w_{2}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\end{bmatrix}}={\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}2(w_{1}+w_{2}x_{i}-y_{i})\\2x_{i}(w_{1}+w_{2}x_{i}-y_{i})\end{bmatrix}}.} Note that in each iteration or update step, the gradient is only evaluated at a single x i {\displaystyle x_{i}} . This is the key difference between stochastic gradient descent and batched gradient descent. In general, given a linear regression y ^ = ∑ k ∈ 1 : m w k x k {\displaystyle {\hat {y}}=\sum _{k\in 1:m}w_{k}x_{k}} problem, stochastic gradient descent behaves differently when m < n {\displaystyle m Aleph (A Learning Engine for Proposing Hypotheses) is an inductive logic programming system introduced by Ashwin Srinivasan in 2001. As of 2022 it is still one of the most widely used inductive logic programming systems. It is based on the earlier system Progol. == Learning task == The input to Aleph is background knowledge, specified as a logic program, a language bias in the form of mode declarations, as well as positive and negative examples specified as ground facts. As output it returns a logic program which, together with the background knowledge, entails all of the positive examples and none of the negative examples. == Basic algorithm == Starting with an empty hypothesis, Aleph proceeds as follows: It chooses a positive example to generalise; if none are left, it aborts and outputs the current hypothesis. Then it constructs the bottom clause, that is, the most specific clause that is allowed by the mode declarations and covers the example. It then searches for a generalisation of the bottom clause that scores better on the chosen metric. It then adds the new clause to the hypothesis program and removes all examples that are covered by the new clause. == Search algorithm == Aleph searches for clauses in a top-down manner, using the bottom clause constructed in the preceding step to bound the search from below. It searches the refinement graph in a breadth-first manner, with tunable parameters to bound the maximal clause size and proof depth. It scores each clause using one of 13 different evaluation metrics, as chosen in advance by the user. Amazon Rekognition is a cloud-based software as a service (SaaS) computer vision platform that was launched in 2016. It has been sold to, and used by, a number of United States government agencies, including U.S. Immigration and Customs Enforcement (ICE) and Orlando, Florida police, as well as private entities. == Capabilities == Rekognition provides a number of computer vision capabilities, which can be divided into two categories: Algorithms that are pre-trained on data collected by Amazon or its partners, and algorithms that a user can train on a custom dataset. As of July 2019, Rekognition provides the following computer vision capabilities. === Pre-trained algorithms === Celebrity recognition in images Facial attribute detection in images, including gender, age range, emotions (e.g. happy, calm, disgusted), whether the face has a beard or mustache, whether the face has eyeglasses or sunglasses, whether the eyes are open, whether the mouth is open, whether the person is smiling, and the location of several markers such as the pupils and jaw line. People Pathing enables tracking of people through a video. An advertised use-case of this capability is to track sports players for post-game analysis. Text detection and classification in images Unsafe visual content detection === Algorithms that a user can train on a custom dataset === SearchFaces enables users to import a database of images with pre-labeled faces, to train a machine learning model on this database, and to expose the model as a cloud service with an API. Then, the user can post new images to the API and receive information about the faces in the image. The API can be used to expose a number of capabilities, including identifying faces of known people, comparing faces, and finding similar faces in a database. Face-based user verification == History and use == === 2017 === In late 2017, the Washington County, Oregon Sheriff's Office began using Rekognition to identify suspects' faces. Rekognition was marketed as a general-purpose computer vision tool, and an engineer working for Washington County decided to use the tool for facial analysis of suspects. Rekognition was offered to the department for free, and Washington County became the first US law enforcement agency known to use Rekognition. In 2018, the agency logged over 1,000 facial searches. The county, according to the Washington Post, by 2019 was paying about $7 a month for all of its searches. The relationship was unknown to the public until May 2018. In 2018, Rekognition was also used to help identify celebrities during a royal wedding telecast. === 2018 === In April 2018, it was reported that FamilySearch was using Rekognition to enable their users to "see which of their ancestors they most resemble based on family photographs". In early 2018, the FBI also began using it as a pilot program for analyzing video surveillance. In May 2018, it was reported by the ACLU that Orlando, Florida was running a pilot using Rekognition for facial analysis in law enforcement, with that pilot ending in July 2019. After the report, on June 22, 2018, Gizmodo reported that Amazon workers had written a letter to CEO Jeff Bezos requesting he cease selling Rekognition to US law enforcement, particularly ICE and Homeland Security. A letter was also sent to Bezos by the ACLU. On June 26, 2018, it was reported that the Orlando police force had ceased using Rekognition after their trial contract expired, reserving the right to use it in the future. The Orlando Police Department said that they had "never gotten to the point to test images" due to old infrastructure and low bandwidth. In July 2018, the ACLU released a test showing that Rekognition had falsely matched 28 members of Congress with mugshot photos, particularly Congresspeople of color. 25 House members afterwards sent a letter to Bezos, expressing concern about Rekognition. Amazon responded saying the Rekognition test had generated 80 percent confidence, while it recommended law enforcement only use matches rated at 99 percent confidence. The Washington Post states that Oregon instead has officers pick a "best of five" result, instead of adhering to the recommendation. In September 2018, it was reported that Mapillary was using Rekognition to read the text on parking signs (e.g. no stopping, no parking, or specific parking hours) in cities. In October 2018, it was reported that Amazon had earlier that year pitched Rekognition to U.S. Immigration and Customs Enforcement agency. Amazon defended government use of Rekognition. On December 1, 2018, it was reported that 8 Democratic lawmakers had said in a letter that Amazon had "failed to provide sufficient answers" about Rekognition, writing that they had "serious concerns that this type of product has significant accuracy issues, places disproportionate burdens on communities of color, and could stifle Americans' willingness to exercise their First Amendment rights in public." === 2019 === In January 2019, MIT researchers published a peer-reviewed study asserting that Rekognition had more difficulty in identifying dark-skinned females than competitors such as IBM and Microsoft. In the study, Rekognition misidentified darker-skinned women as men 31% of the time, but made no mistakes for light-skinned men. Amazon called the report "misinterpreted results" of the research with an improper "default confidence threshold." In January 2019, Amazon's shareholders "urged Amazon to stop selling Rekognition software to law enforcement agencies." Amazon in response defended its use of Rekognition, but supported new federal oversight and guidelines to "make sure facial recognition technology cannot be used to discriminate." In February 2019, it was reported that Amazon was collaborating with the National Institute of Standards and Technology (NIST) on developing standardized tests to improve accuracy and remove bias with facial recognition. In March 2019, an open letter regarding Rekognition was sent by a group of prominent AI researchers to Amazon, criticizing its sale to law enforcement with around 50 signatures. In April 2019, Amazon was told by the Securities and Exchange Commission that they had to vote on two shareholder proposals seeking to limit Rekognition. Amazon argued that the proposals were an "insignificant public policy issue for the Company" not related to Amazon's ordinary business, but their appeal was denied. The vote was set for May. The first proposal was tabled by shareholders. On May 24, 2019, 2.4% of shareholders voted to stop selling Rekognition to government agencies, while a second proposal calling for a study into Rekognition and civil rights had 27.5% support. In August 2019, the ACLU again used Rekognition on members of government, with 26 of 120 lawmakers in California flagged as matches to mugshots. Amazon stated the ACLU was "misusing" the software in the tests, by not dismissing results that did not meet Amazon's recommended accuracy threshold of 99%. By August 2019, there had been protests against ICE's use of Rekognition to surveil immigrants. In March 2019, Amazon announced a Rekognition update that would improve emotional detection, and in August 2019, "fear" was added to emotions that Rekognition could detect. === 2020 === In June 2020, Amazon announced it was implementing a one-year moratorium on police use of Rekognition, in response to the George Floyd protests. === 2024 === The Department of Justice disclosed that the FBI is initiating the use of Amazon Rekognition. The DOJ's AI inventory revealed the FBI's "Project Tyr" aims to customize Rekognition to identify nudity, weapons, explosives, and other information from lawfully acquired media. === 2025 === In late 2025, the New York Times reported that scientist, Dr. Jürgen Matthäus, retired from as the head of research at the U.S. Holocaust Memorial Museum in Washington, D.C., used Amazon Rekognition to identify the shooter in the Holocaust photograph known as The Last Jew in Vinnitsa "with more than 99 percent certainty" — as Jakobus Onnen (1906–1943), a teacher from Tichelwarf near Weener in East Frisia who had been a member of the SS since 1934 and was later killed in action near Zhitomir in 1943. The photographer and victim remain unidentified. == Controversy regarding facial analysis == === Racial and gender bias === In 2018, MIT researchers Joy Buolamwini and Timnit Gebru published a study called Gender Shades. In this study, a set of images was collected, and faces in the images were labeled with face position, gender, and skin tone information. The images were run through SaaS facial recognition platforms from Face++, IBM, and Microsoft. In all three of these platforms, the classifiers performed best on male faces (with error rates on female faces being 8.1% to 20.6% higher than error rates on male faces), and they performed worst on dark female faces (with error rates ranging from 20.8% to 30.4%). The authors hypothesized that this discr The Deep Learning Indaba is an annual conference and educational event that aims to strengthen machine learning and artificial intelligence (AI) capacity across Africa. Launched in 2017, it brings together students, researchers, industry practitioners, and policymakers from across the African continent. == History == The Deep Learning Indaba began in 2017 at the University of the Witwatersrand with over 300 participants from 23 African countries, offering tutorials in advanced AI topics and featuring notable speakers like Nando de Freitas. In 2018, it expanded to 650 delegates at Stellenbosch University, introducing parallel sessions to encourage collaboration. The 2019 edition in Nairobi, Kenya, reflected further growth, with increasing sponsorship and support from major tech companies like Google and Microsoft. === Deep Learning IndabaX === In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., prediction of prices in the financial international markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches. Online machine learning algorithms find applications in a wide variety of fields such as sponsored search to maximize ad revenue, portfolio optimization, shortest path prediction (with stochastic weights, e.g. traffic on roads for a maps application), spam filtering, real-time fraud detection, dynamic pricing for e-commerce, etc. There is also growing interest in usage of online learning paradigms for LLMs to enable continuous, real-time adaptation after the initial training. == Introduction == In the setting of supervised learning, a function of f : X → Y {\displaystyle f:X\to Y} is to be learned, where X {\displaystyle X} is thought of as a space of inputs and Y {\displaystyle Y} as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution p ( x , y ) {\displaystyle p(x,y)} on X × Y {\displaystyle X\times Y} . In reality, the learner never knows the true distribution p ( x , y ) {\displaystyle p(x,y)} over instances. Instead, the learner usually has access to a training set of examples ( x 1 , y 1 ) , … , ( x n , y n ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{n},y_{n})} . In this setting, the loss function is given as V : Y × Y → R {\displaystyle V:Y\times Y\to \mathbb {R} } , such that V ( f ( x ) , y ) {\displaystyle V(f(x),y)} measures the difference between the predicted value f ( x ) {\displaystyle f(x)} and the true value y {\displaystyle y} . The ideal goal is to select a function f ∈ H {\displaystyle f\in {\mathcal {H}}} , where H {\displaystyle {\mathcal {H}}} is a space of functions called a hypothesis space, so that some notion of total loss is minimized. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms. == Statistical view of online learning == In statistical learning models, the training sample ( x i , y i ) {\displaystyle (x_{i},y_{i})} are assumed to have been drawn from the true distribution p ( x , y ) {\displaystyle p(x,y)} and the objective is to minimize the expected "risk" I [ f ] = E [ V ( f ( x ) , y ) ] = ∫ V ( f ( x ) , y ) d p ( x , y ) . {\displaystyle I[f]=\mathbb {E} [V(f(x),y)]=\int V(f(x),y)\,dp(x,y)\ .} A common paradigm in this situation is to estimate a function f ^ {\displaystyle {\hat {f}}} through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input ( x t + 1 , y t + 1 ) {\displaystyle (x_{t+1},y_{t+1})} , the current best predictor f t {\displaystyle f_{t}} and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used where f t + 1 {\displaystyle f_{t+1}} is permitted to depend on f t {\displaystyle f_{t}} and all previous data points ( x 1 , y 1 ) , … , ( x t , y t ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{t},y_{t})} . In this case, the space requirements are no longer guaranteed to be constant since it requires storing all previous data points, but the solution may take less time to compute with the addition of a new data point, as compared to batch learning techniques. A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of b ≥ 1 {\displaystyle b\geq 1} data points at a time, this can be considered as pseudo-online learning for b {\displaystyle b} much smaller than the total number of training points. Mini-batch techniques are used with repeated passing over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example, stochastic gradient descent. When combined with backpropagation, this is currently the de facto training method for training artificial neural networks. === Example: linear least squares === The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions. === Batch learning === Consider the setting of supervised learning with f {\displaystyle f} being a linear function to be learned: f ( x j ) = ⟨ w , x j ⟩ = w ⋅ x j {\displaystyle f(x_{j})=\langle w,x_{j}\rangle =w\cdot x_{j}} where x j ∈ R d {\displaystyle x_{j}\in \mathbb {R} ^{d}} is a vector of inputs (data points) and w ∈ R d {\displaystyle w\in \mathbb {R} ^{d}} is a linear filter vector. The goal is to compute the filter vector w {\displaystyle w} . To this end, a square loss function V ( f ( x j ) , y j ) = ( f ( x j ) − y j ) 2 = ( ⟨ w , x j ⟩ − y j ) 2 {\displaystyle V(f(x_{j}),y_{j})=(f(x_{j})-y_{j})^{2}=(\langle w,x_{j}\rangle -y_{j})^{2}} is used to compute the vector w {\displaystyle w} that minimizes the empirical loss I n [ w ] = ∑ j = 1 n V ( ⟨ w , x j ⟩ , y j ) = ∑ j = 1 n ( x j T w − y j ) 2 {\displaystyle I_{n}[w]=\sum _{j=1}^{n}V(\langle w,x_{j}\rangle ,y_{j})=\sum _{j=1}^{n}(x_{j}^{\mathsf {T}}w-y_{j})^{2}} where y j ∈ R . {\displaystyle y_{j}\in \mathbb {R} .} Let X {\displaystyle X} be the i × d {\displaystyle i\times d} data matrix and y ∈ R i {\displaystyle y\in \mathbb {R} ^{i}} is the column vector of target values after the arrival of the first i {\displaystyle i} data points. Assuming that the covariance matrix Σ i = X T X {\displaystyle \Sigma _{i}=X^{\mathsf {T}}X} is invertible (otherwise it is preferential to proceed in a similar fashion with Tikhonov regularization), the best solution f ∗ ( x ) = ⟨ w ∗ , x ⟩ {\displaystyle f^{}(x)=\langle w^{},x\rangle } to the linear least squares problem is given by w ∗ = ( X T X ) − 1 X T y = Σ i − 1 ∑ j = 1 i x j y j . {\displaystyle w^{}=(X^{\mathsf {T}}X)^{-1}X^{\mathsf {T}}y=\Sigma _{i}^{-1}\sum _{j=1}^{i}x_{j}y_{j}.} Now, calculating the covariance matrix Σ i = ∑ j = 1 i x j x j T {\displaystyle \Sigma _{i}=\sum _{j=1}^{i}x_{j}x_{j}^{\mathsf {T}}} takes time O ( i d 2 ) {\displaystyle O(id^{2})} , inverting the d × d {\displaystyle d\times d} matrix takes time O ( d 3 ) {\displaystyle O(d^{3})} , while the rest of the multiplication takes time O ( d 2 ) {\displaystyle O(d^{2})} , giving a total time of O ( i d 2 + d 3 ) {\displaystyle O(id^{2}+d^{3})} . When there are n {\displaystyle n} total points in the dataset, to recompute the solution after the arrival of every datapoint i = 1 , … , n {\displaystyle i=1,\ldots ,n} , the naive approach will have a total complexity O ( n 2 d 2 + n d 3 ) {\displaystyle O(n^{2}d^{2}+nd^{3})} . Note that when storing the matrix Σ i {\displaystyle \Sigma _{i}} , then updating it at each step needs only adding x i + 1 x i + 1 T {\displaystyle x_{i+1}x_{i+1}^{\mathsf {T}}} , which takes O ( d 2 ) {\displaystyle O(d^{2})} time, reducing the total time to O ( n d 2 + n d 3 ) = O ( n d 3 ) {\displaystyle O(nd^{2}+nd^{3})=O(nd^{3})} , but with an additional storage space of O ( d 2 ) {\displaystyle O(d^{2})} to store Σ i {\displaystyle \Sigma _{i}} . === Online learning: recursive least squares === The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising w 0 = 0 ∈ R d {\displaystyle \textstyle w_{0}=0\in \mathbb {R} ^{d}} and Γ 0 = I ∈ R d × d {\displaystyle \textstyle \Gamma _{0}=I\in \mathbb {R} ^{d\times d}} , the solution of the linear least squares problem given in the previous section can be computed by the following iteration: Γ i = Γ i − 1 − Γ i − 1 x i x i T Γ i − 1 1 + x i T Γ i − 1 x i {\displaystyle \Gamma _{i}=\Gamma _{i-1}-{\frac {\Gamma _{i-1}x_{i}x_{i}^{\mathsf {T}}\Gamma _{i-1}}{1+x_{i}^{\mathsf {T}}\Gamma _{i-1}x_{i}}}} w i = w i − 1 − Γ i x i ( x i T w i − 1 − y i ) {\displaystyle w_{i}=w_{i-1}-\Gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{Aleph (ILP)
Amazon Rekognition
Deep Learning Indaba
Online machine learning