In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step. It can be used, for example, to estimate a mixture of gaussians, or to solve the multiple linear regression problem. == History == The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin. They pointed out that the method had been "proposed many times in special circumstances" by earlier authors. One of the earliest is the gene-counting method for estimating allele frequencies by Cedric Smith. Another was proposed by H.O. Hartley in 1958, and Hartley and Hocking in 1977, from which many of the ideas in the Dempster–Laird–Rubin paper originated. Another one by S.K Ng, Thriyambakam Krishnan and G.J McLachlan in 1977. Hartley's ideas can be broadened to any grouped discrete distribution. A very detailed treatment of the EM method for exponential families was published by Rolf Sundberg in his thesis and several papers, following his collaboration with Per Martin-Löf and Anders Martin-Löf. The Dempster–Laird–Rubin paper in 1977 generalized the method and sketched a convergence analysis for a wider class of problems. The Dempster–Laird–Rubin paper established the EM method as an important tool of statistical analysis. See also Meng and van Dyk (1997). The convergence analysis of the Dempster–Laird–Rubin algorithm was flawed and a correct convergence analysis was published by C. F. Jeff Wu in 1983. Wu's proof established the EM method's convergence also outside of the exponential family, as claimed by Dempster–Laird–Rubin. == Introduction == The EM algorithm is used to find (local) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. That is, either missing values exist among the data, or the model can be formulated more simply by assuming the existence of further unobserved data points. For example, a mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point, or latent variable, specifying the mixture component to which each data point belongs. Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually impossible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation. The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. It's not obvious that this will work, but it can be proven in this context. Additionally, it can be proven that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a local maximum or a saddle point. In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have singularities in them, i.e., nonsensical maxima. For example, one of the solutions that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points. == Description == === The symbols === Given the statistical model which generates a set X {\displaystyle \mathbf {X} } of observed data, a set of unobserved latent data or missing values Z {\displaystyle \mathbf {Z} } , and a vector of unknown parameters θ {\displaystyle {\boldsymbol {\theta }}} , along with a likelihood function L ( θ ; X , Z ) = p ( X , Z ∣ θ ) {\displaystyle L({\boldsymbol {\theta }};\mathbf {X} ,\mathbf {Z} )=p(\mathbf {X} ,\mathbf {Z} \mid {\boldsymbol {\theta }})} , the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data L ( θ ; X ) = p ( X ∣ θ ) = ∫ p ( X , Z ∣ θ ) d Z = ∫ p ( X ∣ Z , θ ) p ( Z ∣ θ ) d Z {\displaystyle {\begin{aligned}L({\boldsymbol {\theta }};\mathbf {X} )=p(\mathbf {X} \mid {\boldsymbol {\theta }})&=\int p(\mathbf {X} ,\mathbf {Z} \mid {\boldsymbol {\theta }})\,d\mathbf {Z} \\&=\int p(\mathbf {X} \mid \mathbf {Z} ,{\boldsymbol {\theta }})p(\mathbf {Z} \mid {\boldsymbol {\theta }})\,d\mathbf {Z} \end{aligned}}} However, this quantity is often intractable since Z {\displaystyle \mathbf {Z} } is unobserved and the distribution of Z {\displaystyle \mathbf {Z} } is unknown before attaining θ {\displaystyle {\boldsymbol {\theta }}} . === The EM algorithm === The EM algorithm seeks to find the maximum likelihood estimate of the marginal likelihood by iteratively applying these two steps: More succinctly, we can write it as one equation: θ ( t + 1 ) = arg max θ E Z ∼ p ( ⋅ | X , θ ( t ) ) [ log p ( X , Z | θ ) ] {\displaystyle {\boldsymbol {\theta }}^{(t+1)}=\mathop {\arg \max } _{\boldsymbol {\theta }}\operatorname {E} _{\mathbf {Z} \sim p(\cdot |\mathbf {X} ,{\boldsymbol {\theta }}^{(t)})}\left[\log p(\mathbf {X} ,\mathbf {Z} |{\boldsymbol {\theta }})\right]\,} === Interpretation of the variables === The typical models to which EM is applied use Z {\displaystyle \mathbf {Z} } as a latent variable indicating membership in one of a set of groups: The observed data points X {\displaystyle \mathbf {X} } may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). Associated with each data point may be a vector of observations. The missing values (aka latent variables) Z {\displaystyle \mathbf {Z} } are discrete, drawn from a fixed number of values, and with one latent variable per observed unit. The parameters are continuous, and are of two kinds: Parameters that are associated with all data points, and those associated with a specific value of a latent variable (i.e., associated with all data points whose corresponding latent variable has that value). However, it is possible to apply EM to other sorts of models. The motivation is as follows. If the value of the parameters θ {\displaystyle {\boldsymbol {\theta }}} is known, usually the value of the latent variables Z {\displaystyle \mathbf {Z} } can be found by maximizing the log-likelihood over all possible values of Z {\displaystyle \mathbf {Z} } , either simply by iterating over Z {\displaystyle \mathbf {Z} } or through an algorithm such as the Viterbi algorithm for hidden Markov models. Conversely, if we know the value of the latent variables Z {\displaystyle \mathbf {Z} } , we can find an estimate of the parameters θ {\displaystyle {\boldsymbol {\theta }}} fairly easily, typically by simply grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm, in the case where both θ {\displaystyle {\boldsymbol {\theta }}} and Z {\displaystyle \mathbf {Z} } are unknown: First, initialize the parameters θ {\displaystyle {\boldsymbol {\theta }}} to some random values. Compute the probability of each possible value of Z {\displaystyle \mathbf {Z} } , given θ {\displaystyle {\boldsymbol {\theta }}} . Then, use the just-computed values of Z {\displaystyle \mathbf {Z} } to compute a better estimate for the parameters θ {\displaystyle {\boldsymbol {\theta }}} . Iterate steps 2 and 3 until convergence. The algorithm as just described monotonically approaches a local minimum of the cost function. == Properties == Although an EM iteration does increase the observed data (i.e., marginal) likelihood function, no guarantee exists that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm may co
Read more →