The EM algorithm

September 04, 2022

80-629-17A - Machine Learning for Large-Scale Data Analysis and Decision Making (Graduate course) - HEC Montreal – Homework (2018/10/12). Option Model – The EM algorithm

Expectation Maximization (EM) is a general-purpose iterative algorithm for computing maximum likelihood estimates in the presence of incomplete data. The basic intuition for applying EM is that maximizing the likelihood of the observed data directly is difficult, while the enlarged or complete case, which includes the unobserved data, makes the calculation easier. The iterative stages are usually called the E-step and the M-step. EM is a very popular algorithm for parameter estimation in models involving missing values or latent variables.

Preliminary considerations:

Given a training set $\{x_1, \dots, x_m\}$ of observed independent examples, and denoting by $\{z_1, \dots, z_m\}$ the set of all latent variables and by $\theta$ the set of all model parameters, we need to fit the parameters of the model $p(x, z)$ to the data. The corresponding log likelihood is given by:

$$\ell(\theta) = \sum_{i=1}^m \log p(x_i; \theta) = \sum_{i=1}^m \log \sum_{z_i} p(x_i, z_i; \theta)\,.$$

We can observe that the summation over the latent variable $z$ appears inside the logarithm, and this sum prevents the logarithm from acting directly on the joint distribution. Also, the marginal distribution $p(X; \theta)$ typically does not belong to the exponential family even if the joint distribution $p(X, Z; \theta)$ does (Bishop, 2006). As a result, the maximum likelihood solution becomes a complicated expression that is often difficult to calculate. However, if the latent variable $Z$ were observed, the complete case would be $\{X, Z\}$ and the corresponding log likelihood would take the form $\log p(X, Z; \theta)$. The maximization of this complete-data log likelihood is assumed to be easier. In practice, we only have the incomplete data set $X$, and what we know about $Z$ is given by its posterior distribution $p(Z \mid X; \theta)$.

Notes: the current discussion is centered on the summation over $Z$. In case $Z$ is a continuous latent variable, the summation is replaced with an integral over $Z$. The semicolon in the joint probability denotes conditioning on a given realization of the parameters.

Description:

We have introduced the log likelihood $\ell(\theta)$ (Ng, 2017):

$$\ell(\theta) = \sum_{i} \log p(x_i; \theta) = \sum_{i} \log \sum_{z_i} p(x_i, z_i; \theta)\,. \tag{1}$$

For each $i$, let $q_i$ be some probability distribution over the $z$'s ($\sum_{z} q_i(z) = 1$ and $q_i(z) \geq 0$). Then we have that:

$$\ell(\theta) = \sum_{i} \log p(x_i; \theta) = \sum_{i} \log \sum_{z_i} q_i(z_i) \frac{p(x_i, z_i; \theta)}{q_i(z_i)} \tag{2}$$
$$\geq \sum_{i} \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)} \tag{3}$$

In step (2), we simply multiply and divide by $q_i(z_i)$ inside the sum. In step (3), Jensen's inequality is used with the following considerations:

  • since the log function $f$ is concave (if $f(x) = \log x$ then $f''(x) = -1/x^2 < 0$ for $x \in \mathbb{R}^+$), we have $\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])$; and
  • the summation term $\sum_{z_i} q_i(z_i) \frac{p(x_i, z_i; \theta)}{q_i(z_i)}$ in step (2) can be viewed as the expectation of the quantity $\frac{p(x_i, z_i; \theta)}{q_i(z_i)}$ with respect to $z_i$ drawn from the distribution $q_i$.

This can also be written as $\mathbb{E}_{z_i \sim q_i}\left[\frac{p(x_i, z_i; \theta)}{q_i(z_i)}\right]$, where the subscript $z_i \sim q_i$ indicates that the expectation is taken with respect to $z_i$ drawn from $q_i$. Finally, Jensen's inequality gives us

$$\log\left(\mathbb{E}_{z_i \sim q_i}\left[\frac{p(x_i, z_i; \theta)}{q_i(z_i)}\right]\right) \geq \mathbb{E}_{z_i \sim q_i}\left[\log \frac{p(x_i, z_i; \theta)}{q_i(z_i)}\right]$$

and we use this to go from step (2) to step (3). Critically, the result in step (3) gives us a lower bound on $\ell(\theta)$ for any set of distributions $q_i$.
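As a quick numerical sanity check of this bound (not part of the derivation), we can pick an arbitrary discrete distribution $q$ and arbitrary positive ratios $r(z) = p(x, z; \theta)/q(z)$ and verify that $\log \mathbb{E}_q[r] \geq \mathbb{E}_q[\log r]$; the numbers below are made up:

```python
import numpy as np

# A toy discrete distribution q over 4 latent values and arbitrary
# positive ratios r(z) = p(x, z; theta) / q(z) (made-up numbers).
q = np.array([0.1, 0.2, 0.3, 0.4])
r = np.array([2.0, 0.5, 1.5, 0.8])

lhs = np.log(np.sum(q * r))     # log of the expectation, log E_q[r]
rhs = np.sum(q * np.log(r))     # expectation of the log, E_q[log r]

print(lhs, rhs)
assert lhs >= rhs               # Jensen's inequality for the concave log
```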

We hold the parameters $\theta$ at a given value (the current guess) and ask what the best choice of $q_i$ is. Intuitively, the best choice is the one that makes the gap between $\ell(\theta)$ and its lower bound as tight as possible. Jensen's inequality becomes an equality, $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$, when $f$ is linear or $X$ is constant. Therefore, for a particular $\theta$, the inequality in step (3) becomes an equality (the gap becomes zero) when the expectation is taken over a constant (since the log function is not linear). In other words, we need that

$$\frac{p(x_i, z_i; \theta)}{q_i(z_i)} = c$$

for some constant $c$ that does not depend on $z_i$. We can ensure this by choosing

$$q_i(z_i) \propto p(x_i, z_i; \theta)\,.$$

Because $q_i$ is a probability distribution we must enforce that $\sum_{z} q_i(z) = 1$. Then we obtain the following derivation:

$$q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z} p(x_i, z; \theta)} \tag*{(normalize to sum to 1)}$$
$$= \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} \tag*{($z$ has been "marginalized out")}$$
$$= p(z_i \mid x_i; \theta) \tag*{(by conditional probability)}$$
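To make the E-step concrete, here is a small illustrative sketch for a two-component 1-D Gaussian mixture with made-up parameters: the posterior $q_i(z_i) = p(z_i \mid x_i; \theta)$ is simply the joint $p(x_i, z_i; \theta)$ normalized over $z_i$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical current parameters theta for a 2-component 1-D mixture.
pi = np.array([0.3, 0.7])      # mixing coefficients
mu = np.array([-1.0, 2.0])     # component means
sigma = np.array([1.0, 0.5])   # component standard deviations

x_i = 1.2                      # a single observed example

# Joint p(x_i, z_i = k; theta) = pi_k * N(x_i | mu_k, sigma_k^2)
joint = pi * norm.pdf(x_i, loc=mu, scale=sigma)

# E-step: posterior q_i(z_i) = joint / sum_z joint = p(z_i | x_i; theta)
q_i = joint / joint.sum()
print(q_i)                     # sums to 1; the "responsibility" of each component
```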

This result implies that, by setting the $q_i$'s to the posterior distribution of $z_i$ given $x_i$ and the current parameter value $\theta$, the lower bound equals the log likelihood $\ell(\theta)$ (the tightest bound possible), making tangential contact at the given $\theta$. This is the E-step. In the next stage, the M-step, we maximize ("push up") the lower bound of $\ell(\theta)$ stated in step (3) with respect to the parameters $\theta$ appearing in $\log p(x_i, z_i; \theta)$, while keeping the $q_i$ from the E-step fixed. The E-step and M-step are repeated until convergence of the lower bound or of the parameters is reached.

In summary, the EM algorithm is implemented as follows:

  • For a joint distribution $p(X, Z; \theta)$ governed by parameters $\theta$, we try to maximize $\ell(\theta) = \log p(X; \theta)$ with respect to $\theta$.

1- Choose initial values for $\theta$.

  • Repeat 2 and 3 until convergence:

2- E-step: for each $i$

$$q_i(z_i) := p(z_i \mid x_i; \theta)$$

3- M-step: Set

$$\theta := \arg\max_{\theta} \sum_{i} \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)}$$

In the E-step we use the current value (or guess) $\theta^t$ to find the posterior distribution of the unobserved/latent data, $p(z_i \mid x_i; \theta^t)$. Our knowledge of $Z$ is given only by this posterior distribution. In general, we cannot use the complete-data log likelihood directly, but we can use instead its expected value under the posterior distribution of the latent variable. In the M-step we maximize this expectation with respect to $\theta$, as it appears in the complete-data term $\log p(x_i, z_i; \theta)$, while using the posterior $q_i(z_i) := p(z_i \mid x_i; \theta^t)$ found in the E-step. We can also show that the M-step objective can be stated as $\sum_{i} \sum_{z_i} q_i(z_i) \log p(x_i, z_i; \theta)$: the denominator $q_i(z_i)$ in the log term of the M-step objective does not depend on the $\theta$ we are maximizing over and can be treated as a constant during the optimization.
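The E-step/M-step loop above can be captured in a short, model-agnostic sketch; the `e_step` and `m_step` arguments are hypothetical callbacks that a concrete model (such as the Gaussian mixture below) would supply:

```python
def em(x, theta_init, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop (sketch): e_step returns the posteriors q(z|x; theta)
    and the current log likelihood; m_step maximizes the expected
    complete-data log likelihood with respect to theta."""
    theta = theta_init
    prev_ll = -float("inf")
    for _ in range(max_iter):
        q, log_lik = e_step(x, theta)   # E-step: q_i(z_i) = p(z_i | x_i; theta)
        theta = m_step(x, q)            # M-step: argmax_theta E_q[log p(x, z; theta)]
        if log_lik - prev_ll < tol:     # stop when the log likelihood stabilizes
            break
        prev_ll = log_lik
    return theta
```

Monitoring the change in the log likelihood is one common stopping criterion; monitoring the change in the parameters is another.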

The lower bound of $\ell(\theta)$ can also be expressed as $Q(\theta, q)$ (Murphy, 2012), where:

$$Q(\theta, q) = \sum_{i} \Big( \mathbb{E}_{q_i}\left[\log p(x_i, z_i; \theta)\right] + \mathbb{H}(q_i) \Big),$$
where $\mathbb{H}(q_i)$ is the Shannon entropy of $q_i$.
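Written out for a single term of the lower bound in (3), the split is:

$$\sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)} = \sum_{z_i} q_i(z_i) \log p(x_i, z_i; \theta) - \sum_{z_i} q_i(z_i) \log q_i(z_i) = \mathbb{E}_{q_i}\!\left[\log p(x_i, z_i; \theta)\right] + \mathbb{H}(q_i)\,.$$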

This is just a direct consequence of the properties of expectations and of the log function ($\log(a/b) = \log a - \log b$), together with the definition of Shannon entropy ($\mathbb{H}(X) = -\sum_{i=1}^n p(x_i) \log p(x_i)$). As stated before, $\mathbb{H}(q_i)$ does not depend on $\theta$. Furthermore, through the steps in Annex A, Murphy (2012) decomposes the lower bound $Q(\theta, q)$ obtained in (3) as the sum over $i$ of the terms $L(\theta, q_i) = \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)}$ and obtains:

$$L(\theta, q_i) = -\,\mathbb{KL}\big(q_i(z_i)\,\|\,p(z_i \mid x_i; \theta)\big) + \log p(x_i \mid \theta)\,. \tag{4}$$

We observe that $\log p(x_i \mid \theta)$ does not depend on $q_i$, and the lower bound $L(\theta, q_i)$ is maximized when the Kullback–Leibler divergence ($\mathbb{KL}$) term is zero. This implies that:

  • the best choice for $q_i(z_i)$ is to set it equal to $p(z_i \mid x_i; \theta)$, since the KL divergence (relative entropy, a measure of the divergence between two distributions) is zero only when the two distributions are identical and positive when they differ. This is equivalent to the "best choice of $q_i$" step described before.
  • when the $\mathbb{KL}$ term is zero, the lower bound $L(\theta, q_i)$ is maximized and equals $\log p(x_i \mid \theta)$ for the given choice of $\theta$.

For the E-step: because $\theta$ is not known, we choose an estimate $\theta^t$ and set $q_i^t(z_i) = p(z_i \mid x_i; \theta^t)$. By making the $\mathbb{KL}$ divergence zero, the lower bound $L(\theta, q_i^t)$ now equals $\log p(x_i \mid \theta^t)$ at $\theta^t$. Summing over $i$, the lower bound of $\ell(\theta)$, denoted $Q(\theta^t, \theta^t)$, equals $\ell(\theta^t)$. Once the gap is zero, the next step is to "push up" the lower bound. For the M-step: to "push up" the lower bound, we maximize $Q(\theta, q^t) = \sum_{i} \mathbb{E}_{q_i^t}[\log p(x_i, z_i; \theta)]$ (we can drop $\mathbb{H}(q_i^t)$) with respect to $\theta$ while keeping $q_i^t$ fixed. We get a new estimate of the parameters to be used in the next E-step:

$$\theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^t) = \arg\max_{\theta} \sum_{i} \mathbb{E}_{q_i^t}\left[\log p(x_i, z_i; \theta)\right]$$

The results are equivalent to those presented before; however, this representation helps in understanding the iterative process in terms of the KL divergence: when we maximize the lower bound $Q(\theta, q^t)$ in the M-step, the lower bound increases, but $\ell(\theta)$ increases at least as much because of the KL divergence term. The reason is that the maximization of the lower bound uses the distribution $q_i^t$ based on $\theta^t$ to find the new $\theta^{t+1}$; the KL divergence becomes positive again because the new posterior $p(z_i \mid x_i; \theta^{t+1})$ will generally differ from $q_i^t(z_i) = p(z_i \mid x_i; \theta^t)$. In this regard, unless $\ell(\theta)$ is already at a maximum, the increase in $\ell(\theta)$ is at least as large as the increase in the lower bound achieved in the M-step. The iterations are repeated until convergence.
Note: Murphy (2012) first defines $Q(\theta, q)$ and later introduces $Q(\theta, \theta^t)$. This is just to highlight that $q$ depends on the current value $\theta^t$.
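As an illustrative check of decomposition (4), not found in the references, one can verify numerically on a toy discrete model (made-up probabilities) that $L(\theta, q_i) = -\mathbb{KL}(q_i \,\|\, p(z_i \mid x_i; \theta)) + \log p(x_i \mid \theta)$, and that the bound is tight when $q_i$ equals the posterior:

```python
import numpy as np

# Toy discrete joint p(x_i, z; theta) over 3 latent values (made-up numbers).
joint = np.array([0.10, 0.25, 0.05])          # p(x_i, z; theta) for z = 0, 1, 2
evidence = joint.sum()                        # p(x_i; theta), marginalizing z
posterior = joint / evidence                  # p(z | x_i; theta)

q = np.array([0.2, 0.5, 0.3])                 # an arbitrary distribution over z

lower_bound = np.sum(q * np.log(joint / q))   # L(theta, q_i) as in (3)
kl = np.sum(q * np.log(q / posterior))        # KL(q_i || p(z | x_i; theta))

# Decomposition (4): lower bound = -KL + log p(x_i; theta)
assert np.isclose(lower_bound, -kl + np.log(evidence))

# With q_i set to the posterior (the E-step choice) the KL term vanishes
# and the bound equals log p(x_i; theta) exactly.
assert np.isclose(np.sum(posterior * np.log(joint / posterior)), np.log(evidence))
```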

[Figure: log likelihood and its lower bound (Murphy, p. 365, 2012)]

The dashed red curve represents the observed-data log likelihood $\ell(\theta)$, while the blue and green curves represent, respectively, the lower bounds $Q(\theta, \theta^t)$ and $Q(\theta, \theta^{t+1})$, tangent to $\ell(\theta)$ at $\theta^t$ and $\theta^{t+1}$. The E-step makes $Q(\theta, \theta^t)$ touch $\ell(\theta)$ at $\theta^t$ by setting $q_i^t(z_i) = p(z_i \mid x_i; \theta^t)$. In the M-step we find $\theta^{t+1}$ by maximizing $Q(\theta, \theta^t)$. The difference between $\ell(\theta^{t+1})$ and $Q(\theta^{t+1}, \theta^t)$ corresponds to the $\mathbb{KL}$ divergence. In the next E-step we produce a new lower bound $Q(\theta, \theta^{t+1})$ by setting $q_i^{t+1}(z_i) = p(z_i \mid x_i; \theta^{t+1})$. The E-step and M-step are repeated until convergence.

Proof of convergence:

We need to show that the EM iterations monotonically increase the observed-data log likelihood $\ell(\theta)$ until a local optimum is found. First, for any $\theta$ we have $\ell(\theta) \geq Q(\theta, \cdot)$ (the lower bound is less than or equal to $\ell(\theta)$, in particular at $\theta^{t+1}$). Second, after the lower bound is maximized in the M-step, $Q(\theta^{t+1}, \theta^t) = \max_{\theta} Q(\theta, \theta^t) \geq Q(\theta^t, \theta^t)$. Finally, by the E-step, $Q(\theta^t, \theta^t) = \ell(\theta^t)$. Therefore, we have:

$$\ell(\theta^{t+1}) \geq Q(\theta^{t+1}, \theta^t) \geq Q(\theta^t, \theta^t) = \ell(\theta^t)\,.$$

Remark on coordinate ascent (Hastie et al., 2009): By defining

$$J(q, \theta) = \sum_{i} \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)}\,,$$

we can observe that $\ell(\theta) \geq J(q, \theta)$, as presented in (3). Then the EM algorithm can be understood as coordinate ascent on $J$: the E-step maximizes $J$ with respect to $q$ and the M-step maximizes it with respect to $\theta$.

Properties:

Some interesting properties of EM are:

  • Easy implementation, both analytically and computationally, and numerical stability. Easy to program, with small storage requirements.

The monotone increase of the likelihood can be used for debugging (see the sketch after this list). The cost per iteration is relatively low compared to other algorithms. Among its uses, it can provide estimates of missing data (missing at random) as well as incorporate latent-variable modelling (McLachlan et al., 2004).

  • In addition, the EM algorithm can also compute maximum a posteriori (MAP) estimates by incorporating prior knowledge about $\theta$.

For this, the only modification needed is in the M-step, where we would seek to maximize $Q(\theta, q^t) + \log p(\theta)$ (Bishop, 2006).

  • Regarding the optimization, it is possible that the log likelihood $\ell(\theta)$ has multiple local maxima.

A good practice is to run the EM algorithm several times with different initial values of $\theta$. The ML estimate of $\theta$ is the one corresponding to the best local maximum found. Some drawbacks are (McLachlan et al., 2004):

  • Slow rate of convergence.
  • The E- or M-steps may not be analytically tractable.
  • The EM algorithm does not automatically provide an estimation of the parameter’s covariance matrix. Additional methodologies are needed.
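
As mentioned in the debugging remark above, the guaranteed monotone increase of the likelihood gives a simple sanity check. A minimal sketch (the `em_iteration` and `log_likelihood` functions are hypothetical placeholders for the model at hand):

```python
def run_em_with_check(x, theta, em_iteration, log_likelihood, max_iter=100):
    """Run EM while asserting the monotone increase of the observed-data
    log likelihood, a property that is useful for debugging EM code."""
    prev_ll = log_likelihood(x, theta)
    for _ in range(max_iter):
        theta = em_iteration(x, theta)      # one E-step followed by one M-step
        curr_ll = log_likelihood(x, theta)
        # Up to numerical noise, an iteration must not decrease the likelihood;
        # a violation almost always signals a bug in the E- or M-step.
        assert curr_ll >= prev_ll - 1e-8, "log likelihood decreased: check E/M steps"
        prev_ll = curr_ll
    return theta
```
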
Competing methods (Springer and Urban, 2014):

Newton methods for optimization can also be considered for maximum likelihood estimation. Newton methods work on any twice-differentiable function and generally converge faster than EM implementations. Nevertheless, Newton methods are costlier as they involve the calculation of second derivatives (the Hessian matrix).

Example - Mixture of Gaussians

The motivation for Gaussian mixture models (GMM) is to obtain a richer class of densities by a linear superposition of Gaussian components with unknown parameters. This model can be expressed as $p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$, with $\pi_k$ denoting the mixing coefficients of the $K$ components ($0 \leq \pi_k \leq 1$ and $\sum_k \pi_k = 1$). The mixing component corresponds to a kind of subpopulation assignment of the data. As this assignment must be learnt by the model, the GMM is a form of unsupervised learning.

Nevertheless, the log likelihood of this original model is difficult to maximize. If we use a latent variable $z$ with a 1-of-$K$ binary representation ($z_k = 1$ if the element belongs to the $k$-th component and $z_j = 0$ for $j \neq k$, with $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$), we get $p(x) = \sum_z p(z) p(x \mid z) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$, with marginal distribution over $z$ given by $p(z) = \prod_{k=1}^K \pi_k^{z_k}$ (or equivalently $p(z_k = 1) = \pi_k$) and Gaussian conditional distribution of $x$ given a particular value of $z$: $p(x \mid z) = \prod_{k=1}^K \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$. We can see that for every observed data point $x_i$ there is a latent variable $z_i$ and, just as importantly, we can now work with the joint distribution $p(x, z)$ in the EM algorithm. The most important realization in the latent-variable version of the GMM is that the expected value of $z_{ij}$ under the posterior distribution, $\mathbb{E}[z_{ij}]$, is equivalent to $q_i(z_i = j) := p(z_i = j \mid x_i; \theta)$ and can be written as:

$$\mathbb{E}[z_{ij}] = p(z_i = j \mid x_i; \theta) = w_{ij} = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^K \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}$$

For the EM algorithm, we start with some initial guesses for $\theta$: $\pi_j, \mu_j, \Sigma_j$, and iterate until convergence. E-step: calculate $w_{ij} = p(z_i = j \mid x_i; \theta)$. M-step: maximize with respect to $\theta$:

$$\sum_{i=1}^m \sum_{j=1}^K q_i(z_i = j) \log \frac{p(x_i \mid z_i = j; \mu_j, \Sigma_j)\, p(z_i = j; \pi)}{q_i(z_i = j)}$$

Substituting the marginal and conditional distributions described above and maximizing, we get the updates:

$$\mu_j = \frac{\sum_{i=1}^m w_{ij} x_i}{\sum_{i=1}^m w_{ij}}, \quad \pi_j = \frac{1}{m} \sum_{i=1}^m w_{ij}, \quad \Sigma_j = \frac{\sum_{i=1}^m w_{ij} (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^m w_{ij}}$$

One consideration is to avoid solutions that collapse into a singularity (infinite likelihood when $\Sigma_j \to 0$ and $\mu_j$ equals one data point). MAP estimation with priors on $\theta$ can help with this potential problem (Bishop, 2006).
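To tie the example together, here is a compact NumPy sketch of these E-step and M-step updates (a minimal illustration, not a reference implementation; the initialization and the safeguard against singular covariances are kept deliberately simple):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture with K components on data X of shape (m, d)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    # Simple initial guesses for theta = (pi, mu, Sigma).
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(m, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = p(z_i = j | x_i; theta)
        w = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
            for j in range(K)
        ])
        w /= w.sum(axis=1, keepdims=True)

        # M-step: closed-form updates for pi_j, mu_j, Sigma_j
        Nk = w.sum(axis=0)                          # effective count per component
        pi = Nk / m
        mu = (w.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nk[j]
            Sigma[j] += 1e-6 * np.eye(d)            # guard against singular covariances
    return pi, mu, Sigma
```

Running `gmm_em` with `K=2` on data drawn from two well-separated Gaussians should recover mixing weights and means close to the true values.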

References:
  • Ng, Andrew. CS229 Lecture notes 8 - The EM Algorithm. 2017. http://cs229.stanford.edu/notes/cs229-notes8.pdf
  • Ng, Andrew. CS229 Lecture notes 7b - Mixture of Gaussians and the EM algorithm. 2017. http://cs229.stanford.edu/notes/cs229-notes7b.pdf
  • Murphy, Kevin P. Chapter 11: Mixture Models and the EM algorithm. Machine Learning: A Probabilistic Perspective, The MIT Press. 2012.
  • Bishop, Christopher M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg. 2006.
  • McLachlan, Geoffrey J.; Krishnan, Thriyambakam; Ng, See Ket. The EM Algorithm, Papers. Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE), No. 24. 2004.
  • Hastie, Trevor; Robert Tibshirani; J H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2009.
  • Springer, T.; Urban, K. Comparison of the EM algorithm and alternatives. Numerical Algorithms, Vol. 67, Issue 2, p. 335. 2014.


Written by Miguel Ibanez Salinas, who lives and works in Montreal, reading, watching and building things. You can check my LinkedIn.