The EM algorithm

September 04, 2022

80-629-17A - Machine Learning for Large-Scale Data Analysis and Decision Making (Graduate course) - HEC Montreal – Homework (2018/10/12). Option Model – The EM algorithm

Expectation Maximization (EM) is a general-purpose iterative algorithm for computing maximum likelihood estimates in the presence of incomplete data. The basic intuition for applying EM is that maximizing the likelihood of the observed data directly is difficult, while the enlarged or complete case, which includes the unobserved data, makes the calculation easier. The iterative stages are usually called the E-step and the M-step. EM is a very popular algorithm for parameter estimation in models involving missing values or latent variables.

Preliminary considerations:

Given a training set $\{x_1, \dots, x_m\}$ of observed independent examples, and denoting by $\{z_1, \dots, z_m\}$ the set of all latent variables and by $\theta$ the set of all model parameters, we need to fit the parameters of the model $p(x, z)$ to the data. The corresponding log likelihood is given by:

$$\ell(\theta) = \sum_{i=1}^m \log p(x_i; \theta) = \sum_{i=1}^m \log \sum_{z_i} p(x_i, z_i; \theta)\,.$$

We can observe that the summation over the latent variable $z$ appears inside the logarithm, and this sum prevents the logarithm from acting directly on the joint distribution. Also, the marginal distribution $p(X; \theta)$ typically does not belong to the exponential family even if the joint distribution $p(X, Z; \theta)$ does (Bishop, 2006). As a result, the maximum likelihood solution becomes a complicated expression that is often difficult to calculate. However, if the latent variable $Z$ were observed, the complete case would be $\{X, Z\}$ and the corresponding log likelihood would take the form $\log p(X, Z; \theta)$. The maximization of this complete-data log likelihood is assumed to be easier. In practice, we only have the incomplete data set $X$, and what we know about $Z$ is given by its posterior distribution $p(Z \mid X; \theta)$.

Notes: the current discussion is centered on the summation over $Z$. In case $Z$ is a continuous latent variable, the summation is replaced with an integral over $Z$. The semicolon in the joint probability denotes conditioning on a given realization of the parameters.

Description:

We have introduced the log likelihood $\ell(\theta)$ (Ng, 2017):

$$\ell(\theta) = \sum_{i} \log p(x_i; \theta) = \sum_{i} \log \sum_{z_i} p(x_i, z_i; \theta)\,. \tag{1}$$

For each $i$, let $q_i$ be some probability distribution over the $z$'s ($\sum_{z} q_i(z) = 1$ and $q_i(z) \geq 0$). Then we have that:

$$\ell(\theta) = \sum_{i} \log p(x_i; \theta) = \sum_{i} \log \sum_{z_i} q_i(z_i) \frac{p(x_i, z_i; \theta)}{q_i(z_i)} \tag{2}$$
$$\geq \sum_{i} \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)} \tag{3}$$

In step (2), we simply multiply and divide by $q_i(z_i)$ inside the sum. In step (3), Jensen's inequality is used with the following considerations:

  • since the log function $f$ is concave (if $f(x) = \log x$ then $f''(x) = -1/x^2 < 0$ for $x \in \mathbb{R}^+$), we have $\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])$; and
  • the summation term $\sum_{z_i} q_i(z_i) \frac{p(x_i, z_i; \theta)}{q_i(z_i)}$ in step (2) can be viewed as the expectation of the quantity $\frac{p(x_i, z_i; \theta)}{q_i(z_i)}$ with respect to $z_i$ drawn from the distribution $q_i$.

This can also be written as $\mathbb{E}_{z_i \sim q_i}\left[\frac{p(x_i, z_i; \theta)}{q_i(z_i)}\right]$, where the subscript $z_i \sim q_i$ indicates that the expectation is taken with respect to $z_i$ drawn from $q_i$. Finally, Jensen's inequality gives us

$$\log\left(\mathbb{E}_{z_i \sim q_i}\left[\frac{p(x_i, z_i; \theta)}{q_i(z_i)}\right]\right) \geq \mathbb{E}_{z_i \sim q_i}\left[\log \frac{p(x_i, z_i; \theta)}{q_i(z_i)}\right]$$

and we use this to go from step (2) to step (3). Critically, the result in step (3) gives us a lower bound on $\ell(\theta)$ for any set of distributions $q_i$.
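As a quick numerical sanity check of this bound (not part of the derivation), we can pick an arbitrary discrete distribution $q$ and arbitrary positive ratios $r(z) = p(x, z; \theta)/q(z)$ and verify that $\log \mathbb{E}_q[r] \geq \mathbb{E}_q[\log r]$; the numbers below are made up:

```python
import numpy as np

# A toy discrete distribution q over 4 latent values and arbitrary
# positive ratios r(z) = p(x, z; theta) / q(z) (made-up numbers).
q = np.array([0.1, 0.2, 0.3, 0.4])
r = np.array([2.0, 0.5, 1.5, 0.8])

lhs = np.log(np.sum(q * r))     # log of the expectation, log E_q[r]
rhs = np.sum(q * np.log(r))     # expectation of the log, E_q[log r]

print(lhs, rhs)
assert lhs >= rhs               # Jensen's inequality for the concave log
```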

We hold the parameters $\theta$ at a given value (the current guess) and ask what the best choice of $q_i$ is. Intuitively, the best choice is the one that makes the gap between $\ell(\theta)$ and its lower bound as tight as possible. Jensen's inequality becomes an equality, $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$, when $f$ is linear or $X$ is constant. Therefore, for a particular $\theta$, the inequality in step (3) becomes an equality (the gap becomes zero) when the expectation is taken over a constant (since the log function is not linear). In other words, we need that

$$\frac{p(x_i, z_i; \theta)}{q_i(z_i)} = c$$

for some constant $c$ that does not depend on $z_i$. We can ensure this by choosing

$$q_i(z_i) \propto p(x_i, z_i; \theta)\,.$$

Because $q_i$ is a probability distribution we must enforce that $\sum_{z} q_i(z) = 1$. Then we obtain the following derivation:

$$q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z} p(x_i, z; \theta)} \tag*{(normalize to sum to 1)}$$
$$= \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} \tag*{($z$ has been "marginalized out")}$$
$$= p(z_i \mid x_i; \theta) \tag*{(by conditional probability)}$$
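To make the E-step concrete, here is a small illustrative sketch for a two-component 1-D Gaussian mixture with made-up parameters: the posterior $q_i(z_i) = p(z_i \mid x_i; \theta)$ is simply the joint $p(x_i, z_i; \theta)$ normalized over $z_i$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical current parameters theta for a 2-component 1-D mixture.
pi = np.array([0.3, 0.7])      # mixing coefficients
mu = np.array([-1.0, 2.0])     # component means
sigma = np.array([1.0, 0.5])   # component standard deviations

x_i = 1.2                      # a single observed example

# Joint p(x_i, z_i = k; theta) = pi_k * N(x_i | mu_k, sigma_k^2)
joint = pi * norm.pdf(x_i, loc=mu, scale=sigma)

# E-step: posterior q_i(z_i) = joint / sum_z joint = p(z_i | x_i; theta)
q_i = joint / joint.sum()
print(q_i)                     # sums to 1; the "responsibility" of each component
```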

This result implies that, by setting the $q_i$'s to the posterior distribution of $z_i$ given $x_i$ and the current parameter value $\theta$, the lower bound equals the log likelihood $\ell(\theta)$ (the tightest bound possible), making tangential contact at the given $\theta$. This is the E-step. In the next stage, the M-step, we maximize ("push up") the lower bound of $\ell(\theta)$ stated in step (3) with respect to the parameters $\theta$ appearing in $\log p(x_i, z_i; \theta)$, while keeping the $q_i$ from the E-step fixed. The E-step and M-step are repeated until convergence of the lower bound or of the parameters is reached.

In summary, the EM algorithm is implemented as follows:

  • For a joint distribution $p(X, Z; \theta)$ governed by parameters $\theta$, we try to maximize $\ell(\theta) = \log p(X; \theta)$ with respect to $\theta$.

1- Choose initial values for $\theta$.

  • Repeat 2 and 3 until convergence:

2- E-step: for each $i$

$$q_i(z_i) := p(z_i \mid x_i; \theta)$$

3- M-step: Set

$$\theta := \arg\max_{\theta} \sum_{i} \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)}$$

In the E-step we use the current value (or guess) $\theta^t$ to find the posterior distribution of the unobserved/latent data, $p(z_i \mid x_i; \theta^t)$. Our knowledge of $Z$ is given only by this posterior distribution. In general, we cannot use the complete-data log likelihood directly, but we can use instead its expected value under the posterior distribution of the latent variable. In the M-step we maximize this expectation with respect to $\theta$, as it appears in the complete-data term $\log p(x_i, z_i; \theta)$, while using the posterior $q_i(z_i) := p(z_i \mid x_i; \theta^t)$ found in the E-step. We can also show that the M-step objective can be stated as $\sum_{i} \sum_{z_i} q_i(z_i) \log p(x_i, z_i; \theta)$: the denominator $q_i(z_i)$ in the log term of the M-step objective does not depend on the $\theta$ we are maximizing over and can be treated as a constant during the optimization.
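The E-step/M-step loop above can be captured in a short, model-agnostic sketch; the `e_step` and `m_step` arguments are hypothetical callbacks that a concrete model (such as the Gaussian mixture below) would supply:

```python
def em(x, theta_init, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop (sketch): e_step returns the posteriors q(z|x; theta)
    and the current log likelihood; m_step maximizes the expected
    complete-data log likelihood with respect to theta."""
    theta = theta_init
    prev_ll = -float("inf")
    for _ in range(max_iter):
        q, log_lik = e_step(x, theta)   # E-step: q_i(z_i) = p(z_i | x_i; theta)
        theta = m_step(x, q)            # M-step: argmax_theta E_q[log p(x, z; theta)]
        if log_lik - prev_ll < tol:     # stop when the log likelihood stabilizes
            break
        prev_ll = log_lik
    return theta
```

Monitoring the change in the log likelihood is one common stopping criterion; monitoring the change in the parameters is another.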

The lower bound of $\ell(\theta)$ can also be expressed as $Q(\theta, q)$ (Murphy, 2012), where:

$$Q(\theta, q) = \sum_{i} \Big( \mathbb{E}_{q_i}\left[\log p(x_i, z_i; \theta)\right] + \mathbb{H}(q_i) \Big),$$
where $\mathbb{H}(q_i)$ is the Shannon entropy of $q_i$.
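Written out for a single term of the lower bound in (3), the split is:

$$\sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)} = \sum_{z_i} q_i(z_i) \log p(x_i, z_i; \theta) - \sum_{z_i} q_i(z_i) \log q_i(z_i) = \mathbb{E}_{q_i}\!\left[\log p(x_i, z_i; \theta)\right] + \mathbb{H}(q_i)\,.$$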

This is just a direct consequence of the properties of expectations and of the log function ($\log(a/b) = \log a - \log b$), together with the definition of Shannon entropy ($\mathbb{H}(X) = -\sum_{i=1}^n p(x_i) \log p(x_i)$). As stated before, $\mathbb{H}(q_i)$ does not depend on $\theta$. Furthermore, through the steps in Annex A, Murphy (2012) decomposes the lower bound $Q(\theta, q)$ obtained in (3) as the sum over $i$ of the terms $L(\theta, q_i) = \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)}$ and obtains:

$$L(\theta, q_i) = -\,\mathbb{KL}\big(q_i(z_i)\,\|\,p(z_i \mid x_i; \theta)\big) + \log p(x_i \mid \theta)\,. \tag{4}$$

We observe that $\log p(x_i \mid \theta)$ does not depend on $q_i$, and the lower bound $L(\theta, q_i)$ is maximized when the Kullback–Leibler divergence ($\mathbb{KL}$) term is zero. This implies that:

  • the best choice for $q_i(z_i)$ is to set it equal to $p(z_i \mid x_i; \theta)$, since the KL divergence (relative entropy, a measure of the divergence between two distributions) is zero only when the two distributions are identical and positive when they differ. This is equivalent to the "best choice of $q_i$" step described before.
  • when the $\mathbb{KL}$ term is zero, the lower bound $L(\theta, q_i)$ is maximized and equals $\log p(x_i \mid \theta)$ for the given choice of $\theta$.

For the E-step: because $\theta$ is not known, we choose an estimate $\theta^t$ and set $q_i^t(z_i) = p(z_i \mid x_i; \theta^t)$. By making the $\mathbb{KL}$ divergence zero, the lower bound $L(\theta, q_i^t)$ now equals $\log p(x_i \mid \theta^t)$ at $\theta^t$. Summing over $i$, the lower bound of $\ell(\theta)$, denoted $Q(\theta^t, \theta^t)$, equals $\ell(\theta^t)$. Once the gap is zero, the next step is to "push up" the lower bound. For the M-step: to "push up" the lower bound, we maximize $Q(\theta, q^t) = \sum_{i} \mathbb{E}_{q_i^t}[\log p(x_i, z_i; \theta)]$ (we can drop $\mathbb{H}(q_i^t)$) with respect to $\theta$ while keeping $q_i^t$ fixed. We get a new estimate of the parameters to be used in the next E-step:

$$\theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^t) = \arg\max_{\theta} \sum_{i} \mathbb{E}_{q_i^t}\left[\log p(x_i, z_i; \theta)\right]$$

The results are equivalent to those presented before; however, this representation helps in understanding the iterative process in terms of the KL divergence: when we maximize the lower bound $Q(\theta, q^t)$ in the M-step, the lower bound increases, but $\ell(\theta)$ increases at least as much because of the KL divergence term. The reason is that the maximization of the lower bound uses the distribution $q_i^t$ based on $\theta^t$ to find the new $\theta^{t+1}$; the KL divergence becomes positive again because the new posterior $p(z_i \mid x_i; \theta^{t+1})$ will generally differ from $q_i^t(z_i) = p(z_i \mid x_i; \theta^t)$. In this regard, unless $\ell(\theta)$ is already at a maximum, the increase in $\ell(\theta)$ is at least as large as the increase in the lower bound achieved in the M-step. The iterations are repeated until convergence.
Note: Murphy (2012) first defines $Q(\theta, q)$ and later introduces $Q(\theta, \theta^t)$. This is just to highlight that $q$ depends on the current value $\theta^t$.
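As an illustrative check of decomposition (4), not found in the references, one can verify numerically on a toy discrete model (made-up probabilities) that $L(\theta, q_i) = -\mathbb{KL}(q_i \,\|\, p(z_i \mid x_i; \theta)) + \log p(x_i \mid \theta)$, and that the bound is tight when $q_i$ equals the posterior:

```python
import numpy as np

# Toy discrete joint p(x_i, z; theta) over 3 latent values (made-up numbers).
joint = np.array([0.10, 0.25, 0.05])          # p(x_i, z; theta) for z = 0, 1, 2
evidence = joint.sum()                        # p(x_i; theta), marginalizing z
posterior = joint / evidence                  # p(z | x_i; theta)

q = np.array([0.2, 0.5, 0.3])                 # an arbitrary distribution over z

lower_bound = np.sum(q * np.log(joint / q))   # L(theta, q_i) as in (3)
kl = np.sum(q * np.log(q / posterior))        # KL(q_i || p(z | x_i; theta))

# Decomposition (4): lower bound = -KL + log p(x_i; theta)
assert np.isclose(lower_bound, -kl + np.log(evidence))

# With q_i set to the posterior (the E-step choice) the KL term vanishes
# and the bound equals log p(x_i; theta) exactly.
assert np.isclose(np.sum(posterior * np.log(joint / posterior)), np.log(evidence))
```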

[Figure: log likelihood and its lower bound (Murphy, p. 365, 2012)]

The dashed red curve represents the observed-data log likelihood $\ell(\theta)$, while the blue and green curves represent, respectively, the lower bounds $Q(\theta, \theta^t)$ and $Q(\theta, \theta^{t+1})$, tangent to $\ell(\theta)$ at $\theta^t$ and $\theta^{t+1}$. The E-step makes $Q(\theta, \theta^t)$ touch $\ell(\theta)$ at $\theta^t$ by setting $q_i^t(z_i) = p(z_i \mid x_i; \theta^t)$. In the M-step we find $\theta^{t+1}$ by maximizing $Q(\theta, \theta^t)$. The difference between $\ell(\theta^{t+1})$ and $Q(\theta^{t+1}, \theta^t)$ corresponds to the $\mathbb{KL}$ divergence. In the next E-step we produce a new lower bound $Q(\theta, \theta^{t+1})$ by setting $q_i^{t+1}(z_i) = p(z_i \mid x_i; \theta^{t+1})$. The E-step and M-step are repeated until convergence.

Proof of convergence:

We need to show that the EM iterations monotonically increase the observed-data log likelihood $\ell(\theta)$ until a local optimum is found. First, for any $\theta$ we have $\ell(\theta) \geq Q(\theta, \cdot)$ (the lower bound is less than or equal to $\ell(\theta)$, in particular at $\theta^{t+1}$). Second, after the lower bound is maximized in the M-step, $Q(\theta^{t+1}, \theta^t) = \max_{\theta} Q(\theta, \theta^t) \geq Q(\theta^t, \theta^t)$. Finally, by the E-step, $Q(\theta^t, \theta^t) = \ell(\theta^t)$. Therefore, we have:

$$\ell(\theta^{t+1}) \geq Q(\theta^{t+1}, \theta^t) \geq Q(\theta^t, \theta^t) = \ell(\theta^t)\,.$$

Remark on coordinate ascent (Hastie et al., 2009): By defining

$$J(q, \theta) = \sum_{i} \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{q_i(z_i)}\,,$$

we can observe that $\ell(\theta) \geq J(q, \theta)$, as presented in (3). Then the EM algorithm can be understood as coordinate ascent on $J$: the E-step maximizes $J$ with respect to $q$ and the M-step maximizes it with respect to $\theta$.

Properties:

Some interesting properties of EM are:

  • Easy implementation, both analytically and computationally, and numerical stability. Easy to program, with small storage requirements.

The monotone increase of the likelihood can be used for debugging (see the sketch after this list). The cost per iteration is relatively low compared to other algorithms. Among its uses, it can provide estimates of missing data (missing at random) as well as incorporate latent-variable modelling (McLachlan et al., 2004).

  • In addition, the EM algorithm can also compute maximum a posteriori (MAP) estimates by incorporating prior knowledge about $\theta$.

For this, the only modification needed is in the M-step, where we would seek to maximize $Q(\theta, q^t) + \log p(\theta)$ (Bishop, 2006).

  • Regarding the optimization, it is possible that the log likelihood $\ell(\theta)$ has multiple local maxima.

A good practice is to run the EM algorithm several times with different initial values of $\theta$. The ML estimate of $\theta$ is the one corresponding to the best local maximum found. Some drawbacks are (McLachlan et al., 2004):

  • Slow rate of convergence.
  • The E- or M-steps may not be analytically tractable.
  • The EM algorithm does not automatically provide an estimation of the parameter’s covariance matrix. Additional methodologies are needed.
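
As mentioned in the debugging remark above, the guaranteed monotone increase of the likelihood gives a simple sanity check. A minimal sketch (the `em_iteration` and `log_likelihood` functions are hypothetical placeholders for the model at hand):

```python
def run_em_with_check(x, theta, em_iteration, log_likelihood, max_iter=100):
    """Run EM while asserting the monotone increase of the observed-data
    log likelihood, a property that is useful for debugging EM code."""
    prev_ll = log_likelihood(x, theta)
    for _ in range(max_iter):
        theta = em_iteration(x, theta)      # one E-step followed by one M-step
        curr_ll = log_likelihood(x, theta)
        # Up to numerical noise, an iteration must not decrease the likelihood;
        # a violation almost always signals a bug in the E- or M-step.
        assert curr_ll >= prev_ll - 1e-8, "log likelihood decreased: check E/M steps"
        prev_ll = curr_ll
    return theta
```
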
Competing methods (Springer and Urban, 2014):

Newton methods for optimization can also be considered for maximum likelihood estimation. Newton methods work on any twice-differentiable function and generally converge faster than EM implementations. Nevertheless, Newton methods are costlier as they involve the calculation of second derivatives (the Hessian matrix).

Example - Mixture of Gaussians

The motivation for Gaussian mixture models (GMM) is to obtain a richer class of densities by a linear superposition of Gaussian components with unknown parameters. This model can be expressed as $p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$, with $\pi_k$ denoting the mixing coefficients of the $K$ components ($0 \leq \pi_k \leq 1$ and $\sum_k \pi_k = 1$). The mixing component corresponds to a kind of subpopulation assignment of the data. As this assignment must be learnt by the model, the GMM is a form of unsupervised learning.

Nevertheless, the log likelihood of this original model is difficult to maximize. If we use a latent variable $z$ with a 1-of-$K$ binary representation ($z_k = 1$ if the element belongs to the $k$-th component and $z_j = 0$ for $j \neq k$, with $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$), we get $p(x) = \sum_z p(z) p(x \mid z) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$, with marginal distribution over $z$ given by $p(z) = \prod_{k=1}^K \pi_k^{z_k}$ (or equivalently $p(z_k = 1) = \pi_k$) and Gaussian conditional distribution of $x$ given a particular value of $z$: $p(x \mid z) = \prod_{k=1}^K \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$. We can see that for every observed data point $x_i$ there is a latent variable $z_i$ and, just as importantly, we can now work with the joint distribution $p(x, z)$ in the EM algorithm. The most important realization in the latent-variable version of the GMM is that the expected value of $z_{ij}$ under the posterior distribution, $\mathbb{E}[z_{ij}]$, is equivalent to $q_i(z_i = j) := p(z_i = j \mid x_i; \theta)$ and can be written as:

$$\mathbb{E}[z_{ij}] = p(z_i = j \mid x_i; \theta) = w_{ij} = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^K \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}$$

For the EM algorithm, we start with some initial guesses for $\theta$: $\pi_j, \mu_j, \Sigma_j$, and iterate until convergence. E-step: calculate $w_{ij} = p(z_i = j \mid x_i; \theta)$. M-step: maximize with respect to $\theta$:

$$\sum_{i=1}^m \sum_{j=1}^K q_i(z_i = j) \log \frac{p(x_i \mid z_i = j; \mu_j, \Sigma_j)\, p(z_i = j; \pi)}{q_i(z_i = j)}$$

Substituting the marginal and conditional distributions described above and maximizing, we get the updates:

$$\mu_j = \frac{\sum_{i=1}^m w_{ij} x_i}{\sum_{i=1}^m w_{ij}}, \quad \pi_j = \frac{1}{m} \sum_{i=1}^m w_{ij}, \quad \Sigma_j = \frac{\sum_{i=1}^m w_{ij} (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^m w_{ij}}$$

One consideration is to avoid solutions that collapse into a singularity (infinite likelihood when $\Sigma_j \to 0$ and $\mu_j$ equals one data point). MAP estimation with priors on $\theta$ can help with this potential problem (Bishop, 2006).
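To tie the example together, here is a compact NumPy sketch of these E-step and M-step updates (a minimal illustration, not a reference implementation; the initialization and the safeguard against singular covariances are kept deliberately simple):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture with K components on data X of shape (m, d)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    # Simple initial guesses for theta = (pi, mu, Sigma).
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(m, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = p(z_i = j | x_i; theta)
        w = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
            for j in range(K)
        ])
        w /= w.sum(axis=1, keepdims=True)

        # M-step: closed-form updates for pi_j, mu_j, Sigma_j
        Nk = w.sum(axis=0)                          # effective count per component
        pi = Nk / m
        mu = (w.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nk[j]
            Sigma[j] += 1e-6 * np.eye(d)            # guard against singular covariances
    return pi, mu, Sigma
```

Running `gmm_em` with `K=2` on data drawn from two well-separated Gaussians should recover mixing weights and means close to the true values.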

References:
  • Ng, Andrew. CS229 Lecture notes 8 - The EM Algorithm. 2017. http://cs229.stanford.edu/notes/cs229-notes8.pdf
  • Ng, Andrew. CS229 Lecture notes 7b - Mixture of Gaussians and the EM algorithm. 2017. http://cs229.stanford.edu/notes/cs229-notes7b.pdf
  • Murphy, Kevin P. Chapter 11: Mixture Models and the EM algorithm. Machine Learning: A Probabilistic Perspective, The MIT Press. 2012.
  • Bishop, Christopher M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg. 2006.
  • McLachlan, Geoffrey J.; Krishnan, Thriyambakam; Ng, See Ket. The EM Algorithm, Papers. Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE), No. 24. 2004.
  • Hastie, Trevor; Robert Tibshirani; J H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2009.
  • Springer, T.; Urban, K. Comparison of the EM algorithm and alternatives. Numerical Algorithms, Vol. 67, Issue 2, p. 335. 2014.


Written by Miguel Ibanez Salinas, who lives and works in Montreal, reading, watching and building things. You can check my LinkedIn.