#### Abstract

In this post we introduce a common view on Expectation Maximization using Jensen’s inequality.

Andrew’s note

### EM and maximum likelihood estimation

The goal of EM is to maximize the likelihood of observed variable X explained by hidden variable Z under a model with parameters $\theta$. Noted that when Z is known (usually as labels), this problem becomes a supervised learning problem and it can be solved directly using maximum likelihood estimaation instead of EM. For a training dataset with size $n$, the expected log likelihood of a dataset with instances $x_1$, $x_2$, … $x_n$ drawn from distribution $p_{true}(x)$ is given by:

Our objective is to maximize (0) by adjusting $\theta$

### Derivation from Jensen’s inequality

Starting from (1), derivation using Jensen’s inequality adds a “helper” distribution $q$ such that

(2) is greater than or equal to (3) because Jensen’s inequaitliy states that $f(E[Y]) \geq E[f(Y)]$ if $f$ is a concave function (for more details, look at Andrew’s note or wikipedia). In our case, $f$ is $log$ which is a concave function and $Y$ is $\frac{p(x_i , z_i; \theta)}{q(z_i)}$. Using (3) as a lower bound of (2), we want to raise (3) as close as possible to (2) such that (3) $=$ (2) to maximize the likelihood. For this, Jensen’s inequality also states that when $Y$ is constant, $f(E[Y]) = E[f(Y)]$. Therefore, we can construct $q$ such that Y is equal to some constant c as follow

If we assume each instance has the same corresponding constant c, then we have

Given $q(z_i)$ is a probability distribution and $\sum_{z_i} q(z_i) = 1$, we have

Therefore, $q(z_i)$ can be written as follow

### Expectation Maximization

(7) is our E step

Using (7) to replace $q(z_i)$ in (3), we obtain our M step assuming $p_{\theta}(x_i \vert z_i)$ is constant

For (5), it seems this assumption is necessary but ignored from Andrew’s note and some other sources. How does such assumption affect the construction of $q(z_i)$?