In this post we introduce a common derivation of Expectation Maximization (EM) using Jensen's inequality.


Andrew Ng's CS229 notes on EM are the main reference for this post (referred to below as "Andrew's note").

EM and maximum likelihood estimation

The goal of EM is to maximize the likelihood of the observed variable $X$, explained by a hidden variable $Z$, under a model with parameters $\theta$. Note that when $Z$ is known (usually as labels), this problem becomes a supervised learning problem and can be solved directly using maximum likelihood estimation instead of EM. For a training dataset of size $m$ with instances $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ drawn from distribution $p$, the log likelihood of the dataset is given by:

$$\ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) \tag{0}$$
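To make the supervised case concrete, here is a minimal sketch (the dataset and component labels are hypothetical, chosen only for illustration) showing that when $z$ is observed, the maximum likelihood estimates are closed-form: class priors are relative frequencies, and per-class Gaussian means are sample means.

```python
# Hypothetical labeled dataset: x values with known component labels z.
# When z is observed, maximum likelihood estimation needs no EM:
# the prior of each class is its relative frequency, and the Gaussian
# mean of each class is the sample mean of its points.
data = [(0.1, 0), (-0.2, 0), (0.3, 0), (4.9, 1), (5.2, 1), (5.1, 1)]

def supervised_mle(data):
    groups = {}
    for x, z in data:
        groups.setdefault(z, []).append(x)
    # Closed-form MLE: relative frequency and per-class sample mean.
    priors = {z: len(xs) / len(data) for z, xs in groups.items()}
    means = {z: sum(xs) / len(xs) for z, xs in groups.items()}
    return priors, means

priors, means = supervised_mle(data)
print(priors)  # {0: 0.5, 1: 0.5}
print(means)
```

With hidden $z$, these closed forms are unavailable, which is exactly the gap EM fills.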

Our objective is to maximize (0) by adjusting $\theta$. When $Z$ is hidden, the likelihood of each instance marginalizes over the hidden variable, so a sum over $z^{(i)}$ appears inside the log:

$$\ell(\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \tag{1}$$

Derivation from Jensen’s inequality

Starting from (1), the derivation using Jensen's inequality adds a "helper" distribution $Q_i$ over the values of $z^{(i)}$ such that

$$\ell(\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \tag{2}$$

$$\ell(\theta) \geq \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \tag{3}$$

(2) is greater than or equal to (3) because Jensen's inequality states that $f(\mathbb{E}[Y]) \geq \mathbb{E}[f(Y)]$ if $f$ is a concave function (for more details, see Andrew's note or Wikipedia). In our case, $f$ is $\log$, which is a concave function, and $Y$ is $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$, with the expectation taken over $z^{(i)} \sim Q_i$. Using (3) as a lower bound of (2), we want to raise (3) as close as possible to (2), so that (3) $=$ (2), to maximize the likelihood. For this, Jensen's inequality also states that when $Y$ is constant, $f(\mathbb{E}[Y]) = \mathbb{E}[f(Y)]$. Therefore, we can construct $Q_i$ such that $Y$ is equal to some constant $c$, as follows:

$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c \tag{4}$$
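Both the inequality and its equality condition can be checked numerically. A quick sketch (the distribution of $Y$ here is an arbitrary choice, purely for illustration):

```python
import math
import random

random.seed(0)

# Draw samples of a non-constant positive Y.
ys = [random.uniform(0.5, 2.0) for _ in range(10000)]

lhs = math.log(sum(ys) / len(ys))              # f(E[Y]) with f = log (concave)
rhs = sum(math.log(y) for y in ys) / len(ys)   # E[f(Y)]
assert lhs >= rhs  # Jensen: log E[Y] >= E[log Y]

# Equality holds when Y is constant:
const = [1.7] * 1000
assert math.isclose(math.log(sum(const) / len(const)),
                    sum(math.log(y) for y in const) / len(const))
```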

If we assume each instance $x^{(i)}$ has the same corresponding constant $c$, then we have

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{c} \tag{5}$$

Given that $Q_i$ is a probability distribution and $\sum_{z} Q_i(z) = 1$, we have

$$c = \sum_{z} p(x^{(i)}, z; \theta) = p(x^{(i)}; \theta) \tag{6}$$

Therefore, $Q_i(z^{(i)})$ can be written as follows:

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta) \tag{7}$$

Expectation Maximization

(7) is our E step: for each instance, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ under the current parameters $\theta$.

Using (7) to replace $Q_i$ in (3), we obtain our M step, assuming $Q_i$ is held constant:

$$\theta := \arg\max_{\theta} \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \tag{8}$$
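Putting the two steps together, here is a minimal EM sketch for a two-component 1-D Gaussian mixture (the data, initialization, and iteration count are illustrative assumptions, not from the note): the E step computes the posterior responsibilities of (7), and the M step applies the closed-form weighted updates that maximize the lower bound (8) for Gaussians.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, n_iter=50):
    # Hypothetical initialization: means at the data extremes,
    # uniform mixing weights, unit variances.
    pi, mu, var = [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0]
    for _ in range(n_iter):
        # E step (eq. 7): Q_i(z) = p(z | x_i; theta),
        # the posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            joint = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step (eq. 8): with Q_i fixed, the maximizing parameters are
        # responsibility-weighted averages (closed form for Gaussians).
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return pi, mu, var

xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
pi, mu, var = em_gmm_1d(xs)
```

On this toy data the two means converge near the two cluster centers, with roughly equal mixing weights.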


For (5), the assumption that every instance shares the same constant $c$ seems necessary, but it is ignored in Andrew's note and some other sources. How does such an assumption affect the construction of $Q_i$?