In this post we introduce a common view of Expectation Maximization (EM) using Jensen's inequality.
EM and maximum likelihood estimation
The goal of EM is to maximize the likelihood of observed variable $X$ explained by hidden variable $Z$ under a model with parameters $\theta$. Note that when $Z$ is known (usually as labels), this problem becomes a supervised learning problem and can be solved directly using maximum likelihood estimation instead of EM. For a training dataset of size $m$ with instances $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ drawn from distribution $p(x; \theta)$, the log likelihood of the dataset is given by:

$$\ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) \tag{0}$$

Our objective is to maximize (0) by adjusting $\theta$. Since $X$ is explained by the hidden variable $Z$, we can rewrite (0) by marginalizing over $Z$:

$$\ell(\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \tag{1}$$
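As a quick numerical illustration (a toy sketch with assumed data and parameters, not from the original post), the marginal log likelihood over a hidden component $z^{(i)}$ can be evaluated directly for a two-component 1-D Gaussian mixture:

```python
import numpy as np

def log_likelihood(x, pi, mu, sigma):
    # p(x^(i), z; theta) = pi[z] * N(x^(i); mu[z], sigma[z]);
    # sum the joint over z for each instance, then take the log and sum over i.
    joint = np.array([
        pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
              / (sigma[k] * np.sqrt(2 * np.pi))
        for k in range(len(pi))
    ])  # shape (K, m): joint density per component and instance
    return float(np.log(joint.sum(axis=0)).sum())

# Hypothetical data and parameters, chosen only for illustration.
x = np.array([-2.1, -1.9, 2.0, 2.2])
ll = log_likelihood(x, pi=[0.5, 0.5], mu=[-2.0, 2.0], sigma=[1.0, 1.0])
print(ll)
```

When $Z$ were observed, we could drop the inner sum and maximize the joint log likelihood directly, which is the supervised case mentioned above.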
Derivation from Jensen’s inequality
Starting from (1), the derivation using Jensen's inequality introduces a "helper" distribution $Q_i(z^{(i)})$ over the values of $z^{(i)}$ such that

$$\ell(\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \tag{2}$$

$$\geq \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \tag{3}$$
(2) is greater than or equal to (3) because Jensen's inequality states that $f(\mathrm{E}[Y]) \geq \mathrm{E}[f(Y)]$ if $f$ is a concave function (for more details, see Andrew's note or Wikipedia). In our case, $f$ is $\log$, which is a concave function, and $Y$ is $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$ with $z^{(i)}$ drawn from $Q_i$. Using (3) as a lower bound of (2), we want to raise (3) as close as possible to (2) so that (3) $=$ (2), maximizing the likelihood. For this, Jensen's inequality also states that $f(\mathrm{E}[Y]) = \mathrm{E}[f(Y)]$ when $Y$ is constant. Therefore, we can construct $Q_i$ such that $Y$ is equal to some constant $c$ as follows:

$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c \tag{4}$$
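To make the inequality concrete, here is a small numerical check (my own sketch, not from the post) that $\log(\mathrm{E}[Y]) \geq \mathrm{E}[\log Y]$ for a positive random variable $Y$, with equality when $Y$ is constant:

```python
import numpy as np

rng = np.random.default_rng(0)
# Y: an arbitrary positive random variable; log is concave on (0, inf).
y = rng.uniform(0.5, 3.0, size=10_000)

lhs = np.log(y.mean())   # f(E[Y])
rhs = np.log(y).mean()   # E[f(Y)]
print(lhs >= rhs)        # Jensen: f(E[Y]) >= E[f(Y)] for concave f

# Equality holds when Y is constant, which is what (4) exploits.
const = np.full(10, 2.0)
print(np.isclose(np.log(const.mean()), np.log(const).mean()))
```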
If we assume each instance $i$ has the same corresponding constant $c$, then we have

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{c} \tag{5}$$
Given that $Q_i$ is a probability distribution and $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$, summing (5) over $z^{(i)}$ gives

$$c = \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \tag{6}$$
Therefore, $Q_i(z^{(i)})$ can be written as follows:

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta) \tag{7}$$

(7) is our E step.
Using (7) to replace $Q_i(z^{(i)})$ in (3), we obtain our M step, assuming $Q_i(z^{(i)})$ is held constant:

$$\theta := \arg\max_{\theta} \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \tag{8}$$
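Putting the two steps together, here is a minimal EM sketch for a two-component 1-D Gaussian mixture (an assumed toy model; the closed-form M-step updates are the standard ones for Gaussian mixtures, not derived in this post):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two well-separated 1-D Gaussian clusters (illustrative only).
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
m = len(x)

# Initial guesses for theta = (pi, mu, sigma).
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E step, eq. (7): Q_i(z) = p(z | x_i; theta), the posterior responsibilities.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
              / (sigma * np.sqrt(2 * np.pi))          # shape (m, 2)
    q = dens / dens.sum(axis=1, keepdims=True)

    # M step: maximize the lower bound over theta with Q held fixed
    # (closed form for Gaussian mixtures).
    nk = q.sum(axis=0)
    pi = nk / m
    mu = (q * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(sorted(mu))  # means should approach the cluster centers -2 and 3
```

Each iteration performs (7) followed by the maximization of the lower bound, and the likelihood is non-decreasing across iterations.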
For (5), this assumption (the same constant $c$ for every instance) seems necessary, but it is ignored in Andrew's note and some other sources. How does such an assumption affect the construction of $Q_i$?