In this post we introduce an alternative view on Expectation Maximization using KL-divergence by Jianlin from https://kexue.fm. Unlike the common view on EM using Jensen’s inequality, the derivation of EM using KL-divergence is shorter and more intuitive.


Jianlin’s post (written in Chinese)

KL divergence and maximum likelihood estimation

From our previous post about EM, we know that EM’s objective is to maximizing the expected log likelihood of the training dataset. In here, we want to expand the concept to show that such optimization is a special case of a more general form of optimization: minimizing the KL divergence. Given two probability distribution and , KL measure how close is to

To derive EM using the concept of KL divergence, let’s say we want to use to approximate the underlying distribution of the training dataset . Given that is known (as constant), minimizing the KL divergence is equivalent to maximizing the expected log likelihood of the training dataset.

where (0) is the objective of the general EM algorithm.

KL divergence of joint probability

Incorporating the hidden variable Z that are used to explain X, we can rewrite our objective using the joint probability of X and Z

Expectation maximization from KL divergence

To minimize (1), there are two terms we can adjust: and . Similar to EM, we can optimize one part assuming the other part is constant and we can do this interchangeably. So, if we assume is known, can be adjust to minimize (1) as follow

from (2) to (E-step), we need to use the fact that . For M step, we assume is constant and is equal to

The above results are identical to the defintion of EM algorithm.