#### Abstract

In this post we introduce an alternative view on Expectation Maximization using KL-divergence by Jianlin from https://kexue.fm. Unlike the common view on EM using Jensen’s inequality, the derivation of EM using KL-divergence is shorter and more intuitive.

#### Reference

Jianlin’s post (written in Chinese)

### KL divergence and maximum likelihood estimation

From our previous post about EM, we know that EM’s objective is to maximizing the expected log likelihood of the training dataset. In here, we want to expand the concept to show that such optimization is a special case of a more general form of optimization: minimizing the KL divergence. Given two probability distribution $p(x)$ and $q(x)$, KL measure how close $q(x)$ is to $p(x)$

To derive EM using the concept of KL divergence, let’s say we want to use $p_{\theta}(x)$ to approximate the underlying distribution of the training dataset $p_{true}(x)$. Given that $p_{true}(x)$ is known (as constant), minimizing the KL divergence is equivalent to maximizing the expected log likelihood of the training dataset.

where (0) is the objective of the general EM algorithm.

### KL divergence of joint probability

Incorporating the hidden variable Z that are used to explain X, we can rewrite our objective using the joint probability of X and Z

### Expectation maximization from KL divergence

To minimize (1), there are two terms we can adjust: $p_{true}(z | x)$ and $p_{\theta}(z|x)p_{\theta}(x)$. Similar to EM, we can optimize one part assuming the other part is constant and we can do this interchangeably. So, if we assume $p_{\theta}(z|x)p_{\theta}(x)$ is known, $p_{true}(z | x)$ can be adjust to minimize (1) as follow

from (2) to (E-step), we need to use the fact that ${KL}(p_{\theta}(z \vert x) \parallel p_{\theta}(z \vert x)) = 0$. For M step, we assume $p_{true}(z \vert x)$ is constant and is equal to $p_{\theta}(z \vert x)$

The above results are identical to the defintion of EM algorithm.