The notion of Divergence has huge implications in advanced Machine Learning and Neural Networks. Arguably the most influential deep learning model of the last decade, GAN (Generative Adversarial Networks), does optimization based on the idea of a divergence.

What is a divergence? You can imagine it as a sort of relative distance. Let’s say you have two probability distributions \(P\) and \(Q\). We want to measure how “distant” is \(Q\) from \(P\). A better word choice, is how deviant or “far-off” it is from \(P\). It is quite similar to a metric but with one significant difference, that it isn’t symmetric. In essence, divergence is a binary function \(D : (P , Q) \to \mathbb{R} \) or denoted as \(D(P | Q)\) which satisfies the following properties

- \(D(P | Q) \geq 0\)
- \(D(P | Q) = 0 \iff P = Q\)

That’s it. Note that when the distributions are approximately equal, the divergence approximates to 0.

One of the widely used divergence in machine learning is the Kullback-Leibler divergence ( Kullback, S.; Leibler, R.A. (1951)). It is also commonly known as relative entropy. This specific divergence is defined as follows

\[D_{KL} (P || Q) = \sum_{x \in X} P(x) \log(\frac{P(x)}{Q(x)}) \]

in discrete probability space.

How do we interpret this formulation with what we know about divergence. Consider the case when \(P(x)\) and \(Q(x)\) are the same. Since their ration comes inside the \(\log\) term, it leads to the term being \(0\) and hence the entire sum being \(0\). So we’re good as far as the second criteria is considered.

The natural question is this divergence non-negative at all times? To answer that, it is helpful to consider a larger family of such divergences known as the f-divergences

The f-divergence family is characterized as follows – given a convex function \(f : (0,\infty) \to \mathbb{R}\) satisfying \(f(1) = 0\)

\[D_{f} (P || Q) = \int{} f(\frac{p(x)}{q(x)})q(x)dx\]

We notice that if \(f(x) = x \log x\) this becomes the Kullback-Leibler divergence. So if the general family is non-negative at all times, it would imply that KL divergence would also exhibit the same property. Here we can use the Jensen’s inequality (Jensen, J. L. W. V. (1906))stated in terms of probability theory. It goes like – if \(X\) is a random variable and \(\phi\) is a convex function, then

\[ \phi(\mathbb{E}[X]) \leq \mathbb{E}[\phi(X)]\]

If you look at it, you’ll notice that the right-hand term is essentially the integral. Moreover, since both \(P\) and \(Q\) are probability distributions which sum to 1, the left-hand term becomes \(\phi(1) = \). Hence, it is non-negative.

### References

- Kullback, S.; Leibler, R.A. (1951). “On information and sufficiency”.
*Annals of Mathematical Statistics*.**22**(1): 79–86. doi:10.1214/aoms/1177729694. JSTOR 2236703. MR 0039968. - Jensen, J. L. W. V. (1906). “Sur les fonctions convexes et les inégalités entre les valeurs moyennes”.
*Acta Mathematica*.**30**(1): 175–193. doi:10.1007/BF02418571.