Invitation to Information Theory

(pure-math oriented)

Yes, information theory is a discipline in applied math and theoretical CS, so pure-math people (like me, sometimes) might ignore it. However, after a lecture given by Prof. Gohari (at ITCSC, CUHK), I was surprised by the applications of information theory to pure mathematics. Note that what follows is not so deep, and not so number-theoretic, but it is still beautiful in some sense.

Furthermore, I thank Prof. Gohari and Prof. Nair (both at ITCSC, CUHK) for the lectures on information theory, and I thank Prof. Fakcharoenphol (at KU, in Thailand), who recommended the ITCSC Summer Research Program to me.

I also used the following notes as references (for my self-study and better understanding of the subject).

Now, let's dig in!

Problem. Given a set of $n$ distinct points in $\mathbb{R}^3$, we project the points onto the $XY$-plane, $XZ$-plane, and $YZ$-plane, one by one. Let $n_{XY}, n_{XZ}, n_{YZ}$ be the number of distinct projected points (we identify the duplicates). Prove that $n_{XY} n_{YZ} n_{XZ} \geq n^2$.

Example. The figures below show the case of the points $\{(0.2, 0.3, 0.8), (0.5, 0.7, 0.2), (0.4, 0.4, 0.8), (0.6, 0.7, 0.5), (0.4, 0.2, 0.8), (0.2, 0.3, 0.9)\}$, denoted by black crosses, with their projections in red, green, and blue. In this case, we have $n_{XY} = 5$, $n_{XZ} = 5$, $n_{YZ} = 6$, and $n = 6$.

So we have $150 = n_{XY} n_{YZ} n_{XZ} \geq n^2 = 36$.
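If you want to double-check these counts, here is a minimal Python sketch (my own, not from the lecture) that computes the three projection counts for this particular point set.

```python
# Count the distinct projections of the six example points onto each coordinate plane.
points = [(0.2, 0.3, 0.8), (0.5, 0.7, 0.2), (0.4, 0.4, 0.8),
          (0.6, 0.7, 0.5), (0.4, 0.2, 0.8), (0.2, 0.3, 0.9)]

n = len(set(points))                          # number of distinct points
n_xy = len({(x, y) for x, y, z in points})    # drop the Z coordinate
n_xz = len({(x, z) for x, y, z in points})    # drop the Y coordinate
n_yz = len({(y, z) for x, y, z in points})    # drop the X coordinate

print(n_xy, n_xz, n_yz, n)                    # 5 5 6 6
print(n_xy * n_yz * n_xz >= n ** 2)           # True: 150 >= 36
```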

You can pause and ponder before continuing to the rest of the article.

Information

Suppose we have a fair coin, giving either H or T, with probability $\frac{1}{2}$ each. What is the minimum number of bits needed to convey the outcome of a coin flip?

The easiest way is to let H be represented by the bit 0 and T by the bit 1. This gives one bit per coin flip.

However, suppose we have a biased coin, giving H with probability $\frac{1}{100}$ and T with probability $\frac{99}{100}$. What would be the minimum number of bits needed to convey the information?

Well, we observe that most of the time, the result is T. So we might say, let's encode TT as 0, TH as 100, HT as 101, and HH as 11.

Then, once we see TTHTTTTTTT, we only send 0101000, which is only 7 bits for 10 coin flips.

The question is: on average, how many bits are required?

It turns out that we can use an even better coding scheme to encode the result, and this line of thinking leads us to the entropy.

Definition. (Entropy) For a discrete random variable $X \colon \Omega \to E$ (where $E$ is countable) on a discrete probability space $(\Omega, \mathbb{P})$, we define the entropy of $X$, denoted $H(X)$, by $$H(X) = \sum_{\substack{x \in E \\ p_X(x) \ne 0}} p_X(x) \log_2 \left(\frac{1}{p_X(x)}\right)$$ where $p_X(x) = \mathbb{P}(X = x)$ is the probability mass function of $X$.

Intuitively, entropy tells you the average information gained by knowing a realization of the random variable.
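As a small illustration (a sketch of my own, not from the lecture), the definition translates directly into code; it confirms that the fair coin carries one bit per flip, while the biased coin above carries only about 0.08 bits per flip.

```python
from math import log2

def entropy(probs):
    """Shannon entropy (in bits) of a probability mass function given as a list."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit per flip
print(entropy([0.01, 0.99]))  # biased coin: about 0.0808 bits per flip
```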

Information from multiple things

Suppose $X$ and $Y$ are random variables on the same probability space. Then we may define $p_{XY}$ to be the joint probability mass function, i.e., $p_{XY}(x, y) = \mathbb{P}(X = x, Y = y)$.

If $X$ and $Y$ are independent, then $p_{XY}(x, y) = p_X(x) p_Y(y)$. But this is not true in general! However, observe that $p_X(x)$ can always be expressed as $$p_X(x) = \sum_{y \in Y(\Omega)} p_{XY}(x, y).$$ This is because, intuitively, the probability that $X$ takes the value $x$ is the sum of the probabilities of all the scenarios, whatever $y$ happens to be.

Now, we extend the study of the basic joint probability to the study of joint entropy.

Definition. (Joint entropy) $$H(X,Y) = \sum_{\substack{x \in X(\Omega) \\ y \in Y(\Omega) \\ p_{XY}(x, y) \ne 0}} p_{XY}(x, y) \log_2\left(\frac{1}{p_{XY}(x, y)}\right)$$
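Here is a small sketch (my own example, assuming the joint pmf is stored as a Python dict keyed by $(x, y)$) showing that the joint entropy equals $H(X) + H(Y)$ for two independent fair bits and drops below it when the bits are correlated, which is exactly the inequality we are about to prove.

```python
from math import log2

def joint_entropy(pxy):
    """Joint entropy H(X, Y) from a dict {(x, y): probability}."""
    return sum(p * log2(1 / p) for p in pxy.values() if p > 0)

# Two independent fair bits: H(X, Y) = 2 = H(X) + H(Y).
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(joint_entropy(independent))  # 2.0

# Fully correlated bits (Y = X): H(X, Y) = 1 < H(X) + H(Y) = 2.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
print(joint_entropy(correlated))   # 1.0
```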

Now, an interesting question is: what is the relationship between knowing $X$ alone, knowing $Y$ alone, and knowing both $X$ and $Y$?

If we want to convey the information about $(X, Y)$, one direct way is to convey the information about $X$ first, followed by the information about $Y$. This intuitively means $H(X,Y) \leq H(X) + H(Y)$: if $X$ is correlated with $Y$, we might find a better scheme that needs less than $H(X) + H(Y)$, but otherwise we can just send $X$ first and then $Y$, which costs $H(X) + H(Y)$. Let us present an actual proof next.

Proposition. For any random variables $X$ and $Y$ on the same discrete probability space,

$$H(X, Y) \leq H(X) + H(Y).$$

Proof. (incomplete) Let us expand the formula for $H$. We want to prove that $$\sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{1}{p_{XY}(x,y)}\right) \leq \sum_x p_X(x)\log_2\left(\frac{1}{p_X(x)}\right) + \sum_y p_Y(y)\log_2\left(\frac{1}{p_Y(y)}\right).$$ But $p_X(x) = \sum_y p_{XY}(x, y)$ and $p_Y(y) = \sum_x p_{XY}(x, y)$, so we actually want to prove $$\sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{1}{p_{XY}(x,y)}\right) \leq \sum_{x, y} p_{XY}(x, y)\left(\log_2\left(\frac{1}{p_X(x)}\right)+\log_2\left(\frac{1}{p_Y(y)}\right)\right),$$ which is $$0 \leq \sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\right).$$ Now, we denote the quantity $$\sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\right)$$ by $D(p_{XY} \parallel p_X p_Y)$, and we'll come back to it later.

OK, this looks like we're just complicating our lives. However, this $D(P \parallel Q)$ is another quantity that is interesting in its own right. We call it the Kullback-Leibler divergence, or KL divergence.

Kullback-Leibler divergence

Definition. (KL divergence) For finite sequences $P, Q \in \mathbb{R}^n$ (such that $Q_i = 0$ implies $P_i = 0$, and $\sum_{i=1}^n P_i = \sum_{i=1}^n Q_i = 1$), we denote by $D(P \parallel Q)$ the quantity $$\sum_{\substack{i=1 \\ Q_i \ne 0}}^n P_i \log_2\left(\frac{P_i}{Q_i}\right),$$ and for random variables $P, Q \colon \Omega \to E$ with countable $E$ such that "for any $y \in E$ with $p_Q(y) = 0$, $p_P(y) = 0$ also", we denote by $D(P \parallel Q)$ or $D(p_P \parallel p_Q)$ the quantity $$\sum_{\substack{x \in P(\Omega) \cup Q(\Omega) \\ p_Q(x) \ne 0}} p_P(x) \log_2\left(\frac{p_P(x)}{p_Q(x)}\right).$$

This scary notion of $D(P \parallel Q)$ looks overly complicated and useless. However, it comes with a very useful intuition: if we know $P$ as a template distribution, and someone now gives us $Q$, how much does $Q$ deviate from $P$? This is precisely what $D(P \parallel Q)$ answers. Note that I watched this video to learn about this.
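As a quick illustration (my own sketch; the distributions $P$ and $Q$ below are arbitrary examples), here is the finite-sequence version of the definition in code; note that $D(P \parallel P) = 0$, matching the "deviation" intuition.

```python
from math import log2

def kl_divergence(P, Q):
    """D(P || Q) in bits; assumes Q[i] == 0 implies P[i] == 0."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl_divergence(P, Q))  # about 0.737 bits
print(kl_divergence(P, P))  # 0.0: a distribution does not deviate from itself
```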

For our purpose now, what we want to prove is the following proposition.

Proposition. For any $P, Q$, $D(P \parallel Q) \geq 0$.

Proof. Recall the fact from basic analysis that $1+x \leq e^x$ for all $x \in \mathbb{R}$. (Analyze the minimum of $x \mapsto e^x - x - 1$, which is attained at $x = 0$; this gives $e^x - x - 1 \geq e^0 - 0 - 1 = 0$ for all $x \in \mathbb{R}$.) Taking the natural logarithm of both sides, we have $\ln(1+x) \leq x$ for all $x > -1$, or equivalently $\ln(x) \leq x-1$ for all $x > 0$. Now replace $x$ with $\frac{1}{x}$: then $\ln\left(\frac{1}{x}\right) \leq \frac{1}{x}-1$, that is, $\ln(x) \geq 1-\frac{1}{x}$ for all $x > 0$. Since $\log_2(x) = \frac{\ln(x)}{\ln 2}$ and $\ln 2 > 0$, we get $$D(P \parallel Q) = \sum_{i} P_i \log_2\left(\frac{P_i}{Q_i}\right) = \frac{1}{\ln 2}\sum_{i} P_i \ln\left(\frac{P_i}{Q_i}\right) \geq \frac{1}{\ln 2}\sum_{i} P_i \left(1 - \frac{Q_i}{P_i}\right) = \frac{1}{\ln 2}\sum_i (P_i - Q_i) = \frac{1-1}{\ln 2} = 0.$$ This completes the proof. $\square$

Now, back to our proposition:

Proposition. For any random variables $X$ and $Y$ on the same discrete probability space,

$$H(X, Y) \leq H(X) + H(Y).$$

Proof. Recall that the statement is equivalent to $0 \leq D(p_{XY} \parallel p_X p_Y)$. Now apply the previous proposition that $D(P \parallel Q) \geq 0$ for any $P, Q$, and this completes the proof. $\square$

Mutual information

Definition. (Mutual information) We define $I(X;Y)$ to be $H(X) + H(Y) - H(X, Y)$. Of course, since we've just proved $H(X, Y) \leq H(X) + H(Y)$, this means $I(X;Y) \geq 0$ always.

What does this mean? As the name suggests, it is sometimes better to convey the information of $X$ and $Y$ together, since that takes an entropy of only $H(X,Y)$ instead of $H(X) + H(Y)$. The advantage (marginal benefit) of doing so is exactly $H(X) + H(Y) - H(X, Y)$, so we call this quantity the "mutual information".

Before going forward, there's another scenario. If someone wants to convey to us the information of $X$ and $Y$, they would need the entropy $H(X, Y)$. But what if we already know that the realization of $Y$ is $y$; what would be the remaining entropy? We denote this by $H(X | Y = y)$. (The actual definition is below, but what we mean should be intuitively obvious in this case.)

Question. Is it true that $H(X | Y = y) \leq H(X)$ for any $y \in Y(\Omega)$?

Answer. In general, no. Consider the following example:

| $X$ | $Y$ | $p_{XY}(x,y)$ |
| --- | --- | --- |
| $0$ | $0$ | $\frac{1}{6}$ |
| $0$ | $1$ | $0$ |
| $1$ | $0$ | $\frac{1}{6}$ |
| $1$ | $1$ | $\frac{2}{3}$ |

$H(X|Y = 0)$ is $1$. (Given $Y = 0$, $X$ takes the value $0$ with probability $\frac{1}{2}$ and $1$ with probability $\frac{1}{2}$.) Now what about $H(X)$? (Unconditionally, $X$ takes the value $0$ with probability $\frac{1}{6}$ and $1$ with probability $\frac{5}{6}$.) Hence the entropy $H(X)$ is $\frac{1}{6}\log_2(6) + \frac{5}{6}\log_2\left(\frac{6}{5}\right) \approx 0.65$, which is less than $1$.
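To double-check this counterexample numerically, here is a short sketch (my own) that recomputes both quantities from the joint pmf in the table.

```python
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Joint pmf from the table above, keyed by (x, y).
pxy = {(0, 0): 1/6, (0, 1): 0, (1, 0): 1/6, (1, 1): 2/3}

p_x = [sum(p for (x, y), p in pxy.items() if x == x0) for x0 in (0, 1)]
p_y0 = sum(p for (x, y), p in pxy.items() if y == 0)
p_x_given_y0 = [pxy[(x0, 0)] / p_y0 for x0 in (0, 1)]

print(entropy(p_x_given_y0))  # H(X | Y = 0) = 1.0
print(entropy(p_x))           # H(X) ~= 0.65, so conditioning on Y = 0 increased the entropy
```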

However, a very nice observation is that $H(X|Y) \leq H(X)$ in general, where $H(X|Y)$ is the average of $H(X|Y = y)$ over all $y$. Consider the following definition.

Definition. (Conditional entropy)

$$H(X|Y) = \sum_{y \in Y(\Omega)} p_Y(y) H(X|Y=y)$$ where $$H(X|Y=y) = \sum_{x \in X(\Omega)} p_{X|Y}(x|y) \log_2\left(\frac{1}{p_{X|Y}(x|y)}\right)$$ and where $p_{X|Y}(x|y)$ is the conditional probability $\mathbb{P}(X = x | Y = y)$, defined by $\frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)} = \frac{p_{XY}(x,y)}{p_Y(y)}$.

Now, expanding $H(X|Y)$, we have $$H(X|Y) = \sum_{y \in Y(\Omega)} p_Y(y) \sum_{x \in X(\Omega)} p_{X|Y}(x|y) \log_2\left(\frac{1}{p_{X|Y}(x|y)}\right),$$ and since $p_Y(y) p_{X|Y}(x|y) = p_{XY}(x,y)$, this is $$H(X|Y) = -\sum_{x, y} p_{XY}(x,y) \log_2\left(p_{X|Y}(x|y)\right).$$

Now, let us consider another important proposition.

Proposition. $H(X|Y) \leq H(X)$ in general.

Proof. We want to show that $H(X) - H(X|Y) \geq 0$. Consider that $$H(X) - H(X|Y) = \sum_{x, y} p_{XY}(x, y) \left(\log_2\left(\frac{1}{p_X(x)}\right) + \log_2\left(p_{X|Y}(x|y)\right)\right) \\ = \sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{p_{X|Y}(x|y)}{p_X(x)}\right) \\ = \sum_{x, y} p_{XY}(x, y)\log_2\left(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\right) \\ = D(p_{XY} \parallel p_X p_Y) \geq 0. \qquad \square$$ Actually, we've proved more than just $H(X|Y) \leq H(X)$: we've shown that $H(X) - H(X|Y) = D(p_{XY} \parallel p_X p_Y)$. But recall that this quantity is also $H(X) + H(Y) - H(X, Y)$, which is $I(X;Y)$. So we have another nice representation of $I(X;Y)$.

Corollary. (proved) The following quantities are all equal: $I(X;Y)$, $H(X) + H(Y) - H(X,Y)$, $H(X) - H(X|Y)$, and $D(p_{XY} \parallel p_X p_Y)$.

Also, since $H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)$, we have $H(X|Y) = H(X,Y) - H(Y)$.
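Here is a short numerical check (my own sketch, reusing the joint pmf from the earlier counterexample) that the expressions in the corollary all give the same value of $I(X;Y)$.

```python
from math import log2

def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

pxy = {(0, 0): 1/6, (0, 1): 0, (1, 0): 1/6, (1, 1): 2/3}
p_x = {x0: sum(p for (x, y), p in pxy.items() if x == x0) for x0 in (0, 1)}
p_y = {y0: sum(p for (x, y), p in pxy.items() if y == y0) for y0 in (0, 1)}

h_x, h_y, h_xy = H(p_x.values()), H(p_y.values()), H(pxy.values())
h_x_given_y = h_xy - h_y  # H(X|Y) = H(X,Y) - H(Y)
kl = sum(p * log2(p / (p_x[x] * p_y[y]))
         for (x, y), p in pxy.items() if p > 0)

# All three expressions for I(X;Y) agree (about 0.317 bits, up to floating point).
print(h_x + h_y - h_xy)
print(h_x - h_x_given_y)
print(kl)
```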

So far, we've introduced a lot of basic concepts from information theory. But the actual lemma we want to consider is coming next.

The lemma

Lemma. $H(X,Y) + H(Y,Z) + H(X,Z) \geq 2H(X,Y,Z)$.

This is a generalization of the inequality $H(X) + H(Y) \geq H(X,Y)$ to the case of three variables, and it is also a special case of Shearer's inequality (beyond the scope of what I know). Let us prove it.

Proof. We want to show that $H(X,Y) + H(Y,Z) + H(X,Z) - 2H(X,Y,Z) \geq 0$. Consider that $H(X,Y) + H(Z) - H(X,Y,Z) = I(X,Y;Z)$ and $H(Y,Z) + H(X) - H(X,Y,Z) = I(X;Y,Z)$. This means we actually want to prove $$I(X,Y;Z) - H(Z) + I(X;Y,Z) - H(X) + H(X,Z) \geq 0.$$ But furthermore, $H(X) + H(Z) - H(X,Z) = I(X;Z)$, so what we really want to prove is $$I(X,Y;Z) + I(X;Y,Z) \geq I(X;Z).$$ This is now intuitive: the mutual information between $(X,Y)$ and $Z$ plus the mutual information between $X$ and $(Y,Z)$ should be at least the mutual information between $X$ and $Z$. In fact, $I(X;Y,Z)$ alone is already greater than or equal to $I(X;Z)$. We rewrite $$I(X;Y,Z) - I(X;Z) = H(X) + H(Y,Z) - H(X,Y,Z) - (H(X) + H(Z) - H(X,Z)) \\ = H(Y,Z) - H(Z) + H(X,Z) - H(X,Y,Z) \\ = H(Y|Z) - H(Y|X,Z) \geq 0,$$ where the final inequality is "conditioning reduces entropy" again: for each fixed $z$, the proposition above (applied to the probability space conditioned on $Z = z$) gives $H(Y|X, Z=z) \leq H(Y|Z=z)$, and averaging over $z$ yields $H(Y|X,Z) \leq H(Y|Z)$. This completes the proof. $\square$
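The lemma is also easy to test numerically. Here is a small sketch (my own) that checks it on a randomly generated joint pmf over $\{0,1\}^3$; rerunning with different seeds gives the same verdict.

```python
import itertools
import random
from math import log2

def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

def marginal(pxyz, keep):
    """Marginal pmf over the coordinates listed in `keep`, e.g. keep=(0, 1) for (X, Y)."""
    out = {}
    for triple, p in pxyz.items():
        key = tuple(triple[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# A random joint pmf on {0, 1}^3.
random.seed(0)
weights = [random.random() for _ in range(8)]
total = sum(weights)
pxyz = {t: w / total for t, w in zip(itertools.product((0, 1), repeat=3), weights)}

lhs = sum(H(marginal(pxyz, keep).values()) for keep in [(0, 1), (1, 2), (0, 2)])
rhs = 2 * H(pxyz.values())
print(lhs >= rhs - 1e-12)  # True: H(X,Y) + H(Y,Z) + H(X,Z) >= 2 H(X,Y,Z)
```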

Actually, we need one more final ingredient. Consider the following theorem.

Theorem. $H(X) \leq \log_2(|X(\Omega)|)$.

Proof. $$H(X) = \sum_{x \in X(\Omega)} p_X(x) \log_2\left(\frac{1}{p_X(x)}\right) = \mathbb{E} \left[\log_2\left(\frac{1}{p_X(X)}\right)\right] \leq \underbrace{\log_2 \mathbb{E}\left[\frac{1}{p_X(X)}\right]}_{\text{using Jensen's inequality}} \leq \log_2 |X(\Omega)|.$$ The last step holds because $\mathbb{E}\left[\frac{1}{p_X(X)}\right] = \sum_{x : p_X(x) \ne 0} p_X(x) \cdot \frac{1}{p_X(x)}$ is just the number of values of $X$ with positive probability, which is at most $|X(\Omega)|$. $\square$

The Problem

Recall our problem from the beginning.

Problem. Given a set of $n$ distinct points in $\mathbb{R}^3$, we project the points onto the $XY$-plane, $XZ$-plane, and $YZ$-plane, one by one. Let $n_{XY}, n_{XZ}, n_{YZ}$ be the number of distinct projected points (we identify the duplicates). Prove that $n_{XY} n_{YZ} n_{XZ} \geq n^2$.

Observe that the statement $n_{XY} n_{YZ} n_{XZ} \geq n^2$ is very similar to $H(X,Y) + H(Y,Z) + H(X, Z) \geq 2H(X,Y,Z)$. How about we take the logarithm of both sides? We now want to prove that $$\log_2 n_{XY} + \log_2 n_{YZ} + \log_2 n_{XZ} \geq 2 \log_2 n.$$

Consider the list of given $n$ points, and let us write them as $(x_i, y_i, z_i)$ for $i \in \{1, \dots, n\}$. Define a probability space on $\Omega = \{1, \dots, n\}$ and $X \colon \Omega \to \{x_1, \dots, x_n\}$ given by $X(i) = x_i$; we do the same for $Y$ and $Z$. Now let us equip $\Omega$ with the uniform probability measure $\mathbb{P}$, i.e., $\mathbb{P}(\{i\}) = \frac{1}{n}$ for all $i \in \{1, \dots, n\}$. We can now apply $H(X,Y) + H(Y,Z) + H(X,Z) \geq 2H(X,Y,Z)$ directly. Before that, applying the last theorem, we have $H(X, Y) \leq \log_2(n_{XY})$, $H(Y,Z) \leq \log_2(n_{YZ})$, and $H(X,Z) \leq \log_2(n_{XZ})$, since $n_{XY}$ is exactly the number of distinct values of $(X, Y)$, and similarly for the other pairs. Now expand $$H(X,Y,Z) = \sum_{x,y,z} p_{XYZ}(x,y,z) \log_2\left(\frac{1}{p_{XYZ}(x,y,z)}\right) = \sum_{i=1}^n \frac{1}{n} \log_2(n) = \log_2(n),$$ since the $n$ points are distinct, so $(X,Y,Z)$ is uniform over $n$ values. Hence, we have $$\log_2(n_{XY}) + \log_2(n_{YZ}) + \log_2(n_{XZ}) \geq H(X,Y)+H(Y,Z)+H(X,Z) \geq 2H(X,Y,Z) = 2\log_2(n).$$ Exponentiating both sides, we have $$n_{XY} n_{YZ} n_{XZ} \geq n^2,$$ which completes the proof. $\square$
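To make the argument concrete, here is a sketch (my own) that instantiates the chain of inequalities on the example point set from the beginning of the article.

```python
from math import log2

points = [(0.2, 0.3, 0.8), (0.5, 0.7, 0.2), (0.4, 0.4, 0.8),
          (0.6, 0.7, 0.5), (0.4, 0.2, 0.8), (0.2, 0.3, 0.9)]
n = len(points)

def H_of_projection(keep):
    """Entropy of the projection of a uniformly random point onto the given coordinates."""
    counts = {}
    for p in points:
        key = tuple(p[i] for i in keep)
        counts[key] = counts.get(key, 0) + 1
    return sum((c / n) * log2(n / c) for c in counts.values())

h_xy, h_yz, h_xz = H_of_projection((0, 1)), H_of_projection((1, 2)), H_of_projection((0, 2))
n_xy = len({(x, y) for x, y, z in points})
n_yz = len({(y, z) for x, y, z in points})
n_xz = len({(x, z) for x, y, z in points})

# The chain of inequalities from the proof, instantiated on the example:
print(log2(n_xy) + log2(n_yz) + log2(n_xz))  # ~7.23, which is >=
print(h_xy + h_yz + h_xz)                    # ~7.09, which is >=
print(2 * log2(n))                           # ~5.17 = 2 H(X,Y,Z), since the points are distinct
```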

Final thoughts

One may argue that the tools given here (from information theory) are just "shorthand" or "syntactic sugar" for other elementary tools. While that is somewhat legitimate, I'd say that much of deep mathematics is also just a sequence of shorthands. Information theory provides notation and basic results that can be interpreted intuitively, e.g., $H(X)$ as the amount of information needed to convey the result of $X$, $D(P \parallel Q)$ as how much one distribution deviates from another, and so on. And these can be quite useful in purer or more theoretical parts of mathematics and computer science, where we might otherwise lack direct intuition.

All the above content is just a summary of three one-hour lectures, so there is a lot more to learn in information theory. I hope I will have the chance to study the subject further when I have time, and I hope this article gives the reader some insight into information theory.