Invitation to Information Theory

(pure-math oriented)

Yes, information theory is a discipline in applied math and theoretical CS, so pure-math people (like me, sometimes) might ignore it. However, after a lecture given by Prof. Gohari (at ITCSC, CUHK), I was surprised by the applications of information theory to pure mathematics. Note that what follows is not so deep, and not so number-theoretic, but it is still beautiful in some sense.

Furthermore, I thank Prof. Gohari and Prof. Nair (both at ITCSC, CUHK) for the lectures on information theory, and I thank Prof. Fakcharoenphol (at KU, in Thailand), who recommended the ITCSC Summer Research Program to me.

I also used the following notes as references (for my self-study and better understanding of the subject).

Now, let's dig in!

Problem. Given a set of $n$ distinct points in $\mathbb{R}^3$, we project the points onto the $XY$-plane, $XZ$-plane, and $YZ$-plane, one by one. Let $n_{XY}, n_{XZ}, n_{YZ}$ be the number of distinct projected points (we identify the duplicates). Prove that $n_{XY} n_{YZ} n_{XZ} \geq n^2$.

Example. The figures below show the case of the points $\{(0.2, 0.3, 0.8), (0.5, 0.7, 0.2), (0.4, 0.4, 0.8), (0.6, 0.7, 0.5), (0.4, 0.2, 0.8), (0.2, 0.3, 0.9)\}$, denoted by black crosses, with their projections in red, green, and blue. In this case, we have $n_{XY} = 5$, $n_{XZ} = 5$, $n_{YZ} = 6$, and $n = 6$.

So we have $150 = n_{XY} n_{YZ} n_{XZ} \geq n^2 = 36$.
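If you want to double-check these counts, here is a minimal Python sketch (my own, not from the lecture) that computes the three projection counts for this particular point set.

```python
# Count the distinct projections of the six example points onto each coordinate plane.
points = [(0.2, 0.3, 0.8), (0.5, 0.7, 0.2), (0.4, 0.4, 0.8),
          (0.6, 0.7, 0.5), (0.4, 0.2, 0.8), (0.2, 0.3, 0.9)]

n = len(set(points))                          # number of distinct points
n_xy = len({(x, y) for x, y, z in points})    # drop the Z coordinate
n_xz = len({(x, z) for x, y, z in points})    # drop the Y coordinate
n_yz = len({(y, z) for x, y, z in points})    # drop the X coordinate

print(n_xy, n_xz, n_yz, n)                    # 5 5 6 6
print(n_xy * n_yz * n_xz >= n ** 2)           # True: 150 >= 36
```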

You can pause and ponder before continuing to the rest of the article.

Information

Suppose we have a fair coin, giving either H or T, with probability $\frac{1}{2}$ each. What is the minimum number of bits needed to convey the outcome of a coin flip?

The easiest way is to let H be represented by the bit 0 and T by the bit 1. This gives one bit per coin flip.

However, suppose we have a biased coin, giving H with probability $\frac{1}{100}$ and T with probability $\frac{99}{100}$. What would be the minimum number of bits needed to convey the information?

Well, we observe that most of the time, the result is T. So we might say, let's encode TT as 0, TH as 100, HT as 101, and HH as 11.

Then, once we see TTHTTTTTTT, we only send 0101000, which is only 7 bits for 10 coin flips.

The question is: on average, how many bits are required?

It turns out that we can use an even better coding scheme to encode the result, and this line of thinking leads us to the entropy.

Definition. (Entropy) For a discrete random variable $X \colon \Omega \to E$ (where $E$ is countable) on a discrete probability space $(\Omega, \mathbb{P})$, we define the entropy of $X$, denoted $H(X)$, by $$H(X) = \sum_{\substack{x \in E \\ p_X(x) \ne 0}} p_X(x) \log_2 \left(\frac{1}{p_X(x)}\right)$$ where $p_X(x) = \mathbb{P}(X = x)$ is the probability mass function of $X$.

Intuitively, entropy tells you the average information gained by knowing a realization of the random variable.
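As a small illustration (a sketch of my own, not from the lecture), the definition translates directly into code; it confirms that the fair coin carries one bit per flip, while the biased coin above carries only about 0.08 bits per flip.

```python
from math import log2

def entropy(probs):
    """Shannon entropy (in bits) of a probability mass function given as a list."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit per flip
print(entropy([0.01, 0.99]))  # biased coin: about 0.0808 bits per flip
```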

Information from multiple things

Suppose $X$ and $Y$ are random variables on the same probability space. Then we may define $p_{XY}$ to be the joint probability mass function, i.e., $p_{XY}(x, y) = \mathbb{P}(X = x, Y = y)$.

If $X$ and $Y$ are independent, then $p_{XY}(x, y) = p_X(x) p_Y(y)$. But this is not true in general! However, observe that $p_X(x)$ can always be expressed as $$p_X(x) = \sum_{y \in Y(\Omega)} p_{XY}(x, y).$$ This is because, intuitively, the probability that $X$ takes the value $x$ is the sum of the probabilities of all the scenarios, whatever $y$ happens to be.

Now, we extend the study of the basic joint probability to the study of joint entropy.

Definition. (Joint entropy) $$H(X,Y) = \sum_{\substack{x \in X(\Omega) \\ y \in Y(\Omega) \\ p_{XY}(x, y) \ne 0}} p_{XY}(x, y) \log_2\left(\frac{1}{p_{XY}(x, y)}\right)$$
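Here is a small sketch (my own example, assuming the joint pmf is stored as a Python dict keyed by $(x, y)$) showing that the joint entropy equals $H(X) + H(Y)$ for two independent fair bits and drops below it when the bits are correlated, which is exactly the inequality we are about to prove.

```python
from math import log2

def joint_entropy(pxy):
    """Joint entropy H(X, Y) from a dict {(x, y): probability}."""
    return sum(p * log2(1 / p) for p in pxy.values() if p > 0)

# Two independent fair bits: H(X, Y) = 2 = H(X) + H(Y).
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(joint_entropy(independent))  # 2.0

# Fully correlated bits (Y = X): H(X, Y) = 1 < H(X) + H(Y) = 2.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
print(joint_entropy(correlated))   # 1.0
```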

Now, an interesting question is: what is the relationship between knowing $X$ alone, knowing $Y$ alone, and knowing both $X$ and $Y$?

If we want to convey the information about $(X, Y)$, one direct way is to convey the information about $X$ first, followed by the information about $Y$. This intuitively means $H(X,Y) \leq H(X) + H(Y)$: if $X$ is correlated with $Y$, we might find a better scheme that needs less than $H(X) + H(Y)$, but otherwise we can just send $X$ first and then $Y$, which costs $H(X) + H(Y)$. Let us present an actual proof next.

Proposition. For any random variables $X$ and $Y$ on the same discrete probability space,

$$H(X, Y) \leq H(X) + H(Y).$$

Proof. (incomplete) Let us expand the formula for $H$. We want to prove that $$\sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{1}{p_{XY}(x,y)}\right) \leq \sum_x p_X(x)\log_2\left(\frac{1}{p_X(x)}\right) + \sum_y p_Y(y)\log_2\left(\frac{1}{p_Y(y)}\right).$$ But $p_X(x) = \sum_y p_{XY}(x, y)$ and $p_Y(y) = \sum_x p_{XY}(x, y)$, so we actually want to prove $$\sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{1}{p_{XY}(x,y)}\right) \leq \sum_{x, y} p_{XY}(x, y)\left(\log_2\left(\frac{1}{p_X(x)}\right)+\log_2\left(\frac{1}{p_Y(y)}\right)\right),$$ which is $$0 \leq \sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\right).$$ Now, we denote the quantity $$\sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\right)$$ by $D(p_{XY} \parallel p_X p_Y)$, and we'll come back to it later.

OK, this looks like we're just complicating our lives. However, this $D(P \parallel Q)$ is another quantity that is interesting in its own right. We call it the Kullback-Leibler divergence, or KL divergence.

Kullback-Leibler divergence

Definition. (KL divergence) For finite sequences $P, Q \in \mathbb{R}^n$ (such that $Q_i = 0$ implies $P_i = 0$, and $\sum_{i=1}^n P_i = \sum_{i=1}^n Q_i = 1$), we denote by $D(P \parallel Q)$ the quantity $$\sum_{\substack{i=1 \\ Q_i \ne 0}}^n P_i \log_2\left(\frac{P_i}{Q_i}\right),$$ and for random variables $P, Q \colon \Omega \to E$ with countable $E$ such that "for any $y \in E$ with $p_Q(y) = 0$, $p_P(y) = 0$ also", we denote by $D(P \parallel Q)$ or $D(p_P \parallel p_Q)$ the quantity $$\sum_{\substack{x \in P(\Omega) \cup Q(\Omega) \\ p_Q(x) \ne 0}} p_P(x) \log_2\left(\frac{p_P(x)}{p_Q(x)}\right).$$

This scary notion of $D(P \parallel Q)$ looks overly complicated and useless. However, it comes with a very useful intuition: if we know $P$ as a template distribution, and someone now gives us $Q$, how much does $Q$ deviate from $P$? This is precisely what $D(P \parallel Q)$ answers. Note that I watched this video to learn about this.
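As a quick illustration (my own sketch; the distributions $P$ and $Q$ below are arbitrary examples), here is the finite-sequence version of the definition in code; note that $D(P \parallel P) = 0$, matching the "deviation" intuition.

```python
from math import log2

def kl_divergence(P, Q):
    """D(P || Q) in bits; assumes Q[i] == 0 implies P[i] == 0."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl_divergence(P, Q))  # about 0.737 bits
print(kl_divergence(P, P))  # 0.0: a distribution does not deviate from itself
```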

For our purpose now, what we want to prove is the following proposition.

Proposition. For any $P, Q$, $D(P \parallel Q) \geq 0$.

Proof. Recall the fact from basic analysis that $1+x \leq e^x$ for all $x \in \mathbb{R}$. (Analyze the minimum of $x \mapsto e^x - x - 1$, which is attained at $x = 0$; this gives $e^x - x - 1 \geq e^0 - 0 - 1 = 0$ for all $x \in \mathbb{R}$.) Taking the natural logarithm of both sides, we have $\ln(1+x) \leq x$ for all $x > -1$, or equivalently $\ln(x) \leq x-1$ for all $x > 0$. Now replace $x$ with $\frac{1}{x}$: then $\ln\left(\frac{1}{x}\right) \leq \frac{1}{x}-1$, that is, $\ln(x) \geq 1-\frac{1}{x}$ for all $x > 0$. Since $\log_2(x) = \frac{\ln(x)}{\ln 2}$ and $\ln 2 > 0$, we get $$D(P \parallel Q) = \sum_{i} P_i \log_2\left(\frac{P_i}{Q_i}\right) = \frac{1}{\ln 2}\sum_{i} P_i \ln\left(\frac{P_i}{Q_i}\right) \geq \frac{1}{\ln 2}\sum_{i} P_i \left(1 - \frac{Q_i}{P_i}\right) = \frac{1}{\ln 2}\sum_i (P_i - Q_i) = \frac{1-1}{\ln 2} = 0.$$ This completes the proof. $\square$

Now, back to our proposition:

Proposition. For any random variables $X$ and $Y$ on the same discrete probability space,

$$H(X, Y) \leq H(X) + H(Y).$$

Proof. Recall that the statement is equivalent to $0 \leq D(p_{XY} \parallel p_X p_Y)$. Now apply the previous proposition that $D(P \parallel Q) \geq 0$ for any $P, Q$, and this completes the proof. $\square$

Mutual information

Definition. (Mutual information) We define $I(X;Y)$ to be $H(X) + H(Y) - H(X, Y)$. Of course, since we've just proved $H(X, Y) \leq H(X) + H(Y)$, this means $I(X;Y) \geq 0$ always.

What does this mean? As the name suggests, it is sometimes better to convey the information of $X$ and $Y$ together, since that takes an entropy of only $H(X,Y)$ instead of $H(X) + H(Y)$. The advantage (marginal benefit) of doing so is exactly $H(X) + H(Y) - H(X, Y)$, so we call this quantity the "mutual information".

Before going forward, there's another scenario. If someone wants to convey to us the information of $X$ and $Y$, they would need the entropy $H(X, Y)$. But what if we already know that the realization of $Y$ is $y$; what would be the remaining entropy? We denote this by $H(X | Y = y)$. (The actual definition is below, but what we mean should be intuitively obvious in this case.)

Question. Is it true that $H(X | Y = y) \leq H(X)$ for any $y \in Y(\Omega)$?

Answer. In general, no. Consider the following example:

| $X$ | $Y$ | $p_{XY}(x,y)$ |
| --- | --- | --- |
| $0$ | $0$ | $\frac{1}{6}$ |
| $0$ | $1$ | $0$ |
| $1$ | $0$ | $\frac{1}{6}$ |
| $1$ | $1$ | $\frac{2}{3}$ |

$H(X|Y = 0)$ is $1$. (Given $Y = 0$, $X$ takes the value $0$ with probability $\frac{1}{2}$ and $1$ with probability $\frac{1}{2}$.) Now what about $H(X)$? (Unconditionally, $X$ takes the value $0$ with probability $\frac{1}{6}$ and $1$ with probability $\frac{5}{6}$.) Hence the entropy $H(X)$ is $\frac{1}{6}\log_2(6) + \frac{5}{6}\log_2\left(\frac{6}{5}\right) \approx 0.65$, which is less than $1$.
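To double-check this counterexample numerically, here is a short sketch (my own) that recomputes both quantities from the joint pmf in the table.

```python
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Joint pmf from the table above, keyed by (x, y).
pxy = {(0, 0): 1/6, (0, 1): 0, (1, 0): 1/6, (1, 1): 2/3}

p_x = [sum(p for (x, y), p in pxy.items() if x == x0) for x0 in (0, 1)]
p_y0 = sum(p for (x, y), p in pxy.items() if y == 0)
p_x_given_y0 = [pxy[(x0, 0)] / p_y0 for x0 in (0, 1)]

print(entropy(p_x_given_y0))  # H(X | Y = 0) = 1.0
print(entropy(p_x))           # H(X) ~= 0.65, so conditioning on Y = 0 increased the entropy
```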

However, a very nice observation is that $H(X|Y) \leq H(X)$ in general, where $H(X|Y)$ is the average of $H(X|Y = y)$ over all $y$. Consider the following definition.

Definition. (Conditional entropy)

$$H(X|Y) = \sum_{y \in Y(\Omega)} p_Y(y) H(X|Y=y)$$ where $$H(X|Y=y) = \sum_{x \in X(\Omega)} p_{X|Y}(x|y) \log_2\left(\frac{1}{p_{X|Y}(x|y)}\right)$$ and where $p_{X|Y}(x|y)$ is the conditional probability $\mathbb{P}(X = x | Y = y)$, defined by $\frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)} = \frac{p_{XY}(x,y)}{p_Y(y)}$.

Now, expanding $H(X|Y)$, we have $$H(X|Y) = \sum_{y \in Y(\Omega)} p_Y(y) \sum_{x \in X(\Omega)} p_{X|Y}(x|y) \log_2\left(\frac{1}{p_{X|Y}(x|y)}\right),$$ and since $p_Y(y) p_{X|Y}(x|y) = p_{XY}(x,y)$, this is $$H(X|Y) = -\sum_{x, y} p_{XY}(x,y) \log_2\left(p_{X|Y}(x|y)\right).$$

Now, let us consider another important proposition.

Proposition. $H(X|Y) \leq H(X)$ in general.

Proof. We want to show that $H(X) - H(X|Y) \geq 0$. Consider that $$H(X) - H(X|Y) = \sum_{x, y} p_{XY}(x, y) \left(\log_2\left(\frac{1}{p_X(x)}\right) + \log_2\left(p_{X|Y}(x|y)\right)\right) \\ = \sum_{x, y} p_{XY}(x, y) \log_2\left(\frac{p_{X|Y}(x|y)}{p_X(x)}\right) \\ = \sum_{x, y} p_{XY}(x, y)\log_2\left(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\right) \\ = D(p_{XY} \parallel p_X p_Y) \geq 0. \qquad \square$$ Actually, we've proved more than just $H(X|Y) \leq H(X)$: we've shown that $H(X) - H(X|Y) = D(p_{XY} \parallel p_X p_Y)$. But recall that this quantity is also $H(X) + H(Y) - H(X, Y)$, which is $I(X;Y)$. So we have another nice representation of $I(X;Y)$.

Corollary. (proved) The following quantities are all equal: $I(X;Y)$, $H(X) + H(Y) - H(X,Y)$, $H(X) - H(X|Y)$, and $D(p_{XY} \parallel p_X p_Y)$.

Also, since $H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)$, we have $H(X|Y) = H(X,Y) - H(Y)$.
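Here is a short numerical check (my own sketch, reusing the joint pmf from the earlier counterexample) that the expressions in the corollary all give the same value of $I(X;Y)$.

```python
from math import log2

def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

pxy = {(0, 0): 1/6, (0, 1): 0, (1, 0): 1/6, (1, 1): 2/3}
p_x = {x0: sum(p for (x, y), p in pxy.items() if x == x0) for x0 in (0, 1)}
p_y = {y0: sum(p for (x, y), p in pxy.items() if y == y0) for y0 in (0, 1)}

h_x, h_y, h_xy = H(p_x.values()), H(p_y.values()), H(pxy.values())
h_x_given_y = h_xy - h_y  # H(X|Y) = H(X,Y) - H(Y)
kl = sum(p * log2(p / (p_x[x] * p_y[y]))
         for (x, y), p in pxy.items() if p > 0)

# All three expressions for I(X;Y) agree (about 0.317 bits, up to floating point).
print(h_x + h_y - h_xy)
print(h_x - h_x_given_y)
print(kl)
```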

So far, we've introduced a lot of basic concepts from information theory. But the actual lemma we want to consider is coming next.

The lemma

Lemma. $H(X,Y) + H(Y,Z) + H(X,Z) \geq 2H(X,Y,Z)$.

This is a generalization of the inequality $H(X) + H(Y) \geq H(X,Y)$ to the case of three variables, and it is also a special case of Shearer's inequality (beyond the scope of what I know). Let us prove it.

Proof. We want to show that $H(X,Y) + H(Y,Z) + H(X,Z) - 2H(X,Y,Z) \geq 0$. Consider that $H(X,Y) + H(Z) - H(X,Y,Z) = I(X,Y;Z)$ and $H(Y,Z) + H(X) - H(X,Y,Z) = I(X;Y,Z)$. This means we actually want to prove $$I(X,Y;Z) - H(Z) + I(X;Y,Z) - H(X) + H(X,Z) \geq 0.$$ But furthermore, $H(X) + H(Z) - H(X,Z) = I(X;Z)$, so what we really want to prove is $$I(X,Y;Z) + I(X;Y,Z) \geq I(X;Z).$$ This is now intuitive: the mutual information between $(X,Y)$ and $Z$ plus the mutual information between $X$ and $(Y,Z)$ should be at least the mutual information between $X$ and $Z$. In fact, $I(X;Y,Z)$ alone is already greater than or equal to $I(X;Z)$. We rewrite $$I(X;Y,Z) - I(X;Z) = H(X) + H(Y,Z) - H(X,Y,Z) - (H(X) + H(Z) - H(X,Z)) \\ = H(Y,Z) - H(Z) + H(X,Z) - H(X,Y,Z) \\ = H(Y|Z) - H(Y|X,Z) \geq 0,$$ where the final inequality is "conditioning reduces entropy" again: for each fixed $z$, the proposition above (applied to the probability space conditioned on $Z = z$) gives $H(Y|X, Z=z) \leq H(Y|Z=z)$, and averaging over $z$ yields $H(Y|X,Z) \leq H(Y|Z)$. This completes the proof. $\square$
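The lemma is also easy to test numerically. Here is a small sketch (my own) that checks it on a randomly generated joint pmf over $\{0,1\}^3$; rerunning with different seeds gives the same verdict.

```python
import itertools
import random
from math import log2

def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

def marginal(pxyz, keep):
    """Marginal pmf over the coordinates listed in `keep`, e.g. keep=(0, 1) for (X, Y)."""
    out = {}
    for triple, p in pxyz.items():
        key = tuple(triple[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# A random joint pmf on {0, 1}^3.
random.seed(0)
weights = [random.random() for _ in range(8)]
total = sum(weights)
pxyz = {t: w / total for t, w in zip(itertools.product((0, 1), repeat=3), weights)}

lhs = sum(H(marginal(pxyz, keep).values()) for keep in [(0, 1), (1, 2), (0, 2)])
rhs = 2 * H(pxyz.values())
print(lhs >= rhs - 1e-12)  # True: H(X,Y) + H(Y,Z) + H(X,Z) >= 2 H(X,Y,Z)
```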

Actually, we need one more final ingredient. Consider the following theorem.

Theorem. $H(X) \leq \log_2(|X(\Omega)|)$.

Proof. $$H(X) = \sum_{x \in X(\Omega)} p_X(x) \log_2\left(\frac{1}{p_X(x)}\right) = \mathbb{E} \left[\log_2\left(\frac{1}{p_X(X)}\right)\right] \leq \underbrace{\log_2 \mathbb{E}\left[\frac{1}{p_X(X)}\right]}_{\text{using Jensen's inequality}} \leq \log_2 |X(\Omega)|.$$ The last step holds because $\mathbb{E}\left[\frac{1}{p_X(X)}\right] = \sum_{x : p_X(x) \ne 0} p_X(x) \cdot \frac{1}{p_X(x)}$ is just the number of values of $X$ with positive probability, which is at most $|X(\Omega)|$. $\square$

The Problem

Recall our problem from the beginning.

Problem. Given a set of $n$ distinct points in $\mathbb{R}^3$, we project the points onto the $XY$-plane, $XZ$-plane, and $YZ$-plane, one by one. Let $n_{XY}, n_{XZ}, n_{YZ}$ be the number of distinct projected points (we identify the duplicates). Prove that $n_{XY} n_{YZ} n_{XZ} \geq n^2$.

Observe that the statement $n_{XY} n_{YZ} n_{XZ} \geq n^2$ is very similar to $H(X,Y) + H(Y,Z) + H(X, Z) \geq 2H(X,Y,Z)$. How about we take the logarithm of both sides? We now want to prove that $$\log_2 n_{XY} + \log_2 n_{YZ} + \log_2 n_{XZ} \geq 2 \log_2 n.$$

Consider the list of given $n$ points, and let us write them as $(x_i, y_i, z_i)$ for $i \in \{1, \dots, n\}$. Define a probability space on $\Omega = \{1, \dots, n\}$ and $X \colon \Omega \to \{x_1, \dots, x_n\}$ given by $X(i) = x_i$; we do the same for $Y$ and $Z$. Now let us equip $\Omega$ with the uniform probability measure $\mathbb{P}$, i.e., $\mathbb{P}(\{i\}) = \frac{1}{n}$ for all $i \in \{1, \dots, n\}$. We can now apply $H(X,Y) + H(Y,Z) + H(X,Z) \geq 2H(X,Y,Z)$ directly. Before that, applying the last theorem, we have $H(X, Y) \leq \log_2(n_{XY})$, $H(Y,Z) \leq \log_2(n_{YZ})$, and $H(X,Z) \leq \log_2(n_{XZ})$, since $n_{XY}$ is exactly the number of distinct values of $(X, Y)$, and similarly for the other pairs. Now expand $$H(X,Y,Z) = \sum_{x,y,z} p_{XYZ}(x,y,z) \log_2\left(\frac{1}{p_{XYZ}(x,y,z)}\right) = \sum_{i=1}^n \frac{1}{n} \log_2(n) = \log_2(n),$$ since the $n$ points are distinct, so $(X,Y,Z)$ is uniform over $n$ values. Hence, we have $$\log_2(n_{XY}) + \log_2(n_{YZ}) + \log_2(n_{XZ}) \geq H(X,Y)+H(Y,Z)+H(X,Z) \geq 2H(X,Y,Z) = 2\log_2(n).$$ Exponentiating both sides, we have $$n_{XY} n_{YZ} n_{XZ} \geq n^2,$$ which completes the proof. $\square$
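To make the argument concrete, here is a sketch (my own) that instantiates the chain of inequalities on the example point set from the beginning of the article.

```python
from math import log2

points = [(0.2, 0.3, 0.8), (0.5, 0.7, 0.2), (0.4, 0.4, 0.8),
          (0.6, 0.7, 0.5), (0.4, 0.2, 0.8), (0.2, 0.3, 0.9)]
n = len(points)

def H_of_projection(keep):
    """Entropy of the projection of a uniformly random point onto the given coordinates."""
    counts = {}
    for p in points:
        key = tuple(p[i] for i in keep)
        counts[key] = counts.get(key, 0) + 1
    return sum((c / n) * log2(n / c) for c in counts.values())

h_xy, h_yz, h_xz = H_of_projection((0, 1)), H_of_projection((1, 2)), H_of_projection((0, 2))
n_xy = len({(x, y) for x, y, z in points})
n_yz = len({(y, z) for x, y, z in points})
n_xz = len({(x, z) for x, y, z in points})

# The chain of inequalities from the proof, instantiated on the example:
print(log2(n_xy) + log2(n_yz) + log2(n_xz))  # ~7.23, which is >=
print(h_xy + h_yz + h_xz)                    # ~7.09, which is >=
print(2 * log2(n))                           # ~5.17 = 2 H(X,Y,Z), since the points are distinct
```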

Final thoughts

One may argue that the tools given here (from information theory) are just "shorthand" or "syntactic sugar" for other elementary tools. While that is somewhat legitimate, I'd say that much of deep mathematics is also just a sequence of shorthands. Information theory provides notation and basic results that can be interpreted intuitively, e.g., $H(X)$ as the amount of information needed to convey the result of $X$, $D(P \parallel Q)$ as how much one distribution deviates from another, and so on. And these can be quite useful in purer or more theoretical parts of mathematics and computer science, where we might otherwise lack direct intuition.

All the above content is just a summary of three one-hour lectures, so there is a lot more to learn in information theory. I hope I will have the chance to study the subject further when I have time, and I hope this article gives the reader some insight into information theory.