A basic communication system, Conditional independence, Markov subchain

Information theory is a subject with very rich intellectual content, and it is somewhat philosophical. It was founded by the Bell Telephone Laboratories scientist Claude Shannon with the seminal 1948 paper “A Mathematical Theory of Communication”. Information theory studies the fundamental limits of communication, in the form of transmission, storage, etc.



There are two key concepts in information theory:

  1. Information is uncertainty, and so an information source is modeled as a random variable.
  2. Information is digital: transmission should be in 0’s and 1’s, with no reference to what they represent.

Fundamental Theorems

The first theorem is the source coding theorem, which establishes the fundamental limit of data compression. The idea is that no matter how cleverly a data compression algorithm is designed, there is a minimum size below which the data cannot be compressed.

The second theorem is the channel coding theorem, which establishes the fundamental limit of reliable communication through a noisy channel. The idea is that no matter how cleverly a communication system is designed, there is a maximum rate at which information can be transmitted reliably through the channel. This quantity is called the channel capacity.

Independence and Markov Chain

X denotes a discrete random variable taking values in a set 𝒳 (script X, called the alphabet of the random variable X). Then:

  • {pX(x)} is the probability distribution for random variable X.
  • SX is the support of X, i.e., {x ∈ 𝒳 | pX(x) > 0}. If SX = 𝒳, we say that p is strictly positive. Distributions that are not strictly positive are dangerous; we need to handle them with great care.

Recall some basic definitions:

Independence. Two random variables X and Y are independent, denoted by X ⊥ Y, if p(x, y) = p(x) p(y) for all x and y, i.e., for all (x, y) ∈ 𝒳 × 𝒴.

Mutual independence. When n ≥ 3, random variables X1, X2, …, Xn are mutually independent if for all x1, x2, …, xn:

p(x1, x2, ..., xn) = p(x1) p(x2) ... p(xn)

Pairwise Independence. When n ≥ 3, random variables X1, X2, …, Xn are pairwise independent if Xi and Xj are independent for all 1 ≤ i < j ≤ n. Pairwise independence is implied by mutual independence, but not vice versa, as the sketch below illustrates.
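
To see the gap concretely, here is a minimal Python sketch of the standard XOR example (my own illustration, not from the text above): X and Y are fair bits and Z = X XOR Y, so every pair of variables is independent but the triple is not.

from itertools import product

# Standard XOR example: X, Y are fair bits and Z = X XOR Y.
# Only the 4 triples with z == x ^ y carry probability 1/4.
p = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def marginal(p, idx):
    # Marginal pmf of the coordinates listed in idx.
    m = {}
    for xs, prob in p.items():
        key = tuple(xs[i] for i in idx)
        m[key] = m.get(key, 0.0) + prob
    return m

pX, pY, pZ = (marginal(p, (i,)) for i in range(3))
pairwise = all(
    abs(marginal(p, (i, j))[(a, b)] - marginal(p, (i,))[(a,)] * marginal(p, (j,))[(b,)]) < 1e-12
    for i, j in ((0, 1), (0, 2), (1, 2)) for a, b in product((0, 1), repeat=2)
)
# p(0, 0, 0) = 1/4 but p(0) p(0) p(0) = 1/8, so not mutually independent.
mutual = abs(p[(0, 0, 0)] - pX[(0,)] * pY[(0,)] * pZ[(0,)]) < 1e-12
print("pairwise independent:", pairwise)   # True
print("mutually independent:", mutual)     # False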

Conditional Independence. For random variables X, Y, and Z, X is independent of Z conditioning on Y, denoted by X ⊥ Z | Y, if:

if p(y) > 0:
  p(x, y, z) = p(x, y) p(y, z) / p(y) = p(x, y) p(z | y)
otherwise:
  p(x, y, z) = 0

Proposition: For random variables X, Y, and Z, X ⊥ Z | Y if and only if p(x, y, z) = a(x, y) b(y, z) for all x, y, and z such that p(y) > 0. Here a(x, y) and b(y, z) are not necessarily probability distributions: a is a function that depends only on x and y, and b is a function that depends only on y and z. Note, however, that the choice of a(x, y) and b(y, z) is not unique.
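
Here is a minimal Python sketch of the definition, using a made-up joint distribution of my own: it builds p(x, y, z) = p(y) p(x|y) p(z|y) and checks that p(x, y, z) p(y) = p(x, y) p(y, z) for every triple.

from itertools import product

# Toy joint pmf p(x, y, z) constructed so that X ⊥ Z | Y holds:
# choose p(y), p(x|y), p(z|y) and multiply them (all numbers are made up).
pY = {0: 0.4, 1: 0.6}
pX_given_Y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
pZ_given_Y = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

p = {(x, y, z): pY[y] * pX_given_Y[y][x] * pZ_given_Y[y][z]
     for x, y, z in product((0, 1), repeat=3)}

def marg(p, idx):
    # Marginal pmf of the coordinates listed in idx.
    m = {}
    for xs, pr in p.items():
        k = tuple(xs[i] for i in idx)
        m[k] = m.get(k, 0.0) + pr
    return m

pXY, pYZ, pYm = marg(p, (0, 1)), marg(p, (1, 2)), marg(p, (1,))

# X ⊥ Z | Y  iff  p(x, y, z) p(y) == p(x, y) p(y, z) for all (x, y, z).
ok = all(abs(p[(x, y, z)] * pYm[(y,)] - pXY[(x, y)] * pYZ[(y, z)]) < 1e-12
         for x, y, z in product((0, 1), repeat=3))
print("X ⊥ Z | Y:", ok)   # True by construction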

Markov Chain. For random variables X1, X2, …, Xn, where n ≥ 3, X1 → X2 → … → Xn forms a Markov chain if:

if p(x2), p(x3), ... p(xn-1) > 0:
  p(x1, x2, ..., xn) = p(x1, x2) p(x3 | x2) ... p(xn | xn-1)
otherwise:
  p(x1, x2, ..., xn) = 0

X1 → X2 → X3 is equivalent to X1 ⊥ X3 | X2.
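
Below is a small numerical sketch (with made-up transition matrices of my own) that builds p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2) and confirms the equivalent condition X1 ⊥ X3 | X2.

import numpy as np

# Hypothetical initial distribution and transition matrices (my own numbers).
p1 = np.array([0.5, 0.5])                        # p(x1)
T12 = np.array([[0.9, 0.1], [0.3, 0.7]])         # p(x2 | x1)
T23 = np.array([[0.6, 0.4], [0.2, 0.8]])         # p(x3 | x2)

# Joint p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2): the Markov factorization.
joint = p1[:, None, None] * T12[:, :, None] * T23[None, :, :]

# Check X1 ⊥ X3 | X2, i.e. p(x1,x2,x3) p(x2) == p(x1,x2) p(x2,x3).
p2 = joint.sum(axis=(0, 2))          # p(x2)
p12 = joint.sum(axis=2)              # p(x1, x2)
p23 = joint.sum(axis=0)              # p(x2, x3)
lhs = joint * p2[None, :, None]
rhs = p12[:, :, None] * p23[None, :, :]
print(np.allclose(lhs, rhs))         # True: X1 -> X2 -> X3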

Proposition: X1 → X2 → … → Xn forms a Markov chain if and only if Xn → Xn-1 → … → X1 forms a Markov chain.

Proposition: X1 → X2 → … → Xn forms a Markov chain if and only if

  • X1 → X2 → X3
  • (X1, X2) → X3 → X4
  • …
  • (X1, X2, …, Xn-2) → Xn-1 → Xn

all form Markov chains.

Proposition: X1 → X2 → … → Xn forms a Markov chain if and only if

p(x1, x2, ..., xn) = f1(x1, x2) f2(x2, x3) ... fn-1(xn-1, xn)

for all x1, x2, …, xn such that p(x2), p(x3), …, p(xn-1) > 0.

Proposition: Markov Subchain. Let 𝒩n = {1, 2, …, n} be the index set, and let X1 → X2 → … → Xn form a Markov chain. For any subset α of 𝒩n, denote (Xi, i ∈ α) by Xα. Then for any disjoint subsets α1, α2, …, αm of 𝒩n such that k1 < k2 < … < km for all kj ∈ αj, j = 1, 2, …, m, Xα1 → Xα2 → … → Xαm forms a Markov chain. That is, a subchain of a Markov chain is also a Markov chain.
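
As a sanity check of this proposition, here is a sketch (with my own made-up transition matrices) that builds a chain X1 → X2 → X3 → X4, marginalizes out X2, and confirms that the subchain X1 → X3 → X4 is still a Markov chain.

import numpy as np

# Made-up chain X1 -> X2 -> X3 -> X4 on binary alphabets.
p1 = np.array([0.3, 0.7])
T = [np.array([[0.8, 0.2], [0.4, 0.6]]),   # p(x2 | x1)
     np.array([[0.5, 0.5], [0.1, 0.9]]),   # p(x3 | x2)
     np.array([[0.7, 0.3], [0.2, 0.8]])]   # p(x4 | x3)

# Joint p(x1, x2, x3, x4) via the Markov factorization.
joint = (p1[:, None, None, None]
         * T[0][:, :, None, None]
         * T[1][None, :, :, None]
         * T[2][None, None, :, :])

# Marginalize out X2 and check that X1 -> X3 -> X4 is still a Markov chain,
# i.e. X1 ⊥ X4 | X3 holds in the marginal p(x1, x3, x4).
m = joint.sum(axis=1)                      # p(x1, x3, x4)
p3 = m.sum(axis=(0, 2))                    # p(x3)
p13 = m.sum(axis=2)                        # p(x1, x3)
p34 = m.sum(axis=0)                        # p(x3, x4)
print(np.allclose(m * p3[None, :, None], p13[:, :, None] * p34[None, :, :]))  # True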



Shannon’s Information Measures

There are 4 types of Shannon’s information measures:

  • Entropy
  • Conditional entropy
  • Mutual information
  • Conditional mutual information

Entropy

Entropy measures the uncertainty of a discrete random variable. The entropy of a random variable X is defined as Hα(X) = -Σx p(x) logα p(x), where by convention the summation is taken over the support of X, i.e., SX. The entropy of a random variable is always non-negative.

The unit of entropy depends on the base α of the logarithm:

  • bit (different from a bit in computer science), if α = 2
  • nat, if α = e
  • D-it, if α = D for some positive integer D

H(X) depends only on the distribution of X but not on the actual values taken by X, hence also written as H(pX).

Entropy can also be expressed as expectation. The expectation of g(X) for some random variable X is

E[g(X)] = Σx p(x) g(x)

where the summation is over the support of X. We also have linearity of expectation, so E[ f(X) + g(X) ] = E[f(X)] + E[g(X)]. So we can write the entropy of X as below:

H(X) = -E[log p(X)] = -Σx p(x) log p(x)

Note: in probability theory, when the expectation of g(X) for a random variable X is considered, usually g(x) depends only on the value of x but not on the probability of x. In information theory, this is not the case: in H(X) = -E[log p(X)], the function inside the expectation, log p(x), depends on the distribution of X itself.
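
A minimal sketch of the entropy formula, in bits, on a four-symbol distribution (the alphabet and probabilities are my own made-up example):

from math import log2

def entropy(p):
    # H(X) in bits; the sum is over the support of X (p(x) > 0).
    return -sum(px * log2(px) for px in p.values() if px > 0)

# A made-up distribution on a 4-symbol alphabet.
pX = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(entropy(pX))   # 1.75 bits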

Binary Entropy Function

For 0 ≤ γ ≤ 1, define the binary entropy function hb(γ) = -γ log γ - (1 - γ) log(1 - γ), with the convention that 0 log 0 = 0, since lima→0 a log a = 0 (which can be seen by L’Hôpital’s rule).

For a binary random variable X with distribution {γ, 1 - γ}, H(X) = hb(γ). hb(γ) achieves its maximum value of 1 bit when γ = 1/2.
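
A small sketch of the binary entropy function, with the 0 log 0 = 0 convention handled explicitly; the sample values are my own:

from math import log2

def h_b(gamma):
    # Binary entropy function in bits, with the convention 0 log 0 = 0.
    if gamma in (0.0, 1.0):
        return 0.0
    return -gamma * log2(gamma) - (1 - gamma) * log2(1 - gamma)

print(h_b(0.5))    # 1.0  (the maximum, at gamma = 1/2)
print(h_b(0.11))   # about 0.5 bits
print(h_b(0.0))    # 0.0  (no uncertainty)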

Joint Entropy

The joint entropy H(X, Y) of a pair of random variables X and Y is defined as:

H(X, Y) = -Σx,y p(x, y) log p(x, y) = -E log p(X, Y)

Conditional Entropy

For random variables X and Y, the conditional entropy of Y given X is defined as:

H(Y|X) = -Σx,y p(x,y) log p(y|x) = -E log p(Y|X)

Note that when we cover up the conditioning on X, this reduces to the usual definition of the entropy of Y.

The conditional entropy of Y given X is equal to the sum of the conditional entropies of Y given each particular value X = x, weighted by the distribution p(x):

H(Y|X) = Σx p(x) H(Y|X=x)

Similarly, we can write H(Y|X,Z) = Σz p(z) H(Y|X, Z = z), where

H(Y|X, Z=z) = -Σx,y p(x,y|z) log p(y|x,z)

Note that when we cover up the conditioning on a particular z, this formula reduces to the usual definition of the conditional entropy of Y given X.

Proposition. H(X,Y) denotes the uncertainty contained in the random variables X and Y.

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

On the left hand side, consider that X and Y are revealed in “one step”. On the right hand side, imagine that we reveal X and Y in “two steps”: first X, then Y given X. Either way, the total amount of entropy is the same.
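
A quick numerical check of the chain rule on a made-up joint distribution p(x, y): compute H(Y|X) as the weighted sum of H(Y|X = x) and compare H(X, Y) with H(X) + H(Y|X).

from math import log2

# A made-up joint pmf p(x, y) on binary X and Y.
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(p):
    # Entropy in bits of a pmf given as {outcome: probability}.
    return -sum(v * log2(v) for v in p.values() if v > 0)

pX = {x: sum(pXY[(x, y)] for y in (0, 1)) for x in (0, 1)}

# H(Y|X) = sum_x p(x) H(Y | X = x), where H(Y | X = x) uses p(y|x) = p(x,y)/p(x).
H_Y_given_X = sum(
    pX[x] * H({y: pXY[(x, y)] / pX[x] for y in (0, 1)})
    for x in (0, 1)
)

# Chain rule: H(X, Y) = H(X) + H(Y|X).
print(round(H(pXY), 6), round(H(pX) + H_Y_given_X, 6))   # the two values agree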



Mutual Information

For random variables X and Y, the mutual information between X and Y is defined as:

I(X;Y) = Σx,y p(x,y) log( p(x,y)/[p(x)p(y)] )
= E log( p(X,Y)/[p(X)p(Y)] )

I(X;Y) is symmetrical in X and Y. Alternatively, it can be written as:

I(X;Y) = Σx,y p(x,y) log( p(x,y)/[p(x)p(y)] )
= Σx,y p(x,y) log( p(x|y)/p(x) )
= E log( p(X|Y)/p(X) )

Proposition. The mutual information between a random variable X and itself is equal to the entropy of X:

I(X;X) = H(X)

The entropy of X is sometimes called the self-information of X.

Proposition. The mutual information between X and Y is equal to the amount of information that we can obtain about X by knowing Y.

I(X;Y) = H(X) - H(X|Y)
I(X;Y) = H(Y) - H(Y|X)
⟹ I(X;Y) = H(Y) - ( H(X, Y) - H(X) )
⟹ I(X;Y) = H(X) + H(Y) - H(X, Y)

The last equation is analogous to the set-theoretic identity μ(A∩B) = μ(A) + μ(B) - μ(A∪B), where μ is an additive set function and A and B are sets.
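
A short sketch that computes I(X;Y) from its definition on a made-up joint distribution and checks the identity I(X;Y) = H(X) + H(Y) - H(X, Y):

from math import log2

# A toy joint pmf (my own numbers).
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
pX = {x: sum(pXY[(x, y)] for y in (0, 1)) for x in (0, 1)}
pY = {y: sum(pXY[(x, y)] for x in (0, 1)) for y in (0, 1)}

def H(p):
    # Entropy in bits of a pmf given as {outcome: probability}.
    return -sum(v * log2(v) for v in p.values() if v > 0)

# Definition: I(X;Y) = sum_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) ).
I = sum(pr * log2(pr / (pX[x] * pY[y])) for (x, y), pr in pXY.items() if pr > 0)

# Identity: I(X;Y) = H(X) + H(Y) - H(X, Y).
print(round(I, 6), round(H(pX) + H(pY) - H(pXY), 6))   # the two values agree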

(Figure: information diagram)

Conditional Mutual Information

For random variables X, Y and Z, the mutual information between X and Y conditioning on Z is defined as

I(X;Y|Z) = Σx,y,z p(x,y,z) log( p(x,y|z) / [ p(x|z) p(y|z) ] )
= E log( p(X,Y|Z) / [ p(X|Z) p(Y|Z) ] )

I(X;Y|Z) is symmetrical in X and Y. Similar to conditional entropy, we have I(X;Y|Z) = Σz p(z) I(X;Y | Z=z), where

I(X;Y | Z=z) = Σx,y p(x,y|z) log( p(x,y|z) / [p(x|z) p(y|z)] )

Proposition. The mutual information between a random variable X and itself conditioning on a random variable Z is equal to the conditional entropy of X given Z, i.e. I(X;X | Z) = H(X | Z).

Proposition.

I(X;Y | Z) = H(X | Z) - H(X | Y,Z)
I(X;Y | Z) = H(Y | Z) - H(Y | X,Z)
I(X;Y | Z) = H(Y | Z) + H(X | Z) - H(X, Y |Z)
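
A sketch that computes I(X;Y|Z) from its definition on a made-up, strictly positive joint distribution p(x, y, z) and checks the last identity above (using H(X|Z) = H(X, Z) - H(Z), which follows from the chain rule):

from math import log2
from itertools import product

# A made-up, strictly positive joint pmf p(x, y, z) on binary variables.
weights = dict(zip(product((0, 1), repeat=3), [3, 1, 2, 2, 1, 4, 2, 1]))
total = sum(weights.values())
p = {xyz: w / total for xyz, w in weights.items()}

def marg(p, idx):
    # Marginal pmf of the coordinates listed in idx.
    m = {}
    for xs, pr in p.items():
        k = tuple(xs[i] for i in idx)
        m[k] = m.get(k, 0.0) + pr
    return m

def H(pm):
    return -sum(v * log2(v) for v in pm.values() if v > 0)

pXZ, pYZ, pZ = marg(p, (0, 2)), marg(p, (1, 2)), marg(p, (2,))

# I(X;Y|Z) = sum p(x,y,z) log( p(x,y|z) / (p(x|z) p(y|z)) )
#          = sum p(x,y,z) log( p(x,y,z) p(z) / (p(x,z) p(y,z)) ).
I_XY_given_Z = sum(
    pr * log2(pr * pZ[(z,)] / (pXZ[(x, z)] * pYZ[(y, z)]))
    for (x, y, z), pr in p.items() if pr > 0
)

# Check against I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z),
# using H(X|Z) = H(X,Z) - H(Z), etc.
rhs = (H(pXZ) - H(pZ)) + (H(pYZ) - H(pZ)) - (H(p) - H(pZ))
print(round(I_XY_given_Z, 6), round(rhs, 6))   # the two values agree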


For more on Shannon’s Information Measures, please refer to the wonderful course here https://www.coursera.org/learn/information-theory




I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai

Don't forget to sign up for the newsletter so you don't miss any chance to learn.

Or share what you've learned with friends!