# Mutual information

Peter E. Latham and Yasser Roudi (2009), Scholarpedia, 4(1):1658. | doi:10.4249/scholarpedia.1658 | revision #186917 [link to/cite this article] |

**Mutual information** is one of many quantities that measures how much one random variables tells us about another. It is a dimensionless quantity with (generally) units of bits, and can be thought of as the reduction in uncertainty about one random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.

## Contents |

## Definition

For two discrete variables \(X\) and \(Y\) whose joint probability distribution is \(P_{XY}(x,y)\ ,\) the mutual information between them, denoted \(I(X;Y)\ ,\) is given by (Shannon and Weaver, 1949; Cover and Thomas, 1991)

\[\tag{1} I(X;Y) = \sum_{x,y} P_{XY}(x,y) \log {P_{XY}(x,y) \over P_X(x) P_Y(y)} = E_{P_{XY}} \log{P_{XY} \over P_X P_Y} \, . \]

Here \(P_X(x)\) and \(P_Y(y)\) are the marginals\[P_X(x) = \sum_y P_{XY}(x,y)\] and \(P_Y(y) = \sum_x P_{XY}(x,y)\) and \(E_P\) is the expected value over the distribution \(P\ .\) The focus here is on discrete variables, but most results derived for discrete variables extend very naturally to continuous ones – one simply replaces sums by integrals. One should be aware, though, that the formal replacement of sums by integrals hides a great deal of subtlety, and, for distributions that are not sufficiently smooth, may not even work. See (Gray, 1990) for details.

The units of information depend on the base of the logarithm. If base 2 is used (the most common, and the one used here), information is measured in bits. For other bases, see Table 1.

base | units | conversion |
---|---|---|

2 | bits | 1 bit = 1 bit |

\(e\) | nats | 1 bit = \(\log_e 2 \ (\approx 0.693)\) nats |

10 | bans | 1 bit = \(\log_{10} 2 \ (\approx 0.301)\) bans |

## Interpretation

To understand what \(I(X;Y)\) actually means, we first need to define
*entropy* and *conditional entropy*.

Qualitatively, entropy is a measure of uncertainty – the higher the entropy, the more uncertain one is about a random variable. This statement was made quantitative by Shannon. He postulated that a measure of uncertainty of a random variable \(X\) should be a continuous function of its probability distribution \(P_{X}(x)\) and should satisfy the following conditions:

- It should be maximal when \(P_{X}(x)\) is uniform, and in this case it should increase with the number of possible values \(X\) can take;

- It should remain the same if we reorder the probabilities assigned to different values of \(X\ ;\)

- The uncertainty about two independent random variables should be the sum of the uncertainties about each of them.

He then showed that the only measure of uncertainty that satisfies all these conditions is the entropy, defined as

\[\tag{2} H(X)=-\sum_x P_X(x) \log P_X(x) = - E_{P_X} \log P_X \]

Although not particularly obvious from this equation, \(H(X)\) has a very concrete interpretation: Suppose \(x\) is chosen randomly from the distribution \(P_X(x)\ ,\) and someone who knows the distribution \(P_X(x)\) is asked to guess which \(x\) was chosen by asking only yes/no questions. If the guesser uses the optimal question-asking strategy, which is to divide the probability in half on each guess by asking questions like "is \(x\) greater than \(x_0\ ?\)", then the average number of yes/no questions it takes to guess \(x\) lies between \(H(X)\) and \(H(X)+1\) (Cover and Thomas, 1991). This gives quantitative meaning to "uncertainty": it is the number of yes/no questions it takes to guess a random variables, given knowledge of the underlying distribution and taking the optimal question-asking strategy. See the Scholarpedia article on entropy for an example.

The conditional entropy is the average uncertainty about \(X\) after observing a second random variable \(Y\ ,\) and is given by

\[\tag{3} H(X|Y)=\sum_y P_Y(y) \left[ - \sum_x P_{X|Y}(x|y) \log \left(P_{X|Y}(x|y)\right)\right] = E_{P_Y} \left[ - E_{P_{X|Y}} \log P_{X|Y} \right] \]

where \(P_{X|Y}(x|y) (\equiv P_{XY}(x,y)/P_Y(y))\) is the conditional probability of \(x\) given \(y\ .\)

With the definitions of \(H(X)\) and \(H(X|Y)\ ,\) Eq. (1) can be written

\[\tag{4} I(X;Y)=H(X)-H(X|Y). \]

Mutual information is therefore the *reduction* in uncertainty about variable \(X\ ,\) or the expected reduction in the number of yes/no questions needed to guess \(X\) after observing \(Y\ .\) Note that the yes/no question interpretation even applies to continuous variables: although it takes an infinite number of questions to guess a continuous variable, the * difference* in the number of yes/no questions it takes to guess \(X\) before versus after observing \(Y\) may be finite, and is the mutual information. While problems can arise when going from discrete to continuous variables (subtracting infinities is always dangerous), they rarely do in practice; see (Gray, 1990).

## Properties

- Mutual information is intimately related to the Kullback-Leibler divergence (Cover and Thomas, 1991), a very natural measure of the "distance" between two distributions. It is defined as follows: for any two distributions \(P(z)\) and \(Q(z)\ ,\)

\[ D_{KL} \Big( P(z)||Q(z) \Big) \equiv \sum_z P(z) \log \left[ {P(z) \over Q(z)} \right] \, . \]

Distance is in quotes because the Kullback-Leibler divergence is not a true distance: it is not symmetric, and it does not obey the triangle inequality (Cover and Thomas, 1991). It is not hard to show that \(D_{KL} \Big(P(z)||Q(z) \Big)\) is non-negative, and zero if and only if \(P(z) = Q(z)\ .\)

From Eq. (1), it is easy to see that the mutual information is just the Kullback-Leibler distance between the joint distribution, \(P_{XY}(x,y)\ ,\) and the product of the independent ones, \(P_X(x)P_Y(y)\ ,\)

\[ I(X;Y) = D_{KL} \Big( P_{XY}(x,y)||P_X(x)P_Y(y) \Big) \, . \]

Thus, another way to think about mutual information is that it is a measure of how close the true joint distribution of \(X\) and \(Y\) is to the
independent joint distribution. Because the Kullback-Leibler distance, and thus the mutual information, is zero if and only if \(P_{XY}(x,y) = P_X(x)P_Y(y)\ ,\) it follows that the mutual information captures *all* dependencies between random variables, not just, say, second order ones, as captured by the covariance.

- Mutual information is symmetric\[I(X;Y)=I(Y;X)\ .\] This follows directly from the definition, Eq. (1).

- Mutual information is
*additive*for independent variables. More quantitatively, if \(P_{XYWZ}(x,y,w,z) = P_{XY}(x,y) P_{WZ}(w,z)\ ,\) then

\[ I(X,W; Y,Z) = I(X;Y) + I(W;Z) \, . \]

This follows easily from the definition of the mutual information. It's also a property that information *should* have. For example, suppose \(X\) and \(Y\) are the cost of a bottle of wine and how it tastes, and \(W\) and \(Z\) are the height and weight of tree squirrels. If cost provides 1 bit of information about taste and height provides 2 bits of information about weight, then it makes sense that cost+weight should provide 3 bits of information about taste+height. Assuming, of course, that squirrels (and their weights) do not influence wines (and their prices).

- The Data Processing Inequality (DPI) states, loosely, that post-processing cannot increase information. More quantitatively, consider two random variables, \(X\) and \(Y\ ,\) whose mutual information is \(I(X,Y)\ .\) Now consider a third random variable, \(Z\ ,\) that is a (probabilistic) function of \(Y\)
*only*. The*only*qualifier means \(P_{Z|XY}(z|x,y)=P_{Z|Y}(z|y)\ ,\) which in turn implies that \(P_{X|YZ}(x|y,z)=P_{X|Y}(x|y)\ ,\) as is easy to show using Bayes' theorem. The DPI states that \(Z\) cannot have more information about \(X\) than \(Y\) has about \(X\ ;\) that is,

\[ I(X;Z) \le I(X;Y) \, . \]

This inequality, which again is a property information *should* have, is easy to prove,

\[ I(X;Z) = H(X)-H(X|Z) \leq H(X) - H(X|Y,Z) = H(X) - H(X|Y) = I(X;Y). \]

The first equality follows from Eq. (4); the inequality follows because conditioning on an extra variable (in this case \(Y\) as well as \(Z\)) can only decrease entropy, and the second to last equality follows because \(P_{X|YZ}(x|y,z)=P_{X|Y}(x|y)\ .\)

## The channel coding theorem

Mutual Information is just one way among many of measuring how related two random variables are. However, it is a measure ideally suited for analyzing communication channels.

Abstractly, a communication channel can be visualized as a transmission medium which receives an input \(x\) and produces an output \(y\ .\) If the channel is *noiseless*, the output will be equal to the input. However, in general, the transmission medium is noisy and an input \(x\) is converted to an output \(y\) with probability \(P_{Y|X}(y|x)\ .\)

Given a communication channel, one can transmit any message \(\mathbf{s}\) from a set of \(M\) possible messages by performing the following three steps:

- To each message \(\mathbf{s}\) assign a string \(\mathbf{x}(\mathbf{s})=(x_1,\dots,x_n) \) of length \(n\ .\) Each \(\mathbf{x}(s)\) is called a
*codeword*. The deterministic mapping from the set of messages to the set of codewords is called the*encoding function*. - To transmit \(\mathbf{s}\ ,\) transmit the corresponding string \(\mathbf{x}(\mathbf{s})\) over the channel. This yields an output string \(\mathbf{y}\ ,\) also of length \(n\ .\) For the so called
*memoryless channles*, each \(x_i\) in the input string is mapped to an output \(y_i\) with probability \(P_{Y|X}(y|x)\ ,\) independent of the other \(x_j\ .\) - The output string \(\mathbf{y}\) is used to recover the transmitted message, using a deterministic decoding function. The decoding function maps each \(\mathbf{y}\) to one symbol \(\mathbf{s'}\ .\)

There are two elements of a communication channel that should be emphasized. First, the number of messages is typically much less than the number of possible messages. For example, if the \(x_i\) were binary, for which the number of possible messages is \(2^n\ ,\) then one would typically have \(M \ll 2^n \) (see Figure 1). Second, for each message, the \(x_i\) are chosen randomly, and independently, from a distribution denoted, as usual, \(P_X(x)\ .\) When designing a channel, then, one has control over only two quantities: \(M\) and \(P_X(x)\ .\) In fact, one usually adjusts \(P_X(x)\) to make the number of messages, \(M\ ,\) large while at the same time keeping the error rate - the rate at which messages are decoded incorrectly - small. Note that the conditional distribution, \(P_{Y|X}(y|x)\ ,\) is a physical property of the channel itself, and thus not under the control of the designer.

The three steps of the communication channel are illustrated as

\[ \mathbf{s} \ \mathbf{\xrightarrow{Encoding}} \ x_1 x_2 ... x_n\ \rightarrow \Big[ \ {\rm noisy \ channel} \! : P_{Y|X}(y|x) \ \Big] \rightarrow y_1 y_2 ... y_n\ \mathbf{\xrightarrow{Decoding}} \ \mathbf{s'} \, . \]

One would like \(\mathbf{s'}\) to be the same as \(\mathbf{s}\ ,\) but, because of noise, there is no way to absolutely guarantee this. However, the **channel coding theorem** (Shannon and Weaver, 1949) states that one can come close, so long as \(M\) is not too large. Moreover - and this is where the link between mutual information and communication channels arises - the maximum number of messages that can be transmitted almost error-free is a function of the mutual information between \(X\) and \(Y\ .\) Specifically, the channel coding theorem states the following:

First, define the channel capacity, \(C\ ,\) as the maximum mutual information with respect to the input distribution, \(P_X\ ,\)

\[ C=\max_{p_X} I(X;Y) \, . \]

Then, assume that \(M = 2^{nR}\) messages are encoded in strings of length \(n\ .\) If \(R < C\ ,\) the probability of an error (i.e., the probability that \(\mathbf{s} \ne \mathbf{s'}\)) is exponentially small in \(n\ .\) If, on the other hand, \(R \ge C\ ,\) then errors are likely. More formally,

For \(R=\frac{\log(M)}{n}<C \ ,\) it is possible to transmit messages with a maximum probability of error which is exponentially small in \(n\ .\) Conversely, if \(R>C \ ,\) this is not possible.

In the above theorem, \(R\) is called the *rate*.

Although at first glance the ability to transmit information over a noisy channel almost error free looks mysterious, it's based on a very simple principle: send only a subset of the possible messages (that is, choose \(M\) such that \(M < 2^{nC}\)). The idea is illustrated in Figure 1. This figures shows a set of input messages, represented by the dots in the left hand panels. Because of noise, a particular input message could map to any one of the messages inside the circles in the right hand panels. If one used all possible messages, as in Figure 1a, and tried to decode - to map \(Y\) back to \(X\) - one would make errors. For example, the red message on the right hand side of Figure 1a could be mapped back to any of three input messages. If, on the other hand, one used only a subset of the messages, as in Figure 1b, then decoding could be done perfectly. Although this main idea is simple to grasp, the power of the channel coding theorem lies in the fact that the maximum number of messages that can be transmitted almost error free increases *exponentially* with \(n\) (\(M \sim 2^{nC}\)).

A detailed proof of the channel coding theorem is beyond the scope of this article and we refer the readers to the relevant literature (Shannon and Weaver 1949, Cover and Thomas 1990). It is, however, possible to see the relation between the rate of transmission and the probability of error quite intuitively, as shown below.

Suppose that we want to transmit \( M \) messages using codewords of length \( n \ .\) If we send a codeword \(\mathbf{x}^\mathrm{true} \equiv (x_1^\mathrm{true}, x_2^\mathrm{true}, ..., x_n^\mathrm{true})\ ,\) and it is converted to \(\mathbf{y} \equiv (y_1, y_2, ..., y_n)\) on the output side, there is a very natural way to translate from \(\mathbf{y}\) back to \(\mathbf{x}\ :\) choose the message most likely to have caused it (Mézard and Montanari, 2009). That is, compute \(p(\mathbf{y}|\mathbf{x})\) - the probability that \(\mathbf{y}\) was produced by codeword \(\mathbf{x}\) - for all codewords, and choose the one that maximizes this probability. We denote the probability of mistakenly decoding a randomly chosen codeword, \(\mathbf{x}\ ,\) by \(P_n\ .\) This is defined as

\[ P_n = \mathrm{Prob} \left[ p(\mathbf{y}|\mathbf{x}) > p(\mathbf{y}|\mathbf{x}^\mathrm{true}) \right] \, . \]

Since \(\mathbf{x}\) was chosen randomly in the above expression, it follows that the total probability of an error when decoding messages, denoted \(P_{tot}\ ,\) is given by

\[ P_{tot} = 1 - (1-P_n)^{M} \approx 1 - \exp[-M P_n] \, . \]

For \(P_{tot}\) to be small, \(M P_n\) must be much less than one. This provides a bound on the number of messages, \( M \ ,\) that can be sent almost error free using codewords of length \( n \)

\[\tag{5} M \ll 1/P_n \, . \]

It is a straightforward exercise in large deviation theory (see Chapter 12 of Cover & Thomas, 1991) to show that \(P_n\) decreases exponentially with \( n \ .\) In fact, as motivated in the example below, in the large \(n\) limit, \(P_n\) is given by

\[ P_n = 2^{-nI(X;Y)} \, . \]

If we now let

\[ M = 2^{n[I(X;Y)-\epsilon]} \]

with \(\epsilon\) positive, we have automatically satisfied Eq. (5), in the large \(n\) limit. Thus, this analysis implies that \(2^{nI(X;Y)}\) is an upper bound on the number of messages of length \(n\) can be sent almost error free. As \(n\) increases, one comes closer to this bound.

The connection between mutual information and the number of messages that can be sent is a deep one, but it turns out to be fairly easy to understand, as can be seen with the following example.

Consider a communication channel that transmits 0s and 1s, and transmits them correctly with probability \(q\ ;\) that is,

\[ \begin{array}{ll} P_{Y|X}(1,1) & = q \\ P_{Y|X}(0,0) & = q \, . \end{array} \]

For simplicity assume that \(P_X(0)=P_X(1)=1/2\ ,\) which in turn implies that \(P_{X|Y}(1,1) = P_{X|Y}(0,0) = q.\) The mutual information for this distribution is given by

\[\tag{6} \begin{array}{ll} I(X;Y) & = H(X) - H(X|Y) \\ & = -(1/2) \log(1/2) -(1/2) \log(1/2) -(1/2)[-q \log q - (1-q) \log(1-q)] -(1/2)[-q \log q - (1-q) \log(1-q)] \\ & = -\log(1/2) + q \log q + (1-q)\log(1-q) \end{array} \]

As usual, one constructs a codebook containing only a subset of possible strings of 0s and 1s; that is, for strings of length \(n\ ,\) the codebook contain far fewer than \(2^n\) messages. Suppose a particular codeword, \(\mathbf{y} \equiv (y_1, y_2, ..., y_n)\ ,\) was received. How would one figure out which message was sent? The idea is to make use of the fact that if a 1 was received then there is a probability \(q\) that a 1 was sent, and similarly for 0. For definiteness, take \(q=0.9\ .\) Then, one simply looks for the codeword that has approximately 90% 1s at positions in the string where the received symbol was 1, and approximately 90% 0s at positions in the string where the received symbol was 0. For example, consider following received message and two candidate codewords,

\[ {\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1}{\color{red}0} \ \ \mathrm{received \ message} \]

\[ {\color{red}0}{\color{blue}1}{\color{blue}1}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{red}0}{\color{green}0}{\color{red}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{green}0}{\color{blue}1}{\color{red}0} \ \ \mathrm{likely \ codeword} \]

\[ {\color{red}0}{\color{green}0}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{green}0}{\color{green}1}{\color{green}0}{\color{blue}1}{\color{green}0} {\color{blue}1}{\color{blue}1}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{green}1}{\color{red}0}{\color{green}0} {\color{blue}1}{\color{green}1}{\color{blue}1}{\color{green}0}{\color{red}0}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{green}0}{\color{green}0} {\color{red}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{green}1}{\color{green}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{green}0}{\color{red}0} \ \ \mathrm{unlikely \ codeword} \]

Incorrect matches are color-coded in green. For the likely codeword, only about 10% of the symbols are green, whereas for the unlikely codeword, about 50% are green.

There are two points to this example. One is that the probability of making an error - of incorrectly decoding a message - is about equal to the probability that a randomly chosen message, one in which the probability of a 0 or 1 in each position is 1/2, will have 90% of its 1s and 90% of its zeros in just the right places. The other is that this probability is not hard to calculate: it is the same as the probability of flipping \(n\) coins, and having the first \(n/2\) come up 90% heads and the second \(n/2\) come up 90% tails. This probability (with 0.9 replaced by q) is given by

\[ P_n = \left[(1/2)^{n/2} \frac{(n/2)!}{(qn/2)!((1-q)n/2)!} \right]^2 \, . \]

The first factor, \((1/2)^{n/2}\) is the probability of a particular message occurring; the second enumerates all the ways to arrange the symbols. Using Stirling's formula and taking the large \(n\) limit, this expression can be written

\[ P_n \approx 2^{-n[- \log_2 (1/2) + q \log_2 q + (1-q) \log_2 (1-q)]} = 2^{-nI(X;Y)} \, . \]

where the last equality follows from Eq. (6). This analysis, which required only the use of Stirling's formula (and can be easily extended to the general case), provides a link between information theory and communications channels.

Note that two things are missing from this example. First, although we have computed the maximum number of messages (\(1/P_n\ ,\) or \(2^{nI(X;Y)}\)), we have not discussed how to choose the messages. One way, which is close to optimal, is to generate each of the \(2^{n[I(X;Y)-\epsilon]}\) messages by randomly choosing the \(x_i\) from the distribution \(P_X(x)\ .\) Second, we have not discussed how to optimize mutual information with respect to \(P_X(x)\ .\) In general this must be done numerically. For the above example, though, there it is simple to find an analytic solution. Start by writing information as

\[\tag{7} I(X;Y) = H(Y) - H(Y|X) = h(q P_X(1) + (1-q)(1-P_X(1)) - h(q) \, . \]

where

\[ h(q) \equiv -q \log q - (1-q) \log (1-q) \]

and the second equality in Eq. (7) follows from a small amount of algebra. As is easy to see, \(h(q)\) is maximum when \(q=1/2\ ,\) at which point \(h(1/2)=1\ .\) For the first appearance of \(h\) in Eq. (7), this maximum occurs when \(P_X(1)=1/2\ .\) Thus, this channel is optimal when 0 and 1 appear on the input end with equal probability, and the channel capacity is \(1-h(q)\ .\)

While this fundamental theorem puts bounds on transmitted information, it does not tell us how to reach those bounds in practical situations. An active area of research is designing codes - known as error correcting codes - that come as close as possible to the limit given by mutual information.

## References

- Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons, New York, NY.
- Gray, R.M. (1990). Entropy and Information Theory. Springer-Verlag, New York, NY.
- Nirenberg, S. and Latham, P.E. (2003). Decoding neuronal spike trains: how important are correlations? Proc. Natl. Acad. Sci. 100:7348-7353.
- Shannon, C.E. and Weaver, W. (1949). The mathematical theory of communication. University of Illinois Press, Urbana, Illinois.
- Mézard, M. and Monatanari, A. (2009). Information, Physics, and Computation. Oxford University Press, Oxford.

**Internal references**

- Nestor A. Schmajuk (2008) Classical conditioning. Scholarpedia, 3(3):2316.

- Tomasz Downarowicz (2007) Entropy. Scholarpedia, 2(11):3901.

## Further reading

- Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons, New York, NY.
- Mézard, M. and Monatanari, A. (2009). Information, Physics, and Computation. Oxford University Press, Oxford.

## External links

- Peter E. Latham's website
- Yasser Roudi's website
- Mutual Information (Wikipedia)
- Error correcting codes (Wikipedia)