Written by Alexei Gilchrist, updated
Find the probability distribution that maximises the entropy subject to requiring some averages to be fixed.
Level: 3, Subjects: Probability

1 Introduction

In the following treatment the setting is this: we have a problem described by a variable \(x\) that can take on the possibilities \(\{x_1,x_2,\ldots,x_n \}\), and we wish to assign a probability to each possibility. Without any further information we would be justified in assigning them equal probability: we have no reason to distinguish the problem from any related problem where the labels have been permuted, and equal probabilities are the only assignment invariant under all such permutations. This is the principle of “insufficient reason”. However, imagine that we also know the values of some averages that \(x\) must satisfy, furnished perhaps by prior experience with the problem or by other considerations.

These averages specify constraints on the problem that have the form \begin{align} \langle f_k(x) \rangle = \sum_j p_j f_k(x_j) = \bar{f}_k, \end{align} where the \(f_k\) can be arbitrary functions and the known averages of those functions are the \(\bar{f}_k\). Say there are \(M\) such constraints, so \(k\in[1\ldots M]\). Subject to these constraints (and to the probabilities summing to 1) we want to find the probabilities that maximise the entropy of the resulting distribution, so that we are making the most conservative assignment of probabilities.
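As a concrete (hypothetical) illustration of such a constraint, take a six-sided die with \(x_j\in\{1,\ldots,6\}\) and a single constraint function \(f_1(x)=x\), so that the known average is the mean face value:

```python
import math

# Hypothetical example: a six-sided die, x_j in {1,...,6}, with one
# constraint function f_1(x) = x (the average face value).
xs = [1, 2, 3, 4, 5, 6]
ps = [1 / 6] * 6                    # the uniform assignment, for illustration

# The constraint <f_1(x)> = sum_j p_j f_1(x_j)
f_bar = sum(p * x for p, x in zip(ps, xs))
print(f_bar)  # -> 3.5 for the uniform distribution
```

A known average other than 3.5 is exactly the kind of extra information that will force a non-uniform assignment below.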

2 Maximising the entropy

The way to proceed is by using Lagrange multipliers \(\lambda_k\), solving the equations \begin{align} \frac{\partial}{\partial p_i}\left\{-\sum_j p_j\ln p_j- \sum_k \lambda_k\left(\sum_j p_j f_k(x_j)-\bar{f}_k\right) -\mu\left(\sum_j p_j -1\right)\right\} = 0. \end{align} The first term is the entropy, the middle terms fix the constraints to their mean values, and the final term fixes the sum of the probabilities to unity. Performing the differentiation we get \(n\) decoupled equations (one for each \(p_i\)) that are identical in form, \begin{align} -\ln p_i -1 -\sum_k \lambda_k f_k(x_i)-\mu=0, \end{align} with the solutions \begin{align} p_i = \exp\left(-\lambda_0- \sum_k \lambda_k f_k(x_i)\right), \end{align} where \(\lambda_0=\mu+1\), and the \(\lambda\)'s are constants we still need to evaluate.

The constant \(\lambda_0\) can be evaluated immediately by making use of the constraint that the probabilities have to add to one: \begin{align} 1=\sum_i p_i = e^{-\lambda_0}\sum_i e^{- \sum_k \lambda_k f_k(x_i)}, \end{align} so \begin{align} e^{\lambda_0} = \sum_i e^{- \sum_k \lambda_k f_k(x_i)} \equiv Z(\lambda_1,\ldots,\lambda_M). \end{align} \(Z(\lambda_1,\ldots,\lambda_M)\) is known as the partition function in other contexts, and the probabilities we should assign subject to the average constraints are \begin{equation} p_i = \frac{1}{Z}e^{- \sum_k \lambda_k f_k(x_i)}. \end{equation} We still have \(M\) constraints that can be used to evaluate the remaining \(M\) Lagrange multipliers, but it's necessary to know the specific form of the functions \(f_k\) to proceed further.
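The mechanics of this solution are easy to see numerically. Below is a minimal sketch for the single-constraint die example, with an arbitrarily chosen value of \(\lambda_1\) (here called `lam`, an assumption purely for illustration): the partition function \(Z\) guarantees normalisation whatever \(\lambda_1\) is.

```python
import math

# Maximum-entropy assignment for a single constraint f_1(x) = x on a
# six-sided die. lam is chosen arbitrarily just to show the mechanics.
xs = [1, 2, 3, 4, 5, 6]
lam = 0.3

Z = sum(math.exp(-lam * x) for x in xs)       # partition function
ps = [math.exp(-lam * x) / Z for x in xs]     # p_i = e^{-lam f(x_i)} / Z

print(sum(ps))          # -> 1.0: normalised by construction
print(ps[0] > ps[-1])   # -> True: positive lam favours small x
```

The remaining step, fixing `lam` so that the constraint average comes out right, is what the next section addresses.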

What we have just derived is really quite powerful as will become evident when we look at some examples. Before we do that though, let's look at some properties of the solution we've found.

3 Properties

First, taking derivatives of the logarithm of the partition function with respect to the Lagrange multipliers yields information on the moments of the corresponding function (remember \(Z=Z(\lambda_1,\ldots,\lambda_M)\)): \begin{align*} -\frac{\partial}{\partial \lambda_j}\ln Z =& -\frac{1}{Z} \frac{\partial Z}{\partial \lambda_j} = - \frac{1}{Z} \frac{\partial}{\partial \lambda_j} \sum_i e^{ -\sum_k \lambda_k f_k(x_i)} \\ =& \frac{1}{Z} \sum_i f_j(x_i)e^{ -\sum_k \lambda_k f_k(x_i)} \\ =& \sum_i f_j(x_i) p_i \\ =& \langle f_j(x)\rangle. \end{align*} So in order to determine the rest of the Lagrange multipliers we need to solve the set of \(M\) equations \begin{equation} -\frac{\partial}{\partial \lambda_j}\ln Z = \bar{f}_j. \end{equation}
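For a concrete case these equations can be solved numerically. Below is a minimal sketch, assuming a single constraint \(f_1(x)=x\) on a six-sided die with a known average of \(4.5\) (a hypothetical choice): since \(\langle f(x)\rangle=-\partial\ln Z/\partial\lambda\) is monotonically decreasing in \(\lambda\) (its derivative is minus the variance), simple bisection finds the multiplier.

```python
import math

xs = [1, 2, 3, 4, 5, 6]

def mean_f(lam):
    """<f(x)> = -d ln Z / d lam for f(x) = x, computed directly."""
    ws = [math.exp(-lam * x) for x in xs]
    Z = sum(ws)
    return sum(x * w for x, w in zip(xs, ws)) / Z

def solve_lambda(f_bar, lo=-10.0, hi=10.0, tol=1e-12):
    """Bisection: mean_f is monotonically decreasing in lam."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_f(mid) > f_bar:
            lo = mid              # mean too large, so increase lam
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

lam = solve_lambda(4.5)           # known average 4.5 > 3.5, so lam < 0
print(round(mean_f(lam), 6))      # -> 4.5
```

With several constraints the same idea applies, but the \(M\) equations are coupled and a multidimensional root-finder is needed.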

Alternatively, the maximum entropy that is obtained is \begin{align} H_\text{max}(\bar{f}_1,\ldots,\bar{f}_M) &= -\sum_j p_j \ln p_j \\ &= -\sum_j p_j \left(-\sum_k \lambda_k f_k(x_j) - \ln Z\right)\\ &=\sum_k \lambda_k\bar{f}_k + \ln Z. \end{align} Then, thinking of the Lagrange multipliers as functions of the mean values, \(\lambda_k\equiv\lambda_k(\bar{f}_1,\ldots,\bar{f}_M)\), we have \begin{align} \frac{\partial H_\text{max}}{\partial \bar{f}_j} & = \sum_k \frac{\partial \lambda_k}{\partial \bar{f}_j}\bar{f}_k + \lambda_j + \sum_k \frac{\partial \ln Z}{\partial \lambda_k}\frac{\partial \lambda_k}{\partial \bar{f}_j} \\ & = \sum_k \frac{\partial \lambda_k}{\partial \bar{f}_j}\bar{f}_k + \lambda_j - \sum_k \langle f_k(x) \rangle\frac{\partial \lambda_k}{\partial \bar{f}_j}\\ & = \lambda_j. \end{align} So if we knew the entropy in terms of the mean values we could evaluate the Lagrange multipliers.
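The identity \(H_\text{max}=\sum_k\lambda_k\bar{f}_k+\ln Z\) is exact, which makes it easy to check numerically. A sketch for the single-constraint die example (with an arbitrarily chosen multiplier value, purely for illustration):

```python
import math

# Check that -sum p ln p equals lam * f_bar + ln Z for the
# hypothetical single-constraint die example.
xs = [1, 2, 3, 4, 5, 6]
lam = -0.37                       # an arbitrary multiplier value

Z = sum(math.exp(-lam * x) for x in xs)
ps = [math.exp(-lam * x) / Z for x in xs]
f_bar = sum(p * x for p, x in zip(ps, xs))

H_direct = -sum(p * math.log(p) for p in ps)      # entropy from the p_i
H_formula = lam * f_bar + math.log(Z)             # the derived identity
print(abs(H_direct - H_formula) < 1e-12)          # -> True
```

The agreement is exact up to floating-point rounding, since \(\ln p_i = -\lambda f(x_i) - \ln Z\) term by term.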

Differentiating a second time gives the variance of \(f_j\) (apply the quotient rule to the second line of the previous calculation), though note the sign change: here the leading minus is absent. \begin{align*} \frac{\partial^2}{\partial \lambda_j^2}\ln Z =& \frac{1}{Z}\sum_i f_j(x_i)^2e^{ -\sum_k \lambda_k f_k(x_i)} -\frac{1}{Z^2}\left(\sum_i f_j(x_i)e^{ -\sum_k \lambda_k f_k(x_i)}\right)^2 \\ =&\sum_i f_j(x_i)^2 p_i -\left(\sum_i f_j(x_i) p_i \right)^2\\ =& \langle f_j(x)^2\rangle - \langle f_j(x)\rangle^2\\ =&\text{var}(f_j(x)). \end{align*}
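This too can be verified numerically: a central finite difference of \(\ln Z\) should reproduce the variance computed directly from the distribution. A sketch for the single-constraint die example (multiplier and step size chosen arbitrarily for illustration):

```python
import math

# Finite-difference check that d^2(ln Z)/d lam^2 equals var(f) for the
# hypothetical die example with f(x) = x.
xs = [1, 2, 3, 4, 5, 6]
lam, h = 0.2, 1e-4

def lnZ(l):
    return math.log(sum(math.exp(-l * x) for x in xs))

# var(f) computed directly from the maximum-entropy distribution at lam
ws = [math.exp(-lam * x) for x in xs]
Z = sum(ws)
ps = [w / Z for w in ws]
m1 = sum(p * x for p, x in zip(ps, xs))           # <f>
m2 = sum(p * x * x for p, x in zip(ps, xs))       # <f^2>
var_f = m2 - m1 ** 2

# central second difference of ln Z
d2 = (lnZ(lam + h) - 2 * lnZ(lam) + lnZ(lam - h)) / h ** 2
print(abs(d2 - var_f) < 1e-6)                     # -> True
```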