This section gives a bit of theory background about the multivariate normal distribution, the goal is to have all formulae used in the MultiNorm methods documented here.

The Wikipedia normal distribution and the Wikipedia multivariate normal pages contain most of what we use, so do most textbooks that cover the multivariate normal distribution.

Here we just give the results, we don’t derive them.


The multivariate normal distribution has very nice mathematical properties. Most operations are given by linear algebra. Every derived quantity follows either again a multivariate normal distribution or a chi-squared distribution.

Probability density

The (univariate) normal distribution \(x \sim N(\mu, \sigma^2)\) for a variable \(x\) with mean \(\mu\) and variance \(\sigma^2\) has the probability density function (pdf)

\[f(x) = \frac{1}{\sqrt{(2 \pi) \sigma^2}} \exp\left( -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}\right).\]

The variance \(\Sigma\) is the square of the standard deviation \(\sigma\):

\[\Sigma = \sigma ^ 2\]

The multivariate normal distribution \(x \sim N(\mu, \Sigma)\) for a variable \(x\) (vector of length N), with mean \(\mu\) (vector of length N) and covariance matrix \(\Sigma\) (square symmetric matrix of shape (N, N)) has the pdf:

\[f(x) = \frac{1}{\sqrt{(2 \pi)^N \det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),\]

Note that the univariate distribution is a special case of the multivariate distribution for N=1, one just has to write \((x - \mu)^2 / \sigma^2\) as \((x - \mu)^T \Sigma^{-1} (x - \mu)\) in the form that works for vectors \(x\) and \(\mu\) and a matrix \(\Sigma\).

The inverse of the covariance matrix \(\Sigma\) is called the precision matrix \(Q\):

\[Q = \Sigma^{-1}\]

For the one-dimensional distribution the precision is the inverse variance, \(Q = 1 / \sigma^2\), i.e. the precision is large if the variance and standard deviation are small.


Note that for a measurement \(\mu \pm \sigma\) where \(sigma\) represents the parameter error, the covariance matrix contains the squared errors \(\Sigma_{ii} = \sigma_i ^2\) on the diagonal, and to obtain the error from the covariance matrix one uses \(\sigma_i = \sqrt{\Sigma_ii}\).

Sometimes the covariance matrix is also called “error matrix”. We avoid that term because it is confusing, given that the matrix doesn’t contain errors, but the squared errors on the diagonal.

Sometimes also the term “variance-covariance matrix” is used instead of just “covariance matrix”. This is accurate: the matrix does contain the parameter variances on the diagonal, and the off-diagonal terms are the covariances.

Compute errors

If we define the “fit statistic” function as minus two times the log likelihood:

\[s(x) = - 2 \log(f(x))\]

then we see that if \(f\) is a multinormal distribution, then \(s\) will be a parabola:

\[s(x) = s_0 + \left(\frac{x - \mu}{\sigma}\right)^2\]

with the best-fit statistic value \(s_0\) of

\[s_0 = TODO\]

The second derivative (called Hessian) \(\frac{d^2s}{dx^2}\) is given by:

\[H = \frac{d^2s}{dx^2}(x) = 2 / \sigma^2\]

So for a given “fit statistic” function, the covariance matrix can be computed via the inverse Hessian:

\[If you define a fit statistic function as :math:`s(x) = - 2 \log(f(x))`, then the Hessian :math:`H = \delta s / \del x \del y` is twir \Sigma = 2 H^{-1}\]


The covariance matrix is

Marginal distribution

The marginal distribution can be obtained with the MultiNorm.marginal method.

You can think of the marginal distribution as the distribution obtained by simply ignoring some of the parameters, or by “projecting” the \(N\)-dimensional distribution onto the lower-dimensional subspace of parameters of interest.

The marginal distribution of the multivariate normal is again a multivariate normal distribution.

It can be obtained simply by keeping only the parameters of interest in the mean vector and covariance matrix (drop the parameters that are being marginalised out).

See here.

Conditional distribution

The conditional distribution can be obtained with the MultiNorm.conditional method.

The conditional distribution is given by the “slice” in the \(N\)-dimensional distribution when fixing some of the parameters.

The conditional distribution of the multivariate normal is again a multivariate normal distribution.

It can be obtained by partitioning the mean \(\mu\) and covariance \(\Sigma\) of the \(N\)-dimensional distribution into two part, corresponding to the parameters that are fixed, and that are kept free.

The formulae to obtain the mean and covariance of the conditional distribution are given here.

Fix parameters

This method is used e.g. in MINUIT, see Section 1.3.1 here:

As far as I can tell, it gives the same results as conditional (see test_conditional_vs_fix).

TODO: work out the math of why that is the case and document it here.

Add note that for MVN the covar matrix for conditional doesn’t depend on parameter values.

TODO: document and make example in the analyse section using iminuit.

Stacked distribution

TODO: document MultiNorm.from_stack

Product distribution

TODO: improve this section:

We should give the full equations, the ones below are the special case for distributions without correlations.

The approximation we will use can be found in many textbooks, e.g. Section 5.6.1 stats book. given \(n\) Gaussian likelihood estimates with parameter estimates \(x_i\) and known parameter errors \(\sigma_i\):

\[p(\mu | {x_i}, {\sigma_i}),\]

if we define “weights” as inverse square of errors

\[w_i = 1 / \sigma_i^2, \sigma_i = 1 / \sqrt{w_i},\]

then the from_product maximum likelihood estimate error is given by (Equation 5.50):

\[\mu_0 = \frac{\sum{w_i x_i}}{\sum{w_i}}\]

and the from_product measurement parameter error is given by

\[w = \sum{w_i}.\]


For the one-dimensional normal distribution \(N(\mu, \sigma)\) the probability content within \(n * \sigma\) is given by roughly 68% for \(n=1\), 95% for \(n=2\) and 99.7% for \(n=3\).

What’s the equivalent for the \(N\)-dimensional normal distribution?

For a given mean \(\mu\) and covariance \(\Sigma\) and point \(p\) one can define a distance \(d\) via

\[d = \sqrt{(p - \mu)^T \Sigma^{-1} (p - \mu)}.\]

The set of equal-distance points is an ellipsoidal surface and has the property that all points on it have equal probability density. It is the equivalent of the distance \(d = (p - \mu) / \sigma\), i.e. the number of standard deviations \(d = n * \sigma\) from the mean.

However, the probability content for a given \(n * \sigma\) is lower for the \(N\)-dimensional distribution. It turns out that \(d^2\) has a \(\chi^2\) distribution with \(N\) degrees of freedom:

\[P(d^2) = \chi^2(d^2, N)\]

That means you can compute the probability content using scipy.stats.chi2 like this:

>>> import numpy as np
>>> from scipy.stats import chi2
>>> n_sigma = np.array([1, 2, 3])
>>> chi2.cdf(n_sigma ** 2, 1)
array([0.68268949, 0.95449974, 0.9973002 ])
>>> chi2.cdf(n_sigma ** 2, 2)
array([0.39346934, 0.86466472, 0.988891  ])
>>> chi2.cdf(n_sigma ** 2, 3)
array([0.19874804, 0.73853587, 0.97070911])

Note that the 1 sigma ellipse in 2D has probability content 39%, in 3D it’s only 20%, and it gets smaller and smaller for higher dimensions.

Also see

For further information see the Wikipedia Mahalanobis distance page.

The MultiNorm.to_matplotlib_ellipse takes an n_sigma option, and will return an ellipse that matches the points with Mahalanobis distance \(d^2 = n * \sigma\).

See also sigma in the docs.