1.5. Essential Theorems
1.5.1. Properties of the Normal Distribution
Lemma
Let \(X \sim \mathcal{N}(\mu_1, \Sigma_1)\) be an \(\mathbb{R}^{d_1}\)-valued random variable, let \(\mu_2 \in \mathbb{R}^{d_2}\), and let \(A \in \mathbb{R}^{d_2 \times d_1}\) be a matrix with full rank. Then,

\[ \mu_2 + AX \sim \mathcal{N}\big(\mu_2 + A\mu_1, ~ A \Sigma_1 A^T\big). \]
In particular, \(\mu_2 + AX \sim \mathcal{N}(\mu_2, AA^T)\) if \(X\) is standard normally distributed.
As a consequence, each normally distributed random variable \(X \sim \mathcal{N}(\mu, \Sigma)\) can be written as a linear transformation of a standard normally distributed random variable \(Z \sim \mathcal{N}(0, I)\):

\[ X = \mu + AZ, \]
where \(A\) is the matrix root of \(\Sigma\), i.e., \(\Sigma = AA^T\).
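This representation is exactly how samples from \(\mathcal{N}(\mu, \Sigma)\) are generated in practice. Below is a minimal NumPy sketch (the values of \(\mu\) and \(\Sigma\) are chosen purely for illustration): it computes a Cholesky factor \(A\) with \(\Sigma = AA^T\), transforms standard normal samples via \(X = \mu + AZ\), and checks the empirical mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution N(mu, Sigma) (illustrative values)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Matrix root A with Sigma = A A^T via the Cholesky factorization
A = np.linalg.cholesky(Sigma)

# Draw standard normal samples Z and transform: X = mu + A Z
Z = rng.standard_normal(size=(100_000, 2))
X = mu + Z @ A.T

print("empirical mean      :", X.mean(axis=0))
print("empirical covariance:\n", np.cov(X, rowvar=False))
```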
In many cases, the distribution of a sum of random variables is not known explicitly. Luckily, independent normally distributed random variables behave nicely in this regard:
Lemma
Let \(X_1 \sim \mathcal{N}(\mu_1, \Sigma_1)\) and \(X_2 \sim \mathcal{N}(\mu_2, \Sigma_2)\) be \(\mathbb{R}^{d}\)-valued independent normally distributed random variables. Then, the sum \(X_1 + X_2\) is also normally distributed with

\[ X_1 + X_2 \sim \mathcal{N}\big(\mu_1 + \mu_2, ~ \Sigma_1 + \Sigma_2\big). \]
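As a quick sanity check, the following sketch (with illustrative parameters) draws samples of two independent normal vectors and verifies empirically that their sum has mean \(\mu_1 + \mu_2\) and covariance \(\Sigma_1 + \Sigma_2\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent normal random vectors (illustrative parameters)
mu1, Sigma1 = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
mu2, Sigma2 = np.array([2.0, -1.0]), np.array([[0.3, 0.0], [0.0, 2.0]])

X1 = rng.multivariate_normal(mu1, Sigma1, size=200_000)
X2 = rng.multivariate_normal(mu2, Sigma2, size=200_000)
S = X1 + X2

print("mean (should be mu1 + mu2)     :", S.mean(axis=0))
print("cov  (should be Sigma1 + Sigma2):\n", np.cov(S, rowvar=False))
```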
If \(X \sim \mathcal{N}(\mu, \Sigma)\) is a multivariate normally distributed random vector, the marginal distribution of some subvector is again a normal distribution and can simply be obtained by restriction of the mean and covariance to the relevant components. For example, for a \(3\)-dimensional random vector

\[ X = (X_1, X_2, X_3)^T \sim \mathcal{N}(\mu, \Sigma) \]

with

\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}, \]

the subvector

\[ (X_1, X_3)^T \]

is again normally distributed with mean and covariance given by

\[ \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} \sigma_{11} & \sigma_{13} \\ \sigma_{31} & \sigma_{33} \end{pmatrix}. \]
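In code, this restriction amounts to plain indexing of the mean vector and the covariance matrix. The following NumPy sketch (with illustrative values) extracts the marginal parameters of the subvector \((X_1, X_3)^T\).

```python
import numpy as np

# Mean and covariance of a 3-dimensional normal vector (illustrative values)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.1],
                  [0.3, 0.1, 1.5]])

# Marginal distribution of (X_1, X_3): restrict mu and Sigma to these components
idx = [0, 2]                      # 0-based indices of the components of interest
mu_marg = mu[idx]
Sigma_marg = Sigma[np.ix_(idx, idx)]

print("marginal mean      :", mu_marg)
print("marginal covariance:\n", Sigma_marg)
```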
Due to the definition of the pdf of a multivariate normal distribution and the properties of the exponential function, we get the following lemma. We restrict ourselves to the bivariate case, but in view of the preceding considerations the corresponding result also holds for pairwise independence of general multivariate normal distributions.
Lemma
Let \(X = (X_1, X_2) \sim \mathcal{N}(\mu, \Sigma)\) be a random vector such that \(X_1\) and \(X_2\) are uncorrelated (i.e., \(\Sigma\) is a diagonal matrix). Then, \(X_1\) and \(X_2\) are independent random variables.
In Independence we mentioned that uncorrelated random variables are not necessarily independent, but the preceding lemma shows that this is different for normally distributed random variables, provided that the joint distribution is also Gaussian. If \(X_1\) and \(X_2\) are each normally distributed but the joint vector \(X\) is not, the statement does not hold in general!
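The following small simulation illustrates the caveat. The construction \(X_2 = S \cdot X_1\) with an independent random sign \(S\) is a standard counterexample (not taken from the text): both marginals are standard normal and the correlation is zero, yet the variables are clearly dependent because \(|X_2| = |X_1|\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# X1 standard normal, X2 = S * X1 with an independent random sign S.
# Both marginals are N(0, 1) and the correlation is 0, but (X1, X2) is
# not jointly Gaussian and the variables are dependent.
X1 = rng.standard_normal(n)
S = rng.choice([-1.0, 1.0], size=n)
X2 = S * X1

print("correlation            :", np.corrcoef(X1, X2)[0, 1])   # close to 0
print("P(|X2| > 1)            :", np.mean(np.abs(X2) > 1))      # about 0.317
print("P(|X2| > 1 | |X1| > 1) :", np.mean(np.abs(X2[np.abs(X1) > 1]) > 1))  # equals 1
```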
In view of the subsequent applications to Gaussian process regression, the following result will be very useful:
Lemma
Let \(X \sim \mathcal{N}(\mu, \Sigma)\) be \(d\)-dimensional and consider a partition of \(X\) into two subvectors

\[ X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \]

where \(X_1\) is \(\mathbb{R}^{d_1}\)-valued and \(X_2\) is \(\mathbb{R}^{d_2}\)-valued such that \(d_1 + d_2 = d\). Accordingly, the mean and covariance are partitioned as follows

\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \]

and

\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \]

Then, the conditional distribution density of \(X_1\) given \(X_2 = x_2\) is the density of a normal distribution with mean and covariance given by

\[ \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2) \]

and

\[ \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}. \]
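Since these two formulas are the computational core of Gaussian process regression, a small helper function is useful. The sketch below (function name and example values are my own) computes the conditional mean and covariance for an arbitrary partition, using linear solves instead of an explicit inverse of \(\Sigma_{22}\).

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx1, idx2, x2):
    """Mean and covariance of X_1 | X_2 = x2 for X ~ N(mu, Sigma)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    # Solve with Sigma_22 instead of inverting it (numerically more stable)
    w = np.linalg.solve(S22, x2 - mu2)        # Sigma_22^{-1} (x2 - mu2)
    K = np.linalg.solve(S22, S12.T).T         # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + S12 @ w
    Sigma_cond = S11 - K @ S12.T
    return mu_cond, Sigma_cond

# Example (illustrative values): condition X_1 on (X_2, X_3)
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.5, 0.3],
                  [0.2, 0.3, 2.0]])
print(conditional_gaussian(mu, Sigma, idx1=[0], idx2=[1, 2], x2=np.array([0.5, 1.0])))
```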
For the interested reader, we show the result for the bivariate case \(d=2\) and \(d_1=d_2=1\).
Proof.
Let the mean vector be given by

\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \]

and the covariance matrix by

\[ \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}, \]
where \(\sigma_1^2\) is the variance of \(X_1\), \(\sigma_2^2\) is the variance of \(X_2\) and \(\sigma_{12}\) is the covariance of \(X_1\) and \(X_2\).
Note that \(\sigma_{12} = \rho \sigma_1 \sigma_2\), where \(\rho\) is the correlation of \(X_1\) and \(X_2\) (refer to the section Random Variables). Hence, the aim is to show that the conditional distribution is normally distributed with mean

\[ \mu_1 + \rho \frac{\sigma_1}{\sigma_2} \big(x_2 - \mu_2\big) \]

and variance

\[ (1 - \rho^2) \sigma_1^2. \]
According to the definition, the conditional density of \(X_1\) given \(X_2 = x_2\) is given by

\[ f_{X_1 | X_2 = x_2}(x_1) = \frac{f_X(x_1, x_2)}{f_{X_2}(x_2)}, \]

where \(f_X\) denotes the joint density of \(X\) and \(f_{X_2}\) the density of \(X_2\).
Note that \(X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)\) and thus,

\[ f_{X_2}(x_2) = \frac{1}{\sqrt{2\pi \sigma_2^2}} \exp\left(-\frac{(x_2 - \mu_2)^2}{2 \sigma_2^2}\right). \]
Moreover, it holds by assumption

\[ f_X(x_1, x_2) = \frac{1}{2\pi \sqrt{|\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right), \qquad x = (x_1, x_2)^T. \]
With \(|\Sigma| = \sigma_1^2 \sigma_2^2 - \sigma_{12}^2 = (1 - \rho^2) \sigma_1^2 \sigma_2^2\) and

\[ \Sigma^{-1} = \frac{1}{|\Sigma|} \begin{pmatrix} \sigma_2^2 & -\sigma_{12} \\ -\sigma_{12} & \sigma_1^2 \end{pmatrix}, \]
it follows that

\[ (x - \mu)^T \Sigma^{-1} (x - \mu) = \frac{1}{1 - \rho^2} \left( \frac{(x_1 - \mu_1)^2}{\sigma_1^2} - \frac{2 \rho (x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1 \sigma_2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} \right). \]
Consequently, dividing \(f_X(x_1, x_2)\) by \(f_{X_2}(x_2)\) and completing the square in \(x_1\), it holds

\[ f_{X_1 | X_2 = x_2}(x_1) = \frac{1}{\sqrt{2\pi (1 - \rho^2) \sigma_1^2}} \exp\left(-\frac{\big(x_1 - \mu_1 - \rho \frac{\sigma_1}{\sigma_2} (x_2 - \mu_2)\big)^2}{2 (1 - \rho^2) \sigma_1^2}\right), \]
i.e. the conditional distribution is a normal distribution with mean \(\mu_1 + \rho \frac{\sigma_1}{\sigma_2} \big(x_2 - \mu_2 \big)\) and variance \((1 - \rho^2) \sigma_1^2\).
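The derived formula can also be checked by simulation: draw many samples of the bivariate normal, keep only those whose second component lies close to \(x_2\), and compare the empirical conditional mean and variance with the derived expressions. A rough Monte Carlo sketch (parameters chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

mu1, mu2 = 1.0, -0.5
sigma1, sigma2, rho = 1.5, 0.8, 0.6
Sigma = np.array([[sigma1**2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])

X = rng.multivariate_normal([mu1, mu2], Sigma, size=1_000_000)

# Condition on X2 being (approximately) equal to x2
x2 = 0.3
mask = np.abs(X[:, 1] - x2) < 0.02
x1_given_x2 = X[mask, 0]

print("empirical conditional mean        :", x1_given_x2.mean())
print("formula mu1 + rho*s1/s2*(x2 - mu2):", mu1 + rho * sigma1 / sigma2 * (x2 - mu2))
print("empirical conditional variance    :", x1_given_x2.var())
print("formula (1 - rho^2) * s1^2        :", (1 - rho**2) * sigma1**2)
```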
1.5.2. Law of Large Numbers
The law of large numbers has multiple (strong and weak) versions and is of particular importance in statistics, since it justifies, for example, the estimation of the expectation of random variables in terms of sample means. In this section, we state one version of the strong law of large numbers:
Theorem
Let \(X_1, X_2, \dots\) be a sequence of independent identically distributed random variables with expectation \(\mu\). Then, there exists an event \(N\) of probability zero such that

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i(\omega) = \mu \quad \text{for all } \omega \notin N, \]

or equivalently

\[ P\left( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i = \mu \right) = 1, \]
i.e., the average converges almost surely to the expectation.
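A short simulation makes the statement tangible. The sketch below (using an exponential distribution with expectation \(\mu = 2\) as an arbitrary example) prints the running average for increasing \(n\); it approaches \(\mu\).

```python
import numpy as np

rng = np.random.default_rng(4)

# i.i.d. samples from an exponential distribution with expectation mu = 2
mu = 2.0
X = rng.exponential(scale=mu, size=1_000_000)

# The running average (1/n) * sum_{i<=n} X_i converges to mu almost surely
for n in [10, 100, 1_000, 10_000, 100_000, 1_000_000]:
    print(f"n = {n:>9}:  average = {X[:n].mean():.4f}   (mu = {mu})")
```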
1.5.3. Central Limit Theorem
The central limit theorem makes the normal distribution particularly important, since it can be considered as the limit of the average of “nice” i.i.d. random variables. Similarly to the law of large numbers, the theorem exists in several versions. In this section, we state the Lindeberg-Lévy central limit theorem:
Theorem
Let \(X_1, X_2, \dots\) be a sequence of independent identically distributed random variables with expectation \(\mu\) and variance \(0 < \sigma^2 < \infty\). Set

\[ Z_n := \frac{\sum_{i=1}^n X_i - n \mu}{\sigma \sqrt{n}}. \]

Then, it holds

\[ \lim_{n \to \infty} P(Z_n \le x) = \Phi(x) \quad \text{for all } x \in \mathbb{R}, \]
where \(\Phi\) denotes the cumulative distribution function of the standard normal distribution.
Briefly speaking, the cumulative distribution function of the standardized average \(Z_n\) converges pointwise to the cumulative distribution function of the standard normal distribution \(\mathcal{N}(0, 1)\). By definition, this means that \(Z_n\) converges in distribution to the standard normal distribution. The definition of \(Z_n\) might seem confusing, but it simply scales the average \(\frac{1}{n} ~\sum_{i=1}^n X_i\) such that its mean is \(0\) and its variance is \(1\) (in accordance with \(\mathcal{N}(0, 1)\)).
This result is truly remarkable, since it is independent of the underlying distribution of the random variables \(X_i\), \(i \in \mathbb{N}\), which could be totally different from a normal distribution and could even be a discrete distribution.
The arithmetic average fulfills

\[ \frac{1}{n} \sum_{i=1}^n X_i = \mu + \frac{\sigma}{\sqrt{n}} Z_n. \]

Thus, the average can be approximated by \(\mathcal{N}(\mu, \frac{\sigma^2}{n})\) for sufficiently large \(n\). If the distribution of the variables \(X_1, X_2, \dots\) is \(\mathcal{N}(\mu, \sigma^2)\), it indeed holds that \(\frac{1}{n} ~\sum_{i=1}^n X_i \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})\).
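The following sketch illustrates the theorem for a markedly non-normal distribution (an exponential distribution with \(\mu = \sigma = 1\), chosen for illustration): it simulates many realizations of \(Z_n\) and compares their empirical cumulative distribution function with \(\Phi\) at a few points.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

# Markedly non-normal distribution: exponential with mean 1 and variance 1
mu, sigma = 1.0, 1.0
n, repetitions = 200, 20_000

# Standardized averages Z_n = (sum_i X_i - n*mu) / (sigma * sqrt(n))
X = rng.exponential(scale=1.0, size=(repetitions, n))
Z = (X.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

# Compare the empirical CDF of Z_n with Phi at a few points
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"x = {x:+.1f}:  P(Z_n <= x) ~ {np.mean(Z <= x):.4f}   Phi(x) = {norm.cdf(x):.4f}")
```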
1.5.4. Bayes’ Theorem
From a mathematical standpoint, Bayes’ theorem is rather simple, since the statement follows directly from the definition of conditional probabilities. Nevertheless, it has a very important interpretation which is the foundation of Bayesian inference.
Theorem
Let \((\Omega, \mathcal{F}, P)\) be a probability space and \(A, B \in \mathcal{F}\) with \(P(B) > 0\). Then

\[ P(A~|~B) = \frac{P(B~|~A)~P(A)}{P(B)}. \]
Note that \(P(B~|~A)\) is not well-defined if \(P(A) = 0\), but in this case it holds \(P(A~|~B) = 0\), and the right-hand side can also be regarded as \(0\), since \(P(A) = 0\) and the (ill-defined) \(P(B~|~A)\) should be between \(0\) and \(1\).
The events \(A\) and \(B\) are often denoted by \(H\) and \(E\), respectively, where \(H\) denotes the hypothesis and \(E\) denotes the evidence. Hence, Bayes’ theorem states a way to calculate the probability of some hypothesis \(H\) given some data (the evidence) \(E\). Using the law of total probability stated in Conditional Probability, Bayes’ theorem reads

\[ P(H~|~E) = \frac{P(E~|~H)~P(H)}{P(E)} = \frac{P(E~|~H)~P(H)}{P(E~|~H)~P(H) + P(E~|~H^c)~P(H^c)}. \]
\(P(H)\) is called the prior probability of the hypothesis, \(P(E)\) the marginal probability, \(P(E~|~H)\) the likelihood of the evidence given the hypothesis and \(P(H~|~E)\) the posterior probability of the hypothesis given the evidence.
A well-known example for the application of Bayes’ theorem is an antigen test for a SARS-CoV-2 coronavirus infection. In this setting, the hypothesis \(H\) is that some tested person is indeed infected and the evidence \(E\) is given by a positive antigen test. \(H^c\) is the complementary event of \(H\), i.e., the person is not infected. In order to apply Bayes’ theorem, we use the following information:
the sensitivity \(P(\text{positive}~|~\text{infected})\) of the test is \(96.5\%\)
the specificity \(P(\text{negative}~|~\text{not infected})\) of the test is \(99.7\%\)
the prior probability \(P(\text{infected})\) of an infection is \(0.1\%\) which corresponds to 100 infected persons per 100000 inhabitants
Bayes’ theorem yields

\[ P(\text{infected}~|~\text{positive}) = \frac{0.965 \cdot 0.001}{0.965 \cdot 0.001 + 0.003 \cdot 0.999} \approx 0.244, \]

i.e., despite the positive test, the probability of an actual infection is only about \(24.4\%\), since the prior probability of an infection is very small.
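The same computation as a few lines of Python (the numbers are exactly those given above):

```python
# Posterior probability of an infection given a positive antigen test
sensitivity = 0.965          # P(positive | infected)
specificity = 0.997          # P(negative | not infected)
prior = 0.001                # P(infected)

false_positive_rate = 1.0 - specificity                               # P(positive | not infected)
evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)  # P(positive)

posterior = sensitivity * prior / evidence                            # P(infected | positive)
print(f"P(infected | positive) = {posterior:.4f}")                    # roughly 0.24
```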
Bayes’ theorem can also be formulated in terms of conditional distributions:
Let \(X\) and \(Y\) be two continuous random variables with joint density \(f_{X, Y}\). Then, it holds

\[ f_{X | Y = y}(x) = \frac{f_{Y | X = x}(y)~f_X(x)}{f_Y(y)} \]

for all \(y\) with \(f_Y(y) > 0\).
More informally, this can be expressed as

\[ p(x~|~y) = \frac{p(y~|~x)~p(x)}{p(y)}, \]
where \(p\) is a shorthand notation for some probability density in analogy to the elementary probabilities in the case of discrete distributions.
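Numerically, this density version of Bayes’ theorem can be applied on a grid: evaluate prior and likelihood, multiply pointwise, and normalize. The toy model below (prior \(X \sim \mathcal{N}(0, 1)\), likelihood \(Y~|~X = x \sim \mathcal{N}(x, 0.5^2)\), observed value \(y = 1.2\)) is an assumption for illustration, not taken from the text.

```python
import numpy as np
from scipy.stats import norm

# Numerical illustration of  p(x | y) = p(y | x) p(x) / p(y)  on a grid.
# Assumed toy model (not from the text): prior X ~ N(0, 1),
# likelihood Y | X = x ~ N(x, 0.5^2), observed value y = 1.2.
x_grid = np.linspace(-5.0, 5.0, 2001)
dx = x_grid[1] - x_grid[0]

prior = norm.pdf(x_grid, loc=0.0, scale=1.0)         # p(x)
likelihood = norm.pdf(1.2, loc=x_grid, scale=0.5)    # p(y | x) as a function of x

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dx)  # normalizing constant = p(y)

print("posterior mean:", (x_grid * posterior).sum() * dx)
# For this toy model the exact posterior mean is 1.2 / (1 + 0.25) = 0.96
```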