7.8. Gaussian Process Classification

Gaussian processes are not only used for regression but also for classification problems. Gaussian process classification uses a latent regression model to predict class probabilities. We first discuss binary classification and afterwards generalize the setting to multi-class classification.

7.8.1. Binary Classification

In binary classification the data \(\mathcal{D}\) is of the form

\[\mathcal{D} = \{ (x_i, y_i)~|~x_i \in \mathbb{R}^d, y_i \in \{-1, 1\} \quad \text{for } i=1,\dots,n \},\]

i.e., the (discrete) label takes either the value \(-1\) or \(1\), representing the two possible classes. The sample matrix \(X\) and the labels \(y\) are constructed as before.

For a test point \(x^*\), the aim is to predict the class probabilities of the output \(y^*\), i.e., the probabilities \(p(y^*=1~|~X, y, x^*)\) and \(p(y^*=-1~|~X, y, x^*)\). In this way, we obtain one discrete probability distribution on \(\{-1, 1\}\) for each \(x^*\). Since

\[p(y^*=-1~|~X, y, x^*) = 1 - p(y^*=1~|~X, y, x^*),\]

it is sufficient to focus on \(p(y^*=1~|~X, y, x^*)\). In other words, we are looking for a model which returns, for a given input \(x^*\), the probability that the corresponding label is \(1\). For example, \(x^*\) could be an image showing either a cat or a dog. The model then quantifies the probability for dog, and the complementary probability is the probability for cat. Ideally, these values are close to 0 or 1, since values near 0.5 indicate that the model has difficulties classifying the input.

The key tool for turning real numbers into probabilities is the logistic response function: it holds \(\sigma(z) \in (0,1)\), where

(7.4)\[\sigma(z) := \frac{1}{1 + \exp(-z)} \quad \text{for } z \in \mathbb{R}\]

is the logistic response function.

%matplotlib inline
import numpy as np
from scipy.special import expit  # numerically stable implementation of the logistic sigmoid
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')

# plot the logistic response function sigma(z) = 1 / (1 + exp(-z)) on [-8, 8]
fig = plt.figure(figsize=(14, 10))
x = np.linspace(-8, 8, num=100)
plt.plot(x, expit(x), label='logistic sigmoid')
leg = plt.legend(loc='lower right')

Hence, the value of \(\sigma(z)\) can be interpreted as a probability. It is also possible to use other response functions such as the Gaussian cdf, but we restrict ourselves to \(\sigma\).

Our aim is to construct the distribution of a latent variable \(f^*\) given the data \(\mathcal{D}\) and a test point \(x^*\) (we denote its pdf by \(p(f^*~|~X, y, x^*)\)) such that the class probability is obtained as the expectation of \(p(y^*=1~|~f^*) := \sigma(f^*)\):

(7.5)\[\begin{split} p(y^*=1~|~X, y, x^*) &= \int_{\mathbb{R}} ~ p(y^*=1~|~f^*)~p(f^*~|~X, y, x^*)~df^* \\ &= \int_{\mathbb{R}} ~ \sigma(f^*)~p(f^*~|~X, y, x^*)~df^*. \end{split}\]

Please note that \(p(y^*=-1~|~f^*) = 1 - p(y^*=1~|~f^*) = 1 - \sigma(f^*) = \sigma(-f^*)\) due to the symmetry of the logistic response function. Thus, the compact notation \(p(y^*~|~f^*) = \sigma(y^* f^*)\) is also used. Moreover, we would like to mention that \(f^* < 0\) suggests that \(y^* = -1\) is more likely, whereas \(f^* > 0\) implies that \(y^* = 1\) is more likely.
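
As a quick numerical check of this symmetry (a minimal sketch using scipy.special.expit, which implements \(\sigma\)):

import numpy as np
from scipy.special import expit   # logistic response function sigma

z = np.linspace(-6.0, 6.0, 13)
# symmetry sigma(-z) = 1 - sigma(z), hence p(y | f) = sigma(y f) for y in {-1, 1}
assert np.allclose(1.0 - expit(z), expit(-z))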

How can we construct \(p(f^*~|~X, y, x^*)\) in (7.5) by means of Gaussian process regression?

The answer to this question is rather technical: the logistic likelihood renders the posterior of the latent variables non-Gaussian, and it is therefore replaced by a Gaussian approximation, the so-called Laplace approximation (the interested reader is referred to Section 3.4 in [1] for the details).

In applications, the computational procedure can be summarized as follows (a code sketch of steps 2.-4. follows the list):

  1. Choose a Gaussian process with hyperparameters \(\theta\) as prior for the latent labels \(f\)

  2. For fixed \(\theta\) set \(\Psi(f) := \ln \big(p(y~|~f) ~ p(f~|~X, \theta)\big)\)

  3. Compute the root \(\hat{f}\) of \(\nabla \Psi\) (which depends on \(\theta\)) using Newton’s method

  4. Calculate the log marginal likelihood by

    \[\begin{split}\ln p(y~|~X, \theta) &\approx \ln \big(p(y~|~\hat{f})\big) - \frac{1}{2} \hat{f}^T K_{\theta}(X,X)^{-1} \hat{f} - \frac{1}{2} \ln \big(|K_{\theta}(X,X)|\big) + \frac{1}{2} \ln\big(|\Sigma_{\hat{f}}|\big)\\ &=\ln \big(p(y~|~\hat{f})\big) - \frac{1}{2} \hat{f}^T K_{\theta}(X,X)^{-1} \hat{f} - \frac{1}{2} \ln \big(|I_n + W(\hat{f})^{\frac{1}{2}} K_{\theta}(X,X) W(\hat{f})^{\frac{1}{2}}|\big),\end{split}\]

    where \(W(\hat{f}) = - \nabla^2 \ln \big(p(y~|~\hat{f})\big)\), \(\Sigma_{\hat{f}} = \big(W(\hat{f}) + K_{\theta}(X, X)^{-1}\big)^{-1}\) and

    \[\ln \big(p(y~|~\hat{f})\big) = - \sum_{i=1}^n \ln \big(1 + \exp(-y_i\hat{f}_i)\big)\]
  5. Maximize \(\ln \big(p(y~|~X, \theta)\big)\) with respect to \(\theta\) (which means applying an optimization algorithm and looping over steps 2.-4.)
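
To make steps 2.-4. concrete, the following minimal NumPy/SciPy sketch computes \(\hat{f}\) by Newton’s method and evaluates the approximate log marginal likelihood for a fixed \(\theta\). The squared exponential kernel sqexp_kernel, its length scale, the jitter term and the toy data are assumptions made purely for illustration and are not prescribed by the text above.

import numpy as np
from scipy.special import expit

def sqexp_kernel(A, B, length_scale=1.0):
    # squared exponential kernel matrix; an assumed choice of covariance function
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dists / length_scale**2)

def laplace_mode_and_evidence(K, y, n_iter=30):
    # steps 2.-4. for fixed theta: Newton iteration for the mode f_hat of Psi
    # and the approximate log marginal likelihood ln p(y | X, theta)
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = expit(f)                      # sigma(f_i)
        W = pi * (1.0 - pi)                # diagonal of W(f)
        grad = (y + 1) / 2 - pi            # gradient of ln p(y | f)
        # Newton step: f <- (W + K^{-1})^{-1} (W f + grad) = (I + K W)^{-1} K (W f + grad)
        f = np.linalg.solve(np.eye(n) + K * W, K @ (W * f + grad))
    pi = expit(f)
    W = pi * (1.0 - pi)
    grad = (y + 1) / 2 - pi                # at the mode, K^{-1} f_hat equals this gradient
    log_lik = -np.sum(np.log1p(np.exp(-y * f)))                      # ln p(y | f_hat)
    B = np.eye(n) + np.sqrt(W)[:, None] * K * np.sqrt(W)[None, :]    # I + W^{1/2} K W^{1/2}
    return f, log_lik - 0.5 * f @ grad - 0.5 * np.linalg.slogdet(B)[1]

# toy data, assumed purely for illustration: 1-d inputs with labels in {-1, 1}
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = np.where(X.ravel() > 0, 1, -1)
K = sqexp_kernel(X, X) + 1e-8 * np.eye(len(y))    # small jitter for numerical stability
f_hat, log_marginal = laplace_mode_and_evidence(K, y)
print(log_marginal)

Note that the sketch exploits the fact that \(\nabla \Psi(\hat{f}) = 0\) implies \(K_{\theta}(X,X)^{-1}\hat{f} = \nabla \ln \big(p(y~|~\hat{f})\big)\), so \(K_{\theta}(X,X)^{-1}\) is never formed explicitly.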

Then, the class probability for a test point \(x^*\) in (7.5) is given by the one-dimensional integral

\[\begin{split}p(y^*=1~|~X, y, x^*) &= \int_{\mathbb{R}} ~ \sigma(f^*)~p(f^*~|~X, y, x^*)~df^* \\ &\approx \int_{\mathbb{R}} ~ \sigma(f^*)~q(f^*~|~X, y, x^*)~df^*,\end{split}\]

where \(q(f^*~|~X, y, x^*)\) is the density of a normal distribution with mean

\[\mathbb{E}(f^*~|~X, y, x^*) = K(x^*, X) K(X, X)^{-1} \hat{f}\]

and variance

\[\text{var}(f^*~|~X, y, x^*) = k(x^*, x^*) - K(x^*, X)\big(W(\hat{f})^{-1} + K(X,X)\big)^{-1} K(X, x^*)\]

Thus, the assigned class probability is the expectation of \(\sigma\) with respect to the (approximate) posterior distribution of the latent variable \(f^*\). If we are not interested in the precise value of the probability, but only in the favored class, it is sufficient to consider the sign of \(\mathbb{E}(f^*~|~X, y, x^*)\): for negative values the predicted class is \(-1\) and for positive values the predicted class is \(1\). This is equivalent to computing \(\sigma(\mathbb{E}(f^*~|~X, y, x^*))\) and assigning class \(-1\) for values less than \(0.5\) and class \(1\) for values larger than \(0.5\).
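
In practice, this procedure does not have to be implemented by hand. As a minimal usage sketch (assuming scikit-learn is available), the class GaussianProcessClassifier from sklearn.gaussian_process implements exactly this scheme, i.e., the Laplace approximation with the logistic response function, and tunes the hyperparameters by maximizing the log marginal likelihood during fitting; the toy data below is chosen purely for illustration:

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# toy data, assumed purely for illustration: 1-d inputs with two classes
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = np.where(X.ravel() > 0, 1, -1)

# Laplace approximation with logistic link; kernel hyperparameters are tuned
# by maximizing the log marginal likelihood during fit
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gpc.fit(X, y)

X_test = np.array([[-1.0], [0.0], [2.0]])
print(gpc.predict_proba(X_test))   # class probabilities, columns ordered as in gpc.classes_
print(gpc.predict(X_test))         # class with the higher probability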

7.8.2. Multi-class Classification

Recall that in binary classification the data \(\mathcal{D}\) is of the form

\[\mathcal{D} = \{ (x_i, y_i)~|~x_i \in \mathbb{R}^d, y_i \in \{-1, 1\} \quad \text{for } i=1,\dots,n \},\]

i.e., the (discrete) label takes either the value \(-1\) or \(1\), representing the two possible classes. In the multi-class setting the data is of the form

\[\mathcal{D} = \{ (x_i, y_i)~|~x_i \in \mathbb{R}^d, y_i \in \{C_1, \dots, C_k \} \quad \text{for } i=1,\dots,n \},\]

where \(C_1, \dots, C_k\) represent the \(k\) different classes.

Similarly to the multi-output case in Gaussian process regression, several different approaches exist for multi-class classification. First of all, the Laplace approximation discussed in Binary Classification can be generalized to a multi-class Laplace approximation (refer to Section 3.5 in [1]). Moreover, there exist general strategies to extend binary classifiers to multi-class tasks (a usage sketch follows the list below):

  • One-vs-Rest / One-vs-All:

    • For each class \(C_j\), a binary classifier is trained on binary labels in which \(C_j\) is encoded as \(1\) and all remaining classes are encoded as \(-1\) (resulting in \(k\) classifiers)

    • For a test point, each classifier predicts the probability of its class \(C_j\) and the class with the highest probability is assigned

  • One-vs-One:

    • A binary problem is created for each pair of distinct classes \(C_j\) and \(C_l\), \(j \ne l\), and a binary classifier is fitted (resulting in \(\frac{k~(k-1)}{2}\) classifiers)

    • Each model predicts a class probability; the class with the most votes or, alternatively, the class with the highest sum of probabilities is assigned
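
Both strategies are available off the shelf: assuming scikit-learn, GaussianProcessClassifier accepts the parameter multi_class='one_vs_rest' (the default) or multi_class='one_vs_one', fitting one binary latent Gaussian process per class or per pair of classes, respectively. The iris data below serves purely as an illustrative example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# toy multi-class data (assumed for illustration): the iris dataset with k = 3 classes
X, y = load_iris(return_X_y=True)

# one-vs-rest: one binary latent GP per class (k classifiers)
gpc_ovr = GaussianProcessClassifier(kernel=1.0 * RBF(), multi_class='one_vs_rest').fit(X, y)
print(gpc_ovr.predict_proba(X[:3]))   # one probability per class for the first three samples

# one-vs-one: one binary latent GP per pair of classes (k (k-1) / 2 classifiers)
gpc_ovo = GaussianProcessClassifier(kernel=1.0 * RBF(), multi_class='one_vs_one').fit(X, y)
print(gpc_ovo.predict(X[:3]))         # predicted classes via voting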

[1]

C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2nd edition, 2006. URL: http://www.gaussianprocess.org/gpml/.