Discriminant analysis
Published on 2019-01-03


In this post I talk about the theory and implementation of linear and quadratic discriminant analysis, classical methods in statistical learning.

Acknowledgement. Various sources were of great help to my understanding of the subject, including Chapter 4 of The Elements of Statistical Learning, Stanford CS229 Lecture notes, and the scikit-learn code. Research was done while working at KTH mathematics department.

If you are reading on a mobile device, you may need to "request desktop site" for the equations to be properly displayed. This post is licensed under CC BY-SA and GNU FDL.

Theory

Quadratic discriminant analysis (QDA) is a classical classification algorithm. It assumes that the data is generated by Gaussian distributions, where each class has its own mean and covariance.

\[(x | y = i) \sim N(\mu_i, \Sigma_i).\]

It also assumes a categorical class prior:

\[\mathbb P(y = i) = \pi_i\]

The log of the posterior probability is thus

\[\begin{aligned} \log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) + \log \mathbb P(y = i) + C\\ &= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0) \end{aligned}\]

where \(C\) and \(C'\) are constants.

Thus the prediction is done by taking the argmax of the above formula over \(i\).
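For concreteness, here is a minimal sketch of this prediction step in NumPy, assuming the parameters have already been estimated; the names `mus`, `Sigmas` and `pis` are my own, not taken from any particular implementation:

```python
import numpy as np

def qda_log_posterior(x, mus, Sigmas, pis):
    """Evaluate (0), up to the constant C', for a single sample x.

    mus:    (n_c, n) array of class means mu_i
    Sigmas: list of n_c covariance matrices Sigma_i, each (n, n)
    pis:    (n_c,) array of class priors pi_i
    """
    scores = []
    for mu, Sigma, pi in zip(mus, Sigmas, pis):
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
        scores.append(-0.5 * logdet - 0.5 * quad + np.log(pi))
    return np.array(scores)

def qda_predict(x, mus, Sigmas, pis):
    # the prediction is the argmax of the log posterior over classes
    return int(np.argmax(qda_log_posterior(x, mus, Sigmas, pis)))
```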

In training, let \(X\), \(y\) be the input data, where \(X\) is of shape \(m \times n\), and \(y\) of shape \(m\). We adopt the convention that each row of \(X\) is a sample \(x^{(i)T}\). So there are \(m\) samples and \(n\) features. Let \(m_i = \#\{j: y_j = i\}\) be the number of samples in class \(i\), and let \(n_c\) be the number of classes.

We estimate \(\mu_i\) by the sample means, and \(\pi_i\) by the frequencies:

\[\begin{aligned} \mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\ \pi_i &:= \mathbb P(y = i) = {m_i \over m}. \end{aligned}\]
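In NumPy this estimation could look like the following sketch, where I assume the class labels are the integers \(0, \ldots, n_c - 1\):

```python
import numpy as np

def estimate_means_and_priors(X, y):
    """Estimate the class means mu_i and the priors pi_i.

    X: (m, n) array, one sample per row
    y: (m,) array of integer labels in {0, ..., n_c - 1}
    """
    m = X.shape[0]
    n_c = y.max() + 1
    mus = np.stack([X[y == i].mean(axis=0) for i in range(n_c)])  # (n_c, n)
    pis = np.bincount(y, minlength=n_c) / m                       # (n_c,)
    return mus, pis
```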

Linear discriminant analysis (LDA) is a specialisation of QDA: it assumes all classes share the same covariance, i.e. \(\Sigma_i = \Sigma\) for all \(i\).

Gaussian Naive Bayes is a different specialisation of QDA: it assumes that all \(\Sigma_i\) are diagonal, which corresponds to assuming that the features are independent within each class.

QDA

We look at QDA.

We estimate \(\Sigma_i\) by the in-class sample covariance:

\[\begin{aligned} \Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T}. \end{aligned}\]

where \(\hat x^{(j)} = x^{(j)} - \mu_{y_j}\) are the centred \(x^{(j)}\). Plugging this into (0) we are done.
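Continuing the sketch above, the in-class covariance estimate could be computed like this:

```python
import numpy as np

def estimate_class_covariances(X, y, mus):
    """Estimate Sigma_i as the Bessel-corrected sample covariance of class i."""
    Sigmas = []
    for i, mu in enumerate(mus):
        Xc_i = X[y == i] - mu          # centred samples of class i
        m_i = Xc_i.shape[0]
        if m_i < 2:
            raise ValueError(f"class {i} has fewer than 2 samples")
        Sigmas.append(Xc_i.T @ Xc_i / (m_i - 1))
    return Sigmas
```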

There are two problems that can break the algorithm. First, if one of the \(m_i\) is \(1\), then \(\Sigma_i\) is ill-defined (the denominator \(m_i - 1\) vanishes). Second, one of the \(\Sigma_i\)'s might be singular, in which case neither \(\log \det \Sigma_i\) nor \(\Sigma_i^{-1}\) in (0) makes sense.

In either case, there is no way around it, and the implementation should throw an exception.

This won't be a problem for LDA, though, unless there is only one sample per class.

Vanilla LDA

Now let us look at LDA.

Since all classes share the same covariance, we estimate \(\Sigma\) using the pooled sample covariance

\[\begin{aligned} \Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T}, \end{aligned}\]

where \(\hat x^{(j)} = x^{(j)} - \mu_{y_j}\) and the factor \({1 \over m - n_c}\) comes from Bessel's correction.
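A sketch of the pooled estimate, with the same assumed names and label convention as before:

```python
import numpy as np

def estimate_pooled_covariance(X, y, mus):
    """Estimate the shared Sigma, dividing by m - n_c (Bessel's correction)."""
    m, n_c = X.shape[0], mus.shape[0]
    Xc = X - mus[y]                    # subtract from each sample its class mean
    return Xc.T @ Xc / (m - n_c)
```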

Let us write down the decision function (0). We can remove the first term on the right hand side, since all \(\Sigma_i\) are the same, and we only care about argmax of that equation. Thus it becomes

\[- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)\]

Notice that we have just gotten around the problem of computing \(\log \det \Sigma\) when \(\Sigma\) is singular.

But how about \(\Sigma^{-1}\)?

We sidestep this problem by using the pseudoinverse of \(\Sigma\) instead. This can be seen as applying a linear transformation to \(X\) that turns its covariance matrix into the identity, and the model thus becomes a sort of nearest neighbour classifier on the class centroids.

Nearest neighbour classifier

More specifically, we want to transform the first term of (1) into a norm, to obtain a classifier based on nearest neighbour modulo \(\log \pi_i\):

\[- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i\]

To compute \(A\), we denote

\[X_c = X - M,\]

where the \(i\)th row of \(M\) is \(\mu_{y_i}^T\), the mean of the class that \(x^{(i)}\) belongs to, so that \(\Sigma = {1 \over m - n_c} X_c^T X_c\).

Let

\[{1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T\]

be the SVD of \({1 \over \sqrt{m - n_c}}X_c\). Let \(D_x = \text{diag} (s_1, ..., s_r)\) be the diagonal matrix of the nonzero singular values, and redefine \(V_x\) to be the \(n \times r\) matrix consisting of the first \(r\) columns of \(V_x\).

Then with an abuse of notation, the pseudoinverse of \(\Sigma\) is

\[\Sigma^{-1} = V_x D_x^{-2} V_x^T.\]

So we just need to take \(A = D_x^{-1} V_x^T\). At prediction time, we transform \(x\) with \(A\), find the nearest centroid \(A \mu_i\) (again, modulo \(\log \pi_i\)), and label the input with \(i\).
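Here is a sketch of how \(A\) and the nearest-centroid prediction could be computed with NumPy's SVD; the tolerance used to decide which singular values count as nonzero is my own choice:

```python
import numpy as np

def whitening_transform(X, y, mus, tol=1e-12):
    """Compute A = D_x^{-1} V_x^T from the SVD of X_c / sqrt(m - n_c)."""
    m, n_c = X.shape[0], mus.shape[0]
    Xc = (X - mus[y]) / np.sqrt(m - n_c)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    r = int((s > tol).sum())           # keep only the nonzero singular values
    return Vt[:r] / s[:r, None]        # A = D_x^{-1} V_x^T, shape (r, n)

def nearest_centroid_predict(x, A, mus, pis):
    """Label x with the class of the nearest transformed centroid A mu_i,
    modulo the log pi_i term."""
    d2 = ((A @ x - mus @ A.T) ** 2).sum(axis=1)   # squared distances to A mu_i
    return int(np.argmax(-0.5 * d2 + np.log(pis)))
```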

Dimensionality reduction

We can further simplify the prediction by dimensionality reduction. Assume \(n_c \le n\). Then the centroids span an affine space of dimension \(p\), which is at most \(n_c - 1\). So what we can do is to project both the transformed sample \(Ax\) and the transformed centroids \(A\mu_i\) onto the linear subspace parallel to this affine space, and do the nearest neighbour classification there.

So we perform an SVD on the matrix \((M - \bar x) V_x D_x^{-1}\), where \(\bar x\), a row vector, is the sample mean of all the data, i.e. the average of the rows of \(X\):

\[(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.\]

Again, we redefine \(V_m\) to be the \(r \times p\) matrix formed by the first \(p\) columns of \(V_m\).

The projection operator is thus \(V_m\). And so the final transformation is \(V_m^T D_x^{-1} V_x^T\).

There is no reason to stop here, and we can set \(p\) even smaller, which will result in a lossy compression / regularisation equivalent to doing principal component analysis on \((M - \bar x) V_x D_x^{-1}\).
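Continuing the sketch, the second SVD and the final transformation might look like the following; passing a smaller `p` gives the lossy compression described above:

```python
import numpy as np

def reduced_transform(X, y, mus, p=None, tol=1e-12):
    """Compute the final transformation V_m^T D_x^{-1} V_x^T, of shape (p, n)."""
    A = whitening_transform(X, y, mus, tol)        # D_x^{-1} V_x^T, shape (r, n)
    xbar = X.mean(axis=0)
    # SVD of (M - xbar) V_x D_x^{-1}, where row j of M is the centroid of sample j's class
    _, s, Vtm = np.linalg.svd((mus[y] - xbar) @ A.T, full_matrices=False)
    if p is None:
        p = int((s > tol).sum())                   # dimension of the centroid span
    return Vtm[:p] @ A                             # V_m^T D_x^{-1} V_x^T
```

Prediction then proceeds exactly as in the nearest-centroid sketch, with this matrix in place of \(A\).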

Note that as of 2019-01-04, in the scikit-learn implementation of LDA, the prediction is done without any lossy compression, even if the parameter n_components is set to be smaller than the dimension of the affine space spanned by the centroids. In other words, the prediction does not change regardless of n_components.

Fisher discriminant analysis

Fisher discriminant analysis involves finding an \(n\)-dimensional vector \(a\) that maximises the between-class covariance relative to the within-class covariance:

\[{a^T M_c^T M_c a \over a^T X_c^T X_c a},\]

where \(M_c = M - \bar x\) is the centred sample mean matrix.

As it turns out, this is (almost) equivalent to the derivation above, modulo a constant. In particular, \(a = c V_x D_x^{-1} V_m\) with \(p = 1\), for an arbitrary nonzero constant \(c\).

To see this, we can first multiply the denominator by the constant \({1 \over m - n_c}\), which only scales the objective and does not change the maximiser, so that the matrix in the denominator becomes the covariance estimate \(\Sigma\).

We decompose \(a\): \(a = V_x D_x^{-1} b + \tilde V_x \tilde b\), where \(\tilde V_x\) consists of column vectors orthogonal to the column space of \(V_x\).

We ignore the second term in the decomposition; in other words, we only consider \(a\) in the column space of \(V_x\). (When \(\Sigma\) is nonsingular, \(V_x\) already spans the whole space and the second term is not there to begin with.)

Then the problem is to find an \(r\)-dimensional vector \(b\) to maximise

\[{b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.\]

This is the problem of principal component analysis, and so \(b\) is the first column of \(V_m\).

Therefore, the solution to Fisher discriminant analysis is \(a = c V_x D_x^{-1} V_m\) with \(p = 1\).
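As a numerical sanity check of this claim, one can compare the maximised ratio attained at such an \(a\) with the largest generalised eigenvalue of the pair \((M_c^T M_c, X_c^T X_c)\), which is the textbook characterisation of the Fisher optimum. A sketch, on random data and assuming \(X_c^T X_c\) is nonsingular and scipy is available:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, size=200)

mus = np.stack([X[y == i].mean(axis=0) for i in range(3)])
Xc, Mc = X - mus[y], mus[y] - X.mean(axis=0)

# a = c V_x D_x^{-1} v, with v the first column of V_m
_, s, Vt = np.linalg.svd(Xc / np.sqrt(len(y) - 3), full_matrices=False)
A = Vt / s[:, None]                       # D_x^{-1} V_x^T (full rank here)
_, sm, Vtm = np.linalg.svd(Mc @ A.T, full_matrices=False)
a = A.T @ Vtm[0]

ratio = (a @ Mc.T @ Mc @ a) / (a @ Xc.T @ Xc @ a)
lam_max = eigh(Mc.T @ Mc, Xc.T @ Xc, eigvals_only=True)[-1]
print(np.isclose(ratio, lam_max))         # expect True
```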

Linear model

The model is called linear discriminant analysis because it is a linear model. To see this, let \(B = V_m^T D_x^{-1} V_x^T\) be the matrix of transformation. Now we are comparing

\[- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k\]

across all \(k\). Expanding the norm and removing the common term \(- {1 \over 2} \|B x\|^2\), we are left with a linear form:

\[\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k\]

So, writing \(K\) for the \(n_c \times n\) matrix whose \(k\)th row is \(\mu_k^T\), the prediction for \(X_{\text{new}}\) is

\[\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right)\]

thus the decision boundaries are linear.
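In code, this amounts to precomputing a coefficient matrix and an intercept vector once, so that prediction is a single matrix product followed by an argmax. A sketch, using the transformation \(B\) from the sketches above (this is not necessarily how scikit-learn organises it):

```python
import numpy as np

def linear_model(B, mus, pis):
    """Precompute the linear decision function for the transformation B."""
    centroids = mus @ B.T                            # row k is (B mu_k)^T
    coef = centroids @ B                             # row k is mu_k^T B^T B
    intercept = -0.5 * (centroids ** 2).sum(axis=1) + np.log(pis)
    return coef, intercept

def linear_predict(X_new, coef, intercept):
    # the decision function is affine in the input, hence linear boundaries
    return np.argmax(X_new @ coef.T + intercept, axis=1)
```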

This is how scikit-learn implements LDA, by inheriting from LinearClassifierMixin and redirecting the classification there.

Implementation

This is where things get interesting. How do I validate my understanding of the theory? By implementing and testing the algorithm.

I try to implement it as closely as possible to the natural language / mathematical descriptions of the model, which means favouring clarity over performance.

How about testing? Numerical algorithms are in general harder to test than combinatorial / discrete ones, because their output is less verifiable by hand. My shortcut solution to this problem is to test against the output of the scikit-learn package.

It turned out to be harder than expected, as I had to dig into the scikit-learn code whenever the outputs did not match. Their code is quite well written, though.
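For example, one kind of test compares the predictions obtained directly from formula (1), with the pseudoinverse, against scikit-learn's predictions on random data. This is a sketch of the idea rather than the actual test suite:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 3, size=200)

# naive LDA prediction straight from formula (1)
mus = np.stack([X[y == i].mean(axis=0) for i in range(3)])
pis = np.bincount(y) / len(y)
Xc = X - mus[y]
Sigma_pinv = np.linalg.pinv(Xc.T @ Xc / (len(y) - 3))
diff = X[:, None, :] - mus[None, :, :]                  # shape (m, n_c, n)
scores = -0.5 * np.einsum('ikp,pq,ikq->ik', diff, Sigma_pinv, diff) + np.log(pis)

sk = LinearDiscriminantAnalysis(solver="svd").fit(X, y)
print((scores.argmax(axis=1) == sk.predict(X)).all())   # expect True
```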

The result is here.

Fun facts about LDA

One property that can be used to test the LDA implementation is the fact that the scatter matrix \(B(X - \bar x)^T (X - \bar x) B^T\) of the transformed centred sample is diagonal.

This can be derived by using another fun fact that the sum of the in-class scatter matrix and the between-class scatter matrix is the sample scatter matrix:

\[X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).\]

The verification is not very hard and is left as an exercise.
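Both facts are also easy to check numerically. Here is a sketch on random data, with \(B\) built as in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, size=200)

mus = np.stack([X[y == i].mean(axis=0) for i in range(3)])
xbar = X.mean(axis=0)
Xc, Mc = X - mus[y], mus[y] - xbar

# in-class scatter + between-class scatter == total scatter
total = (X - xbar).T @ (X - xbar)
print(np.allclose(Xc.T @ Xc + Mc.T @ Mc, total))         # expect True

# build B = V_m^T D_x^{-1} V_x^T and check that B total B^T is diagonal
_, s, Vt = np.linalg.svd(Xc / np.sqrt(len(y) - 3), full_matrices=False)
A = Vt / s[:, None]                                      # full rank here
_, _, Vtm = np.linalg.svd(Mc @ A.T, full_matrices=False)
B = Vtm[:2] @ A                                          # p = n_c - 1 = 2
S = B @ total @ B.T
print(np.allclose(S, np.diag(np.diag(S))))               # expect True
```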