Reading Notes on
A Tutorial on Principal Component Analysis

Wenjia Bai

September 9, 2007

Title: A Tutorial on Principal Component Analysis
Author: Jonathon Shlens

1 The question

Given a data set X = {x1, x2, ..., xn} ⊂ ℝ^m, where n is the number of samples and m is the dimension, how can we find a new basis that best expresses the original data set?

Let P be the linear transformation matrix to the new basis. The data set expressed in the new basis is,

Y = PX

What defines a good basis?

A good basis should correspond to the directions with the largest variances, because we assume that the variances capture the dynamics of interest. Also, the dynamics along different directions should be decorrelated. These directions with the largest variances are called the principal components.

2 The mathematics

How can we describe a good basis mathematically? After subtracting the mean from each dimension of the data, we define the covariance matrix Cx,

Cx = (1/(n-1)) X X^T    (1)

Cx is an m-by-m matrix. Its iith (diagonal) element is the variance of the ith dimension of the data, and its ijth (i ≠ j, off-diagonal) element is the covariance between the ith and the jth dimensions of the data.

If the basis is good, the covariance matrix Cx is diagonal: the principal components can then be selected easily according to their variances, and the dynamics along different directions are decorrelated.
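As a concrete illustration, here is a minimal sketch of Equation (1) in Python, assuming the data set is stored as an m-by-n NumPy array with one sample per column; the function name is illustrative.

import numpy as np

def covariance_matrix(X):
    # X: m-by-n array, rows are dimensions, columns are samples
    n = X.shape[1]
    X_centered = X - X.mean(axis=1, keepdims=True)  # subtract the mean of each dimension
    return X_centered @ X_centered.T / (n - 1)      # the m-by-m matrix Cx of Equation (1)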

3 Solution

A symmetric matrix can be diagonalized by an orthogonal matrix of its eigenvectors. Since Cx is a symmetric matrix, it can be diagonalized,

Cx = P Σ P^T,  or equivalently  P^T Cx P = Σ    (2)

where the column vectors of P are the eigenvectors of Cx, and Σ = diag(σ1², ..., σm²).

Substituting Equation (1) into Equation (2), we can derive that,

Σ = (1/(n-1)) (P^T X)(P^T X)^T

where Σ is the new covariance matrix, P^T is the linear transformation matrix, the eigenvectors {p1, p2, ..., pm} form the new basis, and Y = P^T X is the new data set.

We can then select the directions with the largest eigenvalues (variances) as the principal components.
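The eigendecomposition route can be sketched in Python as follows. This is only a sketch, assuming X is an m-by-n array whose rows already have zero mean; numpy.linalg.eigh is used because Cx is symmetric, and the function name is illustrative.

import numpy as np

def pca_eig(X):
    # X: m-by-n array with zero-mean rows
    n = X.shape[1]
    Cx = X @ X.T / (n - 1)               # covariance matrix, Equation (1)
    variances, P = np.linalg.eigh(Cx)    # Cx = P diag(variances) P^T
    order = np.argsort(variances)[::-1]  # sort directions by decreasing variance
    variances, P = variances[order], P[:, order]
    Y = P.T @ X                          # data expressed in the new basis, Y = P^T X
    return variances, P, Y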

4 Relationship with SVD

In singular value decomposition (SVD), we can decompose an m-by-n matrix X into the product of an m-by-m orthogonal matrix U, an m-by-n diagonal matrix Σ, and the transpose of an n-by-n orthogonal matrix V,

X = U Σ V^T

Applying SVD to Equation (1),

Cx = (1/(n-1)) U Σ V^T (U Σ V^T)^T
   = (1/(n-1)) U Σ² U^T    (3)

Comparing it with Equation (2), we can see that the orthogonal matrix U is exactly P, and that the diagonal matrix Σ² in Equation (3), divided by (n - 1), is equal to the Σ in Equation (2) (note that Σ denotes the singular-value matrix in the SVD but the eigenvalue matrix in Equation (2)). The new data set is Y = U^T X = Σ V^T.
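A corresponding sketch of the SVD route, again assuming an m-by-n array X with zero-mean rows; the function name is illustrative.

import numpy as np

def pca_svd(X):
    # X: m-by-n array with zero-mean rows
    n = X.shape[1]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) Vt
    variances = s ** 2 / (n - 1)   # equal to the eigenvalues of Cx in Equation (2)
    Y = U.T @ X                    # new data set, also equal to np.diag(s) @ Vt
    return variances, U, Y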

Therefore, SVD provides another solution for PCA. In fact, SVD and PCA are so intimately related that the names are often used interchangeably. They both reveal the structure of the matrix, or the data set, X. That is why they are widely used in many forms of data analysis, as a simple non-parametric method for extracting relevant information from confusing data sets.

5 Strength and weakness

Both the strength and the weakness of PCA lie in its being a non-parametric analysis. On the one hand, there are no parameters to estimate or adjust, so the answer is data-driven and does not depend on the user. On the other hand, it cannot take any a priori knowledge into account, as parametric algorithms do.

6 Note

6.1 Zero mean

The mean of each dimension should be subtracted first, i.e. the data set should have zero mean, so that Equation (1) represents the covariance matrix.

6.2 Variance characterizes the dynamics

When PCA selects the principal components, it assumes that the mean is zero and that the variance of the data characterizes the dynamics of the signal. The mean and variance are statistically sufficient only for exponentially distributed data, e.g. Gaussian, exponential, etc.

6.3 Orthogonal basis

The principal components are orthogonal. However, the orthogonal basis is not always the best description for the data set.