In this post, we introduce the math foundation behind principal component analysis, a simple technique for dimensionality reduction.
yang’s post (written in Chinese)
The efficiency of features
The high-level idea of principal component analysis is to reduce the dimensionality of a given dataset by transforming it into a new dataset in which all variables (features) are independent of each other. We denote this transformation as follow:
where is the original dataset with instances and variables, is the reduction matrix, and is the transformed dataset with variables. To better understand how information was stored in our dataset matrix , we can use the concept of variance and covariance of feature variables in dataset matrix . Let’s denote as a variable in the variable vector . The variance of is:
The magnitude of the variance can be used to indicate the volume of information embedded in this variable. A variable contains no information is a constant with zero variance. Covariance, as another measure, indicate the level of dependency between two variables:
To represent both variance and covariance in a matrix form, we can use the concept of covariance matrix. Let’s denote the covariance matrix of the random variables vector as :
Where the diagonal of are variance of each variable, and the non-diagonal elements are covariances of any two variables. Noted that if we normalized by making all , our transformed still contains the same information; therefore, our covariance matrix can be simplified as follow:
The objective of principal component analysis
From (3), we know that is the covariance matrix associated with the original dataset . For the transformed dataset , we can also obtain a covariance matrix using the same procedural as (4). Interestingly, has can be rewritten with only and :
Since is a covariance matrix that contains variances and covariances of the new feature variables vector, our objective is to find a such that any two random variables of this new feature variables vector have zero covariance. In addition, because has a smaller dimension than , we also want this new feature variables vector to retain as much information as possible from the original dataset. Given real values such that , we want D to be a diagonal covariance matrix:
This process of finding that turns D into a diagonal matrix is called diagonalization in linear algebra (see Diagonalizable matrix). It involves eigendecomposition of matrix , which happens to be a real symmetric matrix in this case. So if we apply eigendecomposition on , we get:
where is an orthogonal matrix.
Relationship with SVD
It’s also common to solve PCA problem with singular value decomposition. First, let’s apply SVD on :
For , we have:
Given (7) and (9), it’s not difficult to see that and , where is a constant. This means we can solve PCA by only applying SVD on , without calculating .