This article introduces Principal Component Analysis (PCA), one of the most important dimensionality reduction methods in machine learning.
Introduction to PCA
In multivariate statistical analysis, Principal Component Analysis (PCA) is a statistical method used to simplify datasets. It applies an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables, known as principal components (PCs). Each principal component can be regarded as a linear combination of the original variables, with its coefficients indicating the direction of the projection. PCA is sensitive to the normalization or preprocessing (relative scaling) of the raw data.
The core idea of PCA is to reduce the dimensionality of a dataset consisting of many interrelated variables while retaining as much of its variability as possible. This is achieved by transforming the original variables into a new set of uncorrelated variables, the principal components, which are ordered so that the first few retain most of the variation present in all of the original variables, yielding a lower-dimensional dataset.
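As a small illustration of that sensitivity to scaling (this sketch is an addition, not part of the original article; the synthetic data and random seed are arbitrary), changing the units of one feature changes the first principal direction, and standardizing the data restores it:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.column_stack([z + 0.1 * rng.normal(size=200),   # two strongly correlated features
                     z + 0.1 * rng.normal(size=200)])

X_rescaled = X.copy()
X_rescaled[:, 1] *= 100      # change only the units of the second feature

print(PCA(n_components=1).fit(X).components_)            # roughly equal weight on both features
print(PCA(n_components=1).fit(X_rescaled).components_)   # almost entirely along the rescaled feature
print(PCA(n_components=1).fit(
    StandardScaler().fit_transform(X_rescaled)).components_)  # roughly equal weight again
```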
The Mathematical Principles Behind PCA and Python Demonstration
Next, we’ll go through the concrete operations behind PCA. PCA can be viewed as an unsupervised learning problem. The entire process of obtaining principal components from the original dataset can be divided into six steps (a compact sketch that strings them all together follows the list):
- Convert the dataset from (d+1)-dimensional to d-dimensional by removing labels.
- Calculate the mean of each dimension of the dataset.
- Compute the covariance matrix of the dataset.
- Compute eigenvectors and corresponding eigenvalues.
- Sort the eigenvectors by decreasing eigenvalues and select the top k eigenvectors to form a d×k matrix W.
- Use this d×k eigenvector matrix to project the samples into a new subspace.
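Before walking through the steps one by one, here is a minimal sketch that collects them into a single helper (the name `pca_manual` and the toy array are illustrative, not from the original; the sketch also subtracts the mean before projecting, which is the usual convention):

```python
import numpy as np

def pca_manual(X, k):
    """Minimal PCA sketch: project an (n, d) array onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                  # step 2: subtract the mean of each dimension
    cov = np.cov(X_centered.T)                       # step 3: d x d covariance matrix
    eigenvalues, eigenvectors = np.linalg.eig(cov)   # step 4: eigenvalues and eigenvectors
    idx = np.argsort(eigenvalues)[::-1][:k]          # step 5: indices of the k largest eigenvalues
    W = eigenvectors[:, idx]                         # d x k projection matrix
    return X_centered @ W                            # step 6: project into the new subspace

# Toy usage: 6 samples, 3 features (step 1, removing the labels, has already happened)
X = np.array([[90, 60, 80], [90, 100, 40], [80, 90, 70],
              [60, 60, 60], [70, 60, 50], [70, 50, 30]], dtype=float)
print(pca_manual(X, k=2).shape)  # (6, 2)
```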
We will explain each step one by one:
Convert the (d+1)-dimensional dataset into a d-dimensional dataset by removing labels
Assume we have a (d+1)-dimensional dataset. In modern machine learning terminology, this can be viewed as `X_train` with `y_train` as the labels. After removing the labels, we are left with a d-dimensional dataset, which will be used for PCA.
Assume d = 3, so we are left with a 3D dataset. We’ll assume the samples come from two categories: half labeled 0 and the other half labeled 1.
Let’s say our data matrix `score` consists of student scores in Chinese, Math, and English, and `category` refers to the student type. The matrix is as follows:
```python
import numpy as np
import pandas as pd

# Student score matrix (columns: Chinese, Math, English, category)
stu_score = np.matrix([[90, 60, 80, 0],
                       [90, 100, 40, 0],
                       [80, 90, 70, 0],
                       [60, 60, 60, 1],
                       [70, 60, 50, 1],
                       [70, 50, 30, 1]])
stu_score
# matrix([[ 90,  60,  80,   0],
#         [ 90, 100,  40,   0],
#         [ 80,  90,  70,   0],
#         [ 60,  60,  60,   1],
#         [ 70,  60,  50,   1],
#         [ 70,  50,  30,   1]])

# Score matrix
score = stu_score[:, :3]

# Student categories
category = stu_score[:, 3]
```
Calculate the mean of each dimension of the dataset
The data above can be represented as a matrix A, with each column representing a subject score and each row representing a student. We use NumPy’s `mean` to compute the mean of each column:
```python
pri_mean = np.mean(score, axis=0)
pri_mean
# matrix([[76.66666667, 70.        , 55.        ]])
```
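Although the walkthrough keeps working with the raw scores, this mean is exactly what would be subtracted to center the data in a standard PCA pipeline. A quick illustrative check (an addition, not from the original):

```python
# Subtracting the column means centers the data: every column then averages to zero
score_centered = score - pri_mean
np.allclose(score_centered.mean(axis=0), 0)  # True
```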
Compute the covariance matrix of the dataset
We use the following formula to compute the sample covariance between two variables $X$ and $Y$ (with $n$ observations and means $\bar{X}$, $\bar{Y}$):

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
We use NumPy’s `cov` function to calculate it:
```python
cov_matrix = np.cov(score.T)
cov_matrix
# array([[146.66666667, 140.        ,  60.        ],
#        [140.        , 400.        ,  20.        ],
#        [ 60.        ,  20.        , 350.        ]])
```
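As a quick sanity check (an addition, not part of the original walkthrough), applying the covariance formula above to the Chinese and Math columns reproduces the 140 that appears in the matrix:

```python
# Sample covariance of Chinese and Math computed directly from the formula
chinese = np.asarray(score[:, 0]).ravel()
math_scores = np.asarray(score[:, 1]).ravel()
n = len(chinese)
cov_chinese_math = ((chinese - chinese.mean()) * (math_scores - math_scores.mean())).sum() / (n - 1)
np.isclose(cov_chinese_math, np.cov(score.T)[0, 1])  # True: both equal 140 up to floating-point error
```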
Notes:
- The diagonal represents the variance of each subject—Math has the highest variance, Chinese the lowest.
- The covariance between Math and English is smallest, indicating they are the least correlated.
Compute eigenvectors and corresponding eigenvalues
Eigenvectors retain their direction under a linear transformation; they are only scaled. For a square matrix $A$, a vector $v$ and a scalar $\lambda$ satisfy

$$A v = \lambda v,$$

where $\lambda$ is the eigenvalue and $v$ is the corresponding eigenvector.
We compute them using NumPy:
```python
eigenvalue, eigenvector = np.linalg.eig(np.cov(score.T))
eigenvalue
# array([ 76.41469158, 477.07841359, 343.17356149])
eigenvector  # each column is the eigenvector paired with the eigenvalue in the same position
# array([[-0.90804625,  0.4184661 ,  0.01838809],
#        [ 0.38228723,  0.84588381, -0.3719369 ],
#        [ 0.17119717,  0.33070638,  0.92807587]])
```
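As a small check (again an addition, not from the original), each computed eigenpair does satisfy the defining relation $A v = \lambda v$:

```python
# Verify A v = lambda v for every eigenpair, up to floating-point error
A = np.cov(score.T)
for lam, v in zip(eigenvalue, eigenvector.T):
    assert np.allclose(A @ v, lam * v)
```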
Sort and select top k eigenvectors to form a d×k matrix W
To reduce dimensionality, we sort the eigenvectors by their corresponding eigenvalues in descending order, then select the top k:
```python
# Sort the eigenvalues in descending order and keep the top k = 2 eigenvectors
idx = np.argsort(eigenvalue)[::-1]
eigenvector = eigenvector[:, idx[:2]]
eigenvalue = eigenvalue[idx[:2]]
```
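How much do we lose by keeping only two components? A quick illustrative check (not in the original), using the fact that the total variance equals the trace of the covariance matrix:

```python
# Fraction of total variance retained by the two selected components
total_variance = np.trace(np.cov(score.T))  # trace = sum of all eigenvalues
print(eigenvalue.sum() / total_variance)    # about 0.91, i.e. ~91% of the variance is kept
```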
Use the d×k eigenvector matrix to transform samples into the new subspace
Use a matrix product to project the data into the new subspace. (Strictly speaking, a standard PCA projection multiplies the centered data, `score - pri_mean`, by W; projecting the raw scores as below shifts every projected point by the same constant offset but leaves their relative positions unchanged.)
```python
new_matrix = np.dot(score, eigenvector)
new_matrix
# matrix([[114.87148787,  53.58478318],
#         [135.47858486,   1.58427232],
#         [132.75627721,  32.96203655],
#         [ 95.70337722,  34.4716232 ],
#         [ 96.58097443,  25.37474538],
#         [ 81.50800877,  10.53259702]])
```
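For reference (an addition to the original walkthrough), projecting the centered scores with the same W gives the same point cloud, just shifted by a constant offset, consistent with the note above:

```python
# Projecting the centered data differs from new_matrix only by a constant row offset
centered_projection = np.dot(score - pri_mean, eigenvector)
offset = np.dot(pri_mean, eigenvector)
np.allclose(new_matrix - offset, centered_projection)  # True
```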
Summary
Above, we used NumPy to manually compute PCA and map to a new subspace. In practice, scikit-learn provides a much simpler PCA interface. Below is a comparison using both methods:
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

score = np.array([[90, 60, 80], [90, 100, 40], [80, 90, 70],
                  [60, 60, 60], [70, 60, 50], [70, 50, 30]])

# Raw NumPy version
x_std = StandardScaler().fit_transform(score)
eigenvalue, eigenvector = np.linalg.eig(np.cov(x_std.T))
idx = np.argsort(eigenvalue)[::-1]
eigenvector = eigenvector[:, idx[:2]]
eigenvalue = eigenvalue[idx[:2]]
new_matrix = np.dot(x_std, eigenvector)
-new_matrix  # negated: eigenvector signs are arbitrary, so this matches sklearn's sign convention
# array([[-0.97429996, -1.49031278],
#        [-1.59492786,  1.54333152],
#        [-1.19990275, -0.32980079],
#        [ 1.29979933, -0.57553885],
#        [ 0.86561934,  0.00681518],
#        [ 1.60371189,  0.84550571]])

# sklearn version
pca = decomposition.PCA(n_components=2)
pca.fit_transform(x_std)
# array([[-0.97429996, -1.49031278],
#        [-1.59492786,  1.54333152],
#        [-1.19990275, -0.32980079],
#        [ 1.29979933, -0.57553885],
#        [ 0.86561934,  0.00681518],
#        [ 1.60371189,  0.84550571]])
```
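If you go the scikit-learn route, the fitted PCA object also reports how much variance each component explains (a small addition to the comparison above):

```python
# Per-component share of the total variance
print(pca.explained_variance_ratio_)
```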
To better understand the PCA process, first study the raw NumPy version. If you just need the result, the sklearn version is more convenient.