The Mathematical Principles Behind PCA and Python Example Demonstration

This article introduces a very important dimensionality reduction method in machine learning—Principal Component Analysis.

Introduction to PCA

In multivariate statistical analysis, Principal Component Analysis (PCA) is a statistical method for simplifying datasets. It applies an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components (PCs). Each principal component can be regarded as a linear combination of the original variables, with the coefficients giving the direction of the projection. PCA is sensitive to the normalization or preprocessing (relative scaling) of the raw data.

The core idea of PCA is to reduce the dimensionality of a dataset composed of many interrelated variables while retaining as much of its variability as possible. This is achieved by transforming the original, correlated variables into a new set of uncorrelated variables, the principal components (PCs), ordered so that the first few retain most of the variation present in the original variables, thereby yielding a lower-dimensional dataset.

The Mathematical Principles Behind PCA and Python Demonstration

Next, we’ll go through the concrete operations behind PCA. PCA can be viewed as an unsupervised learning problem. The entire process of obtaining principal components from the original dataset can be divided into six parts (a compact sketch of the full pipeline follows this list):

  • Convert the dataset from (d+1)-dimensional to d-dimensional by removing labels.
  • Calculate the mean of each dimension of the dataset.
  • Compute the covariance matrix of the dataset.
  • Compute eigenvectors and corresponding eigenvalues.
  • Sort the eigenvectors by decreasing eigenvalues and select the top k eigenvectors to form a d×k matrix W.
  • Use this d×k eigenvector matrix to project the samples into a new subspace.
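Before explaining each step in detail, here is a minimal sketch of the whole pipeline in NumPy. It assumes a data matrix X with one sample per row and labels already removed; the function name pca_manual and the explicit mean-centering are my own additions, while the worked example below keeps the raw scores.

```python
import numpy as np

def pca_manual(X, k):
    """Minimal PCA sketch: project the rows of X onto the top k principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)               # step 2: compute and subtract the per-column mean
    cov = np.cov(X_centered.T)                    # step 3: covariance matrix (d x d)
    eigenvalue, eigenvector = np.linalg.eig(cov)  # step 4: eigenvalues and eigenvectors
    idx = np.argsort(eigenvalue)[::-1]            # step 5: sort by decreasing eigenvalue
    W = eigenvector[:, idx[:k]]                   # d x k projection matrix
    return X_centered @ W                         # step 6: project into the new subspace
```

Applied to the score matrix introduced below with k = 2, this reproduces the step-by-step results of the article up to the constant offset caused by centering.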

We will explain each step one by one:

Convert the (d+1)-dimensional dataset into a d-dimensional dataset by removing labels

Assume we have a (d+1)-dimensional dataset. In modern machine learning terminology, this can be viewed as X_train with y_train as the labels. After removing the labels, we are left with a d-dimensional dataset, which will be used for PCA.

Assume d = 3, so we are left with a 3D dataset. We’ll assume the samples come from two categories, with half labeled as class 0 and the other half as class 1.

Let’s say our data matrix score consists of student scores in Chinese, Math, and English. category refers to the student type. The matrix is as follows:

```python
import numpy as np
import pandas as pd

# Student score matrix
stu_score = np.matrix([[90, 60, 80, 0],
                       [90, 100, 40, 0],
                       [80, 90, 70, 0],
                       [60, 60, 60, 1],
                       [70, 60, 50, 1],
                       [70, 50, 30, 1]])
stu_score
# matrix([[ 90,  60,  80,   0],
#         [ 90, 100,  40,   0],
#         [ 80,  90,  70,   0],
#         [ 60,  60,  60,   1],
#         [ 70,  60,  50,   1],
#         [ 70,  50,  30,   1]])

# Score matrix (first three columns)
score = stu_score[:, :3]

# Student categories (last column)
category = stu_score[:, 3]
```

Calculate the mean of each dimension of the dataset

The data above can be represented as matrix A, with each column representing a subject score and each row representing a student. We use NumPy’s mean to compute the mean values:

```python
pri_mean = np.mean(score, axis=0)
pri_mean
# matrix([[76.66666667, 70.        , 55.        ]])
```
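This mean vector is the center of the data cloud. Standard PCA subtracts it from every row before projecting; as a tiny illustrative sketch (the name score_centered is my own, and the worked example below keeps the raw scores):

```python
# Subtract the column means; each column of the centered data now averages to zero
score_centered = score - pri_mean
np.allclose(score_centered.mean(axis=0), 0)
# True
```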

Compute the covariance matrix of the dataset

We use the following formula to compute the covariance between two variables X and Y:

$$\operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$

Using NumPy’s cov function to calculate:

```python
cov_matrix = np.cov(score.T)
cov_matrix
# array([[146.66666667, 140.        ,  60.        ],
#        [140.        , 400.        ,  20.        ],
#        [ 60.        ,  20.        , 350.        ]])
```

Notes:

  1. The diagonal contains the variance of each subject: Math has the highest variance (400) and Chinese the lowest (about 146.7).
  2. The covariance between Math and English (20) is the smallest off-diagonal entry, indicating that these two subjects are the least correlated.
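As a quick check on these observations, the covariances can be rescaled by the standard deviations to give correlations; this verification is an addition to the original walkthrough, with values rounded to three decimals:

```python
# Correlation matrix: covariances divided by the product of standard deviations
np.corrcoef(score.T)
# roughly:
# [[1.   , 0.578, 0.265],
#  [0.578, 1.   , 0.053],
#  [0.265, 0.053, 1.   ]]
# The Math-English entry (0.053) is the weakest correlation, matching note 2.
```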

Compute eigenvectors and corresponding eigenvalues

Eigenvectors are the vectors whose direction is preserved by a linear transformation; they are only scaled by the corresponding eigenvalue. For a square matrix A, a vector v and scalar λ satisfy:

$$A v = \lambda v,$$

where λ is the eigenvalue and v is the corresponding eigenvector. Here A is the covariance matrix, so its eigenvectors give the directions of maximal variance and its eigenvalues measure how much variance lies along each of those directions.

We compute them using NumPy:

```python
eigenvalue, eigenvector = np.linalg.eig(np.cov(score.T))
eigenvalue
# array([ 76.41469158, 477.07841359, 343.17356149])

eigenvector
# array([[-0.90804625,  0.4184661 ,  0.01838809],
#        [ 0.38228723,  0.84588381, -0.3719369 ],
#        [ 0.17119717,  0.33070638,  0.92807587]])
```
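As a sanity check (an addition to the original walkthrough), we can confirm that each column of eigenvector really satisfies Av = λv for the covariance matrix:

```python
# Verify A v = lambda v for every eigenpair (eigenvectors are the columns)
A = np.cov(score.T)
np.allclose(A @ eigenvector, eigenvector * eigenvalue)
# True
```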

Sort and select top k eigenvectors to form a d×k matrix W

To reduce dimensionality, we sort the eigenvectors by their corresponding eigenvalues in descending order, then select the top k (here k = 2):

```python
# Sort eigenvalues in descending order and keep the top k = 2 eigenvectors
idx = np.argsort(eigenvalue)[::-1]
eigenvector = eigenvector[:, idx[:2]]
eigenvalue = eigenvalue[idx[:2]]
```
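How much information do the two retained components keep? The eigenvalues answer this directly; the ratio below is an addition to the original walkthrough:

```python
# The trace of the covariance matrix equals the sum of all eigenvalues,
# so this is the fraction of total variance captured by the retained components
eigenvalue.sum() / np.trace(np.cov(score.T))
# about 0.91, i.e. the top two PCs retain roughly 91% of the variance
```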

Use the d×k eigenvector matrix to transform samples into the new subspace

Use matrix dot product to project data into the new subspace:

```python
new_matrix = np.dot(score, eigenvector)
new_matrix
# matrix([[114.87148787,  53.58478318],
#         [135.47858486,   1.58427232],
#         [132.75627721,  32.96203655],
#         [ 95.70337722,  34.4716232 ],
#         [ 96.58097443,  25.37474538],
#         [ 81.50800877,  10.53259702]])
```
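Note that the projection above is applied to the raw scores. The more common convention is to subtract the column means (pri_mean, computed earlier) before projecting; because this only subtracts a constant row vector from every projected sample, it shifts the points without changing their relative positions. A minimal sketch of that variant:

```python
# Project the mean-centered data; every projected point is shifted
# by the same constant offset compared with new_matrix above
new_matrix_centered = np.dot(score - pri_mean, eigenvector)
```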

Summary

Above, we used NumPy to compute PCA by hand and map the data to a new subspace. In practice, scikit-learn provides a much simpler PCA interface. Below is a comparison of both methods, this time on standardized scores:

```python
import numpy as np
from sklearn import decomposition
from sklearn.preprocessing import StandardScaler

score = np.array([[90, 60, 80], [90, 100, 40], [80, 90, 70],
                  [60, 60, 60], [70, 60, 50], [70, 50, 30]])

# Raw NumPy version
x_std = StandardScaler().fit_transform(score)
eigenvalue, eigenvector = np.linalg.eig(np.cov(x_std.T))
idx = np.argsort(eigenvalue)[::-1]
eigenvector = eigenvector[:, idx[:2]]
eigenvalue = eigenvalue[idx[:2]]
new_matrix = np.dot(x_std, eigenvector)
# Eigenvector signs are arbitrary, so negate the projection to match sklearn's output
-new_matrix
# array([[-0.97429996, -1.49031278],
#        [-1.59492786,  1.54333152],
#        [-1.19990275, -0.32980079],
#        [ 1.29979933, -0.57553885],
#        [ 0.86561934,  0.00681518],
#        [ 1.60371189,  0.84550571]])

# sklearn version
pca = decomposition.PCA(n_components=2)
pca.fit_transform(x_std)
# array([[-0.97429996, -1.49031278],
#        [-1.59492786,  1.54333152],
#        [-1.19990275, -0.32980079],
#        [ 1.29979933, -0.57553885],
#        [ 0.86561934,  0.00681518],
#        [ 1.60371189,  0.84550571]])
```

To better understand the PCA process, first study the raw NumPy version. If you just need the result, the sklearn version is more convenient.
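Finally, since reducing to two dimensions is usually done for visualization, the projected samples can be plotted and colored by the student category defined at the beginning. This snippet is an addition to the article and assumes new_matrix and category from the worked example above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Scatter plot of the projected samples, colored by the 0/1 student category
points = np.asarray(new_matrix)
labels = np.asarray(category).ravel()
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="coolwarm")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Students projected onto the first two principal components")
plt.show()
```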
