Advanced Machine Learning Techniques: Principal Component Analysis

By Camille D., Age 17

This article will focus on a method data scientists and programmers use to make data easier to explore, visualize, and interpret data, called principal component analysis (PCA). The explanations in this article assume some background in linear algebra and statistics.

PCA is based on dimensionality reduction: “the process of reducing the number of random variables under consideration by obtaining a set of principal variables,” in other words, transforming a large dataset into a smaller one without extracting too much key information. This process is considered expensive for machine learning algorithms; a little accuracy must be traded for simplicity. Minimizing this cost is part of the job for PCA.

The first step of PCA is standardization, the process that is the least mathematically involved. Standardization takes care of the variances within the initial variables, specifically with regards to their ranges. For example, the value of one variable may lie within the range of 0 to 10, and the value of another within the range of 0 to 1. The variable whose possible value lies between 0 and 10 will carry a greater weight over the second variable, leading to biased results. Mathematically, this can be addressed by subtracting the dataset’s mean from the value of the variable and dividing this result by the set’s standard deviation.

After standardization is performed, the values of each variable will all be within the same range.

Note that standardization is different from normalization in descriptive statistics. Normalization rescales the values into a range from 0 to 1, while standardization rescales the dataset to have a mean of 0 and a standard deviation of 1.

In almost any case, of course, this will yield a value smaller than 1.

The second step, covariance matrix computation, is where things unfortunately begin to get more complicated. We first must understand the definition of covariance: “a measure of how much two random variables vary together.”

**Covariance differs from correlation in that correlation describes how strongly two variables are related, while covariance indicates the extent to which two random variables change with one another. The values of covariance lie between -∞ and ∞, while the values of correlation lie between -1 and 1. Correlations can be obtained only when the data is standardized.

Covariance matrix computation aims to investigate how the variables in the input dataset are related to one another. This is important because it helps detect redundant information that may come from a high correlation between two elements. We compute a covariance matrix to determine these correlations. The covariance matrix is an nn matrix, where nis the number of dimensions, that has entries of all possible covariances within the dataset.

A couple notes:

Cov(x,x)=Var(x), or the variance of the initial variable.
The Cov()operator is commutative, so Cov(x,y)=Cov(y,x), so the upper and lower triangular portions of the matrix are equal.

The covariance matrix is simply an organization that lists the correlations between all possible pairs of variables. The sign of the value of the covariance is what tells us about the correlations between elements. If the covariance is positive, then the two variables are directly correlated. If the covariance is negative, the relationship between the two variables is an inverse correlation.

The next step in PCA is actually identifying the principal components by computing the eigenvectors and eigenvalues of the covariance matrix. However many principal components are produced from the dataset should be equal to the amount of dimensions in the set. Principal components are “combinations” or “mixtures” of the initial variables, and are constructed such that each of them are uncorrelated and as much information from the variability initial variables as possible is stored in the first component, and the succeeding components account for the remaining information, as shown in the example plot below for an 8-dimensional dataset:

This form helps significantly with dimensionality reduction because it eliminates components with little to no information while still retaining the information that describes the key relationships within the data. Consider the dataset below:

The direction first principal component line represents the direction of the highest variability in the data. Since the variability is the largest in the first component, the information captured by the first component is also the largest. It’s the line in which the projection of the points onto the line is the most spread out. This line maximizes the average of the squared distances from the projected points to the origin. The direction of the second principal component line should be orthogonal in order for the principal components to be completely uncorrelated.

We continue to calculate principal components n times, where n is the original number of values in the dataset.

Going back to eigenvectors and eigenvalues, here are a couple preliminary notes about eigenvectors and eigenvalues:

Every eigenvector has its own corresponding eigenvalue.
The number of eigenvectors and corresponding eigenvalues is equal to the number of dimensions/variables in the data.
For a tutorial on how to calculate the eigenvalues and eigenvectors of a matrix: https://www.scss.tcd.ie/~dahyotr/CS1BA1/SolutionEigen.pdf.

The eigenvectors of the covariance matrix give the directions of the principal component axes, and the eigenvalues are coefficients for the eigenvectors, and give the scalar amount of variance within each PC. The PCs in order of significance can be obtained by ranking the eigenvalues for each eigenvector from highest to lowest. To get the percentages of the variance carried by each PC, divide each of eigenvalue by the sum of all eigenvalues.

Next, we have to determine whether we want to keep some of the lesser components (the ones with low eigenvalues). We form a matrix called the feature vector with the eigenvectors of the components we do keep. This demonstrates the concept of dimensionality reduction since we are subtracting from the initial amount of principal components we had, which is equal to the dimension of the original dataset.

Lastly, we use our feature vector to restructure our dataset in a sense. We want to put our data in terms of the axes given by the principal components instead of the original axes. We can do this pretty easily by multiplying the transpose of the feature vector by the transpose of the original dataset.