Implementing Principal Component Analysis (PCA) in R
Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
- Abraham Lincoln
The quote above applies to machine learning as well. When building machine learning models, most of the time is spent on the data preprocessing and feature engineering stages.
The general idea of feature engineering is to identify the influential features among all the available features, and then use those features to train the model.
Identifying influential features does not only mean picking features analytically. Sometimes it means combining latent features into new, meaningful features. This is known as dimensionality reduction.
In this article, you will learn the basic concepts needed to perform dimensionality reduction with one famous approach known as Principal Component Analysis, or PCA for short.
Before we dive further, let’s look at the table of contents for this article.
Table of contents:
- Lifting the curse using principal component analysis
- Curse of dimensionality in layman’s terms
- Shlens’ paper nuggets on Principal component analysis
- PCA conceptual background
- Principal component analysis implementation in R programming language
- Loading Iris data set
- Covariance matrix calculation
- Eigen values and Eigen vectors calculation
- PCA component calculation
- Importance of the components
- Summary
The main intention of this article is to explain how to perform principal component analysis in R.
So let’s begin.
Lifting the Curse using Principal Component Analysis
Many problems in analytics are assumed to suffer from incomplete data with too few features. This has led to a common myth:
Having more features and more data will always improve the accuracy of solving the machine learning problem.
In reality, this is a curse more than a boon.
Sometimes we face a situation where there are a lot of features but few data points. Fitting a model in this scenario often leads to low accuracy, even with many features. This problem is encountered so often that it has its own name: the “curse of dimensionality”.
Curse of dimensionality in layman’s terms
In layman’s language, the curse of dimensionality refers to the phenomenon where an increase in the number of features results in a decrease in model accuracy.
We can also look at the curse of dimensionality another way: as the number of features grows, the model complexity grows as well (more precisely, it grows exponentially).
There are two ways to stay away from this curse of dimensionality.
- Add more data to the problem.
- Reduce the number of features in the data.
Adding data may not be possible in many scenarios, as the data is often limited to what was collected. Reducing the number of features is therefore usually the more practical option. This technique is known as “dimensionality reduction”, and PCA, or principal component analysis, is a very popular dimensionality reduction technique.
Shlens’ paper nuggets on Principal component analysis
Principal component analysis is aptly described in Jonathan Shlens’ famous tutorial paper.
The paper explains that even a simple problem, such as recording the motion of a pendulum that moves in only one direction, becomes hard if we are unaware of the exact direction of motion. In that case, at least three cameras are required to record the movement, provided we are able to place the cameras perpendicular to each other.
If we cannot place the cameras perpendicular to each other, even more cameras are required. The problem keeps growing: as the available information decreases, more and more cameras are needed.
The PCA technique transforms our features (the cameras, in this example) into a new space, representing them as a set of new orthogonal variables so that the problem can be observed with a reduced set of features.
These orthogonal features are known as “Principal Components”.
In practice, our data is like the motion of the pendulum. If we had complete knowledge of the system, we would require only a small number of features. Without that knowledge, we have to observe the system using a set of features that convey maximum information when they are orthogonal. This is done using principal component analysis.
The new set of features which are produced after PCA transformation are linearly uncorrelated as they are orthogonal. Moreover, the features are arranged in decreasing order of their importance.
This means that the first principal component alone will explain a very large portion of the variance in the data. The second principal component will explain less than the first but more than all the remaining components. The last principal component will explain only a small portion of the variance. Typically, one can run PCA and keep the top principal components such that together they explain most of the variance. In most analytical problems, explaining 95-99% of the variance is considered very high.
PCA Conceptual Background
Before getting our hands dirty with PCA in R, let’s understand the concept with a simple example.
Suppose we have a dataset with ‘m’ data points and ‘n’ features. We can represent this dataset as an m*n matrix A:
A = [ A11  A12  ...  A1n
      A21  A22  ...  A2n
      ...
      Am1  Am2  ...  Amn ]
We will transform this matrix A into A’ such that A’ is an m*k matrix with k<n. The number ‘k’ represents how many of the transformed features we keep.
How do we transform a given set of features into a new feature set such that they are orthogonal?
The answer is the eigenvectors of the matrix.
The eigenvectors of a covariance matrix are orthogonal to each other, so transforming our features in the directions of the eigenvectors will also make them orthogonal. But wait! Before transforming a matrix, it is always recommended to normalize the data.
If the matrix is not normalized, the transformation will always favor the feature with the largest scale of values. This is why PCA is sensitive to the relative scaling of the original variables.
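As a quick illustration of this sensitivity, here is a minimal sketch on a toy data frame (the two columns are made up purely to show the effect). princomp(..., cor = TRUE) works on the correlation matrix, which is equivalent to standardizing the features first:

# Toy data: two features on very different scales (for illustration only)
set.seed(42)
toy <- data.frame(
  small_scale = rnorm(100, mean = 0, sd = 1),      # values around 0-1
  large_scale = rnorm(100, mean = 0, sd = 1000)    # values in the thousands
)

# Unscaled PCA: the first component is dominated by the large-scale feature
princomp(toy)$loadings[, 1]

# Scaled PCA (cor = TRUE uses the correlation matrix, i.e. standardized features)
princomp(toy, cor = TRUE)$loadings[, 1]

# Equivalently, standardize the columns explicitly before PCA
toy_scaled <- scale(toy)   # each column now has mean 0 and standard deviation 1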
To see how much each original feature contributes to the new components, we observe the loadings of the PCA. Loadings are the factors that show how much load is carried by a particular feature, so the feature with the maximum loading value can be considered the most influential one.
The loadings are arranged so that the most significant component comes first, followed by the subsequent components. Based on the tolerance or precision level required, we select the number of components to keep. One important prerequisite for PCA is that all features are on the same scale, which is why we normalize the original data. To capture the variances properly, we also need to consider the distribution of the data, and we often transform the data towards a normal distribution.
Variance is a good measure of how the data is distributed: if we maximize the variance captured, we maximize the information retained. PCA can be applied to any distribution, but variance is the natural parametric measure for the normal distribution, and data sets generally tend to follow a normal distribution (or can easily be transformed to follow one), so variance analysis under an assumed normal distribution is what we normally use in PCA.
To capture the variance of each feature with respect to the other features, we compute the variance-covariance matrix of the features, then find the eigenvalues of that matrix, and finally its eigenvectors, which give us the principal components.
We now construct the covariance matrix of A. After centering each column of A, we multiply the transpose At with A and divide by (m - 1). This gives us an n*n matrix Cov A:
Cov A = At.A / (m - 1)
We can now simply calculate the eigenvectors and eigenvalues of the covariance matrix Cov A. The eigenvalues represent the relative variance of the data along each eigenvector direction.
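To make the recipe concrete, here is a minimal sketch in R that follows these steps on a small random matrix (the matrix A and the choice k = 2 below are just placeholders for illustration):

# A small random m x n matrix standing in for real data
set.seed(1)
A <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3)   # m = 20 data points, n = 3 features

# 1. Centre each column (scale() can also divide by the standard deviation)
A_centered <- scale(A, center = TRUE, scale = FALSE)

# 2. n x n covariance matrix: t(A) %*% A / (m - 1) on the centred data
Cov_A <- t(A_centered) %*% A_centered / (nrow(A) - 1)   # same result as cov(A)

# 3. Eigen decomposition: eigenvalues = variances, eigenvectors = principal directions
eig <- eigen(Cov_A)

# 4. Project onto the top k eigenvectors to get the m x k matrix A'
k <- 2
A_prime <- A_centered %*% eig$vectors[, 1:k]
dim(A_prime)   # 20 x 2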
Principal component analysis implementation in R programming language
Now that we understand the concept of PCA, we can implement it in the R programming language.
The princomp() function in R calculates the principal components of any data. We will also compare our results by calculating eigenvectors and eigenvalues separately. Let’s use the IRIS dataset.
Let’s start by loading the dataset.
# Taking the numeric part of the IRIS data
data_iris <- iris[1:4]
The iris dataset has 150 observations (rows) and 4 numeric features (columns).
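A quick check of the shape and contents of what we just loaded:

# Quick sanity check on the loaded data
dim(data_iris)    # 150 4
head(data_iris)   # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width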
Let’s use the cov() function to calculate the covariance matrix of the loaded iris data set.
# Calculating the covariance matrix
Cov_data <- cov(data_iris)
The next step is to calculate the eigenvalues and eigenvectors.
We can use the eigen() function to do this automatically for us.
# Find out the eigenvectors and eigenvalues using the covariance matrix
Eigen_data <- eigen(Cov_data)
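As a small optional sanity check, we can confirm the claim made earlier that the eigenvectors are orthogonal: stacking them as columns of a matrix V, the product t(V) %*% V should be (numerically) the 4 x 4 identity matrix.

# Eigenvectors of a covariance matrix are orthonormal, so V'V is the identity (up to rounding)
round(t(Eigen_data$vectors) %*% Eigen_data$vectors, 10)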
We have calculated the eigenvalues and eigenvectors from the data. We will now look at the PCA function princomp(), which calculates these values automatically.
Let’s calculate the components and compare the values.
# Using the inbuilt function
PCA_data <- princomp(data_iris, cor = FALSE)
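As a side note, base R also provides prcomp(), an SVD-based alternative to princomp() that takes logical centering and scaling arguments directly; a minimal sketch of the equivalent call:

# prcomp() is an SVD-based alternative; it uses the n - 1 divisor like cov()
PCA_data_alt <- prcomp(data_iris, center = TRUE, scale. = FALSE)
PCA_data_alt$sdev^2   # component variances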
Let’s now compare the output variances
# Let's now compare the output variances
Eigen_data$values
Output
# The output is 4.22824171 0.24267075 0.07820950 0.02383509
PCA_data$sdev^2
Output
# The output is 4.20005343 0.24105294 0.07768810 0.02367619
There is a slight difference between the two sets of values because princomp() computes variances with the divisor N (the number of observations), while cov() uses N - 1. Apart from this scaling, the outputs are essentially the same.
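A quick way to confirm the divisor explanation is to rescale one set of values and check that it reproduces the other:

# cov()/eigen() use (n - 1); princomp() uses n, so rescaling should match
n <- nrow(data_iris)                # 150
Eigen_data$values * (n - 1) / n     # ~ 4.20005343 0.24105294 0.07768810 0.02367619
PCA_data$sdev^2

Next, let's also compare the eigenvectors from both approaches.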
PCA_data$loadings[, 1:4]
Eigen_data$vectors
Output
                  Comp.1      Comp.2      Comp.3     Comp.4
Sepal.Length  0.36138659 -0.65658877 -0.58202985  0.3154872
Sepal.Width  -0.08452251 -0.73016143  0.59791083 -0.3197231
Petal.Length  0.85667061  0.17337266  0.07623608 -0.4798390
Petal.Width   0.35828920  0.07548102  0.54583143  0.7536574

            [,1]        [,2]        [,3]       [,4]
[1,]  0.36138659 -0.65658877 -0.58202985  0.3154872
[2,] -0.08452251 -0.73016143  0.59791083 -0.3197231
[3,]  0.85667061  0.17337266  0.07623608 -0.4798390
[4,]  0.35828920  0.07548102  0.54583143  0.7536574
This time the eigenvectors calculated are the same; there is no difference.
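One caveat worth knowing: the sign of an eigenvector is arbitrary (v and -v describe the same direction), so on other runs or datasets the two results may differ by a sign flip in some columns. Comparing absolute values is a simple way to confirm the directions agree:

# Signs of eigenvectors are arbitrary, so compare magnitudes only
all.equal(abs(PCA_data$loadings[, 1:4]),
          abs(Eigen_data$vectors),
          check.attributes = FALSE)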
Let us now understand our model. We transformed our 4 features into 4 new orthogonal components. To know the importance of the first component, we can view the summary of the model.
summary(PCA_data)
Importance of components
Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000
From the Proportion of Variance row, we see that the first component explains 92.5% of the variance in the data, the second principal component explains 5.3%, and so on. This means that using just the first component instead of all 4 features retains about 92.5% of the information while using only one-fourth of the entire feature set.
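The same numbers can be reproduced directly from the component standard deviations, which also gives a simple way to pick the smallest number of components that reaches a chosen variance threshold:

# Proportion and cumulative proportion of variance from the model
prop_var <- PCA_data$sdev^2 / sum(PCA_data$sdev^2)
cum_var  <- cumsum(prop_var)
prop_var   # ~ 0.925 0.053 0.017 0.005
cum_var    # ~ 0.925 0.978 0.995 1.000

# Smallest number of components explaining at least 95% of the variance
which(cum_var >= 0.95)[1]   # 2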
If we want to retain more information, we can take the first two components together and obtain a cumulative explained variance of up to 97.7%. We can also understand how our features are transformed by using the biplot function on our model.
biplot(PCA_data)
The X-axis represents the first principal component. We see that the Petal Width and Petal Length vectors are parallel to the X-axis, so they are almost completely captured by the first principal component. The first component also contains some part of Sepal Length and Sepal Width; the vertical parts of the Sepal Length and Sepal Width vectors are explained by the second principal component.
To determine how many components we should keep after using PCA, we use a scree plot. The screeplot() function in R plots the component variances joined by a line. We look at the plot and find the point of the ‘arm-bend’. This is the point after which the additional variance explained by each component drops sharply and the curve flattens out, becoming nearly parallel to the x-axis.
screeplot(PCA_data, type = "lines")
This plot shows the bend at the second principal component.
Let us now fit two naive Bayes models.
- One over the entire data.
- The second on only the first principal component.
We will calculate the difference in accuracy between these two models.
# Select the first principal component for the second model
model2 <- PCA_data$loadings[, 1]

# For the second model, we need to calculate scores by multiplying our loadings with the data
model2_scores <- as.matrix(data_iris) %*% model2

# Loading libraries for naiveBayes model
library(class)
library(e1071)

# Fitting the first model over the entire data
mod1 <- naiveBayes(iris[, 1:4], iris[, 5])

# Fitting the second model using the first principal component
mod2 <- naiveBayes(model2_scores, iris[, 5])
Accuracy of the first model
# Accuracy for the first model
table(predict(mod1, iris[, 1:4]), iris[, 5])
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
Accuracy of the second model
# Accuracy for the second model
table(predict(mod2, model2_scores), iris[, 5])
We can see that there is a difference of only 3 predictions between the two models. In return for reducing our data to one-fourth of its original size, we lose only a little accuracy, which is a great tradeoff.
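To put a number on that, the overall accuracy of each model is the share of predictions on the diagonal of its confusion table:

# Overall accuracy = correct predictions / total predictions
tab1 <- table(predict(mod1, iris[, 1:4]), iris[, 5])
tab2 <- table(predict(mod2, model2_scores), iris[, 5])
sum(diag(tab1)) / sum(tab1)   # 0.96 for the full-feature model
sum(diag(tab2)) / sum(tab2)   # accuracy using only the first principal component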
With this, we have come to the end of the article. Let’s summarize the key points.
Summary
PCA is a very popular method of dimensionality reduction because it provides a way to easily reduce the dimensions and is easy to understand. For this reason, PCA has been used in various applications from image compression to complex gene comparison. While using PCA, one should keep in mind its limitations well.
PCA is very sensitive to the scale of the data. It will create its initial basis in the direction of the largest variance in the data. Moreover, PCA applies a transformation in which all new components are orthogonal, so the new features may not be interpretable from a business point of view.
Another limitation of PCA is its reliance on the mean and variance of the data. If the data has structure in higher-order moments such as skewness and kurtosis, then PCA may not be the right technique to use. In situations where the features are already orthogonal to each other and uncorrelated, PCA will not produce any useful results except ordering the features in decreasing order of their variances.
PCA is very useful when the data at hand is very large. For example, in image compression, PCA can be used to store the image using only the first few hundred components, which requires far fewer values than storing every pixel.
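As a rough sketch of that idea, the snippet below runs PCA on a simulated grayscale “image” (a plain numeric matrix standing in as a placeholder for real pixel data) and reconstructs it from the first k components; with a real image, the first few hundred components are often enough.

# Simulated grayscale "image": a numeric matrix standing in for real pixel data
set.seed(7)
img <- matrix(runif(256 * 256), nrow = 256, ncol = 256)

# PCA over the pixel columns; prcomp() keeps the column means needed for reconstruction
pca_img <- prcomp(img, center = TRUE, scale. = FALSE)

# Reconstruct an approximation from the first k components only
k <- 50
approx_img <- pca_img$x[, 1:k] %*% t(pca_img$rotation[, 1:k])
approx_img <- sweep(approx_img, 2, pca_img$center, "+")   # add the column means back

# Instead of 256 * 256 pixel values we store roughly 256 * k scores plus 256 * k loadings
dim(approx_img)   # still 256 x 256, but built from far fewer stored numbers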
The entire code used in the article is as follows:
You can also clone the complete code from DataAspirant GitHub.
# Taking the numeric part of the IRIS data
data_iris <- iris[1:4]

# Calculating the covariance matrix
Cov_data <- cov(data_iris)

# Find out the eigenvectors and eigenvalues using the covariance matrix
Eigen_data <- eigen(Cov_data)

# Using the inbuilt function
PCA_data <- princomp(data_iris, cor = FALSE)

# Let's now compare the output variances
Eigen_data$values
# The output is 4.22824171 0.24267075 0.07820950 0.02383509

PCA_data$sdev^2
# The output is 4.20005343 0.24105294 0.07768810 0.02367619

PCA_data$loadings[, 1:4]
Eigen_data$vectors

summary(PCA_data)

biplot(PCA_data)

screeplot(PCA_data, type = "lines")

# Select the first principal component for the second model
model2 <- PCA_data$loadings[, 1]

# For the second model, we need to calculate scores by multiplying our loadings with the data
model2_scores <- as.matrix(data_iris) %*% model2

# Loading libraries for naiveBayes model
library(class)
library(e1071)

# Fitting the first model over the entire data
mod1 <- naiveBayes(iris[, 1:4], iris[, 5])

# Fitting the second model using the first principal component
mod2 <- naiveBayes(model2_scores, iris[, 5])

# Accuracy for the first model
table(predict(mod1, iris[, 1:4]), iris[, 5])

# Accuracy for the second model
table(predict(mod2, model2_scores), iris[, 5])