This is a tutorial on how to run a PCA using FactoMineR, and visualize the result using ggplot2.

Introduction

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (Wikipedia).

PCA is a useful tool for exploring patterns in high-dimensional data (data with lots of variables). It does this by constructing new variables, or principal components, each of which is a weighted combination of all of the variables we start with. The weights can be used to identify which of our variables are best at capturing the variation in our data.

In this post, I am not going to spend much time on the details of running a PCA and interpreting the results. Rather, I want to show you a simple way of making easily customizable PCA plots using ggplot2.

Let’s get started!

Packages

First, we can load a few packages.
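The package list is not shown here, so the one below is my best guess from the tools this post names: FactoMineR for the PCA, ggplot2 for the plots (it also ships the diamonds data), and cowplot for combining plots at the end.

```r
# Packages assumed from the tools named in this post
library(FactoMineR)  # PCA() and plot.PCA()
library(ggplot2)     # plotting, plus the diamonds dataset
library(cowplot)     # plot_grid(), ggdraw(), draw_plot()
```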

Setting up the data

The dataset I’ll be using is the ‘diamonds’ dataset, which ships with ggplot2 and contains data on almost 54,000 diamonds. The variables are:

price = price in US dollars ($326–$18,823)

carat = weight of the diamond (0.2–5.01)

cut = quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color = diamond colour, from J (worst) to D (best)

clarity = a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x = length in mm (0–10.74)

y = width in mm (0–58.9)

z = depth in mm (0–31.8)

depth = total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

table = width of top of diamond relative to widest point (43–95)

Let’s load the data.
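A minimal way to do this, assuming ggplot2 is installed, is to pull the dataset straight from the package and take a quick look at its structure.

```r
library(ggplot2)

# diamonds ships with ggplot2; data() copies it into the workspace
data(diamonds, package = "ggplot2")

dim(diamonds)  # number of rows and columns
str(diamonds)  # cut, color and clarity are ordered factors; the rest are numeric
```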

Because the dataset is fairly large, I am going to trim it down to a random subset.
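Here is one way to take the subset with base R. The subset size of 1000 rows and the seed are my own assumptions, chosen only to keep the plots readable and the example reproducible.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed, only so the subset is reproducible

# Draw 1000 rows without replacement (1000 is an assumed size)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]

nrow(dat)
```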

An often-overlooked issue is the need to scale the data before running a PCA. Because PCA seeks the directions of maximum variance, variables measured on large scales (and so with large variances) will dominate the first components, whatever their real importance. Centering (subtracting the mean of a variable from each observation) and scaling (dividing each observation by the standard deviation of that variable) will deal with this.
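A sketch of centering and scaling with base R's scale(), applied to every numeric column. The subset size and seed are assumptions repeated here so the block runs on its own. Note that FactoMineR's PCA() also standardizes the variables by default (scale.unit = TRUE), so doing it by hand is mainly useful for seeing what happens.

```r
library(ggplot2)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size

# Standardize every numeric column: subtract the mean, divide by the SD
num_cols <- vapply(dat, is.numeric, logical(1))
dat[num_cols] <- lapply(dat[num_cols], function(v) as.numeric(scale(v)))

# Check the result column by column
round(colMeans(dat[num_cols]), 10)
vapply(dat[num_cols], sd, numeric(1))
```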

Now the mean and standard deviation of the numerical variables should be 0 and 1 respectively.

On to the PCA.

Running the PCA
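The call below is a sketch of how the PCA could be run with FactoMineR's PCA() function. I am assuming the seven numeric variables are the active ones and that cut, color and clarity are passed as supplementary categorical variables through quali.sup, so they can be used for plotting later without influencing the axes. The object name pca1 matches the one used further down.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size

# Seven numeric variables are active; cut, color and clarity (columns 8 to 10
# of the reordered data frame) are supplementary and do not shape the axes
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)

# Eigenvalues and percentage of variance explained by each component
pca1$eig
```

Setting graph = FALSE stops PCA() from drawing its default plots straight away.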

We can plot the PCA, using ‘plot.PCA’.
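For example, under the same assumed setup (a 1000-row subset, with the categorical variables as supplementary), the choix argument switches between the individuals plot and the variables plot.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)

# Individuals (one point per diamond); labels are switched off to reduce clutter
plot.PCA(pca1, choix = "ind", label = "none")

# Variables, drawn on the correlation circle
plot.PCA(pca1, choix = "var")
```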

Visualizing the PCA using ggplot2

Here’s how we can do it with ggplot2. First, we extract the information we want from our ‘pca1’ object.
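A minimal sketch of that extraction: the coordinates of each diamond on the components live in pca1$ind$coord, so the first two columns can be copied back into the data frame. The column names pc1 and pc2 are my own choice, and the setup lines are repeated so the block runs on its own.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)

# Scores of each diamond on the first two components
dat$pc1 <- pca1$ind$coord[, 1]
dat$pc2 <- pca1$ind$coord[, 2]

head(dat[, c("cut", "pc1", "pc2")])
```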

We also need to extract the data for the variable contributions to each of the PC axes.
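One way to do this is sketched below. I am assuming the variable coordinates in pca1$var$coord are what we want, since for a standardized PCA these are the correlations between each variable and each axis, which is what a unit-circle plot shows. The percentage contributions are stored separately in pca1$var$contrib. The names pca_vars and vars are my own.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)

# Coordinates of the active variables on each axis (columns Dim.1, Dim.2, ...)
pca_vars <- as.data.frame(pca1$var$coord)
pca_vars$vars <- rownames(pca_vars)  # keep the variable names for labelling

pca_vars
```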

By convention, the variable contribution plot is drawn inside a circle of radius 1 centred on the origin. Here’s some code to make one.
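A small helper along these lines returns the points of a circle as a data frame that can later be drawn with geom_path(). The function name and its arguments are hypothetical, not part of any package.

```r
# Hypothetical helper: n points on a circle, returned as a data frame
circle_points <- function(center = c(0, 0), radius = 1, n = 200) {
  theta <- seq(0, 2 * pi, length.out = n)
  data.frame(x = center[1] + radius * cos(theta),
             y = center[2] + radius * sin(theta))
}

circ <- circle_points()  # unit circle centred on the origin
head(circ)
```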

Now we can make our initial plot of the PCA.
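A bare-bones version might look like this: a scatterplot of the first two components with reference lines through the origin. The setup lines (subset size, seed, column names pc1 and pc2) are assumptions repeated so the block runs on its own.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)
dat$pc1 <- pca1$ind$coord[, 1]
dat$pc2 <- pca1$ind$coord[, 2]

# Scores on PC1 and PC2, with dashed lines marking the origin
p <- ggplot(dat, aes(x = pc1, y = pc2)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_point()
p
```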

And we can customize it a bit. For example, coloring and shaping the points by cut.
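As a sketch, mapping both colour and shape to cut only takes two extra aesthetics. ggplot2 may warn that shapes are not advised for an ordered factor, but the plot is still drawn. The setup is repeated under the same assumptions as before.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)
dat$pc1 <- pca1$ind$coord[, 1]
dat$pc2 <- pca1$ind$coord[, 2]

# Colour and shape both follow the cut category
p <- ggplot(dat, aes(x = pc1, y = pc2, colour = cut, shape = cut)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_point(alpha = 0.8)
p
```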

We might want to change the shapes that ggplot2 is using by default, and also, for example, set the ‘good’ and ‘very good’ diamonds to have the same shape.
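One way to do that is scale_shape_manual() with a named vector, giving ‘Good’ and ‘Very Good’ the same shape code. The particular shape numbers below are arbitrary choices of mine.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)
dat$pc1 <- pca1$ind$coord[, 1]
dat$pc2 <- pca1$ind$coord[, 2]

# Arbitrary shape codes; "Good" and "Very Good" deliberately share shape 17
cut_shapes <- c("Fair" = 16, "Good" = 17, "Very Good" = 17,
                "Premium" = 15, "Ideal" = 18)

p <- ggplot(dat, aes(x = pc1, y = pc2, colour = cut, shape = cut)) +
  geom_point(alpha = 0.8) +
  scale_shape_manual(values = cut_shapes)
p
```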

Now the cut categories ‘Good’ and ‘Very Good’ have different colors, but are represented by the same shape. We can also add confidence ellipses, make the theme look nicer, and fix the axis labels.
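Here is a sketch of those three tweaks together. I am assuming stat_ellipse() for the ellipses; strictly speaking it draws a 95% data ellipse for each group (assuming a bivariate normal distribution) rather than a confidence region for the group mean. The axis labels take the percentage of variance from the second column of pca1$eig.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)
dat$pc1 <- pca1$ind$coord[, 1]
dat$pc2 <- pca1$ind$coord[, 2]

# Percentage of variance explained by PC1 and PC2, for the axis labels
pct <- round(pca1$eig[1:2, 2], 1)

p <- ggplot(dat, aes(x = pc1, y = pc2, colour = cut, shape = cut)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_point(alpha = 0.8) +
  stat_ellipse(type = "norm", level = 0.95) +  # one 95% data ellipse per cut
  scale_shape_manual(values = c("Fair" = 16, "Good" = 17, "Very Good" = 17,
                                "Premium" = 15, "Ideal" = 18)) +
  labs(x = paste0("PC1 (", pct[1], "%)"),
       y = paste0("PC2 (", pct[2], "%)"),
       colour = "Cut", shape = "Cut") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        panel.border = element_rect(fill = NA))
p
```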

Now, for plotting the variable contributions, we can use the following code.
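A sketch of such a plot: the unit circle drawn with geom_path(), one arrow per variable from the origin to its coordinates on the first two axes, and a text label just beyond each arrow tip. The object names and the 1.1 label offset are my own choices, and I am again assuming pca1$var$coord is the quantity to plot.

```r
library(ggplot2)
library(FactoMineR)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)

# Variable coordinates and a unit circle to frame them
pca_vars <- as.data.frame(pca1$var$coord)
pca_vars$vars <- rownames(pca_vars)
theta <- seq(0, 2 * pi, length.out = 200)
circ <- data.frame(x = cos(theta), y = sin(theta))

vars_p <- ggplot() +
  geom_path(data = circ, aes(x = x, y = y), colour = "grey60") +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_segment(data = pca_vars,
               aes(x = 0, y = 0, xend = Dim.1, yend = Dim.2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = pca_vars,
            aes(x = Dim.1 * 1.1, y = Dim.2 * 1.1, label = vars), size = 3) +
  coord_equal() +  # keep the circle round
  labs(x = "PC1", y = "PC2") +
  theme_minimal()
vars_p
```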

And now let’s put the PCA scatterplot and the variable contribution plots together using cowplot.
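With cowplot, plot_grid() places the two plots side by side. The sketch below rebuilds simplified versions of both plots so that it runs on its own; the panel labels and relative widths are arbitrary.

```r
library(ggplot2)
library(FactoMineR)
library(cowplot)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)
dat$pc1 <- pca1$ind$coord[, 1]
dat$pc2 <- pca1$ind$coord[, 2]
pca_vars <- as.data.frame(pca1$var$coord)
pca_vars$vars <- rownames(pca_vars)

# Simplified scatterplot of the scores
p <- ggplot(dat, aes(x = pc1, y = pc2, colour = cut)) +
  geom_point(alpha = 0.8) +
  theme_minimal()

# Simplified variable plot
vars_p <- ggplot(pca_vars) +
  geom_segment(aes(x = 0, y = 0, xend = Dim.1, yend = Dim.2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(aes(x = Dim.1 * 1.1, y = Dim.2 * 1.1, label = vars), size = 3) +
  coord_equal() +
  theme_minimal()

# Side by side, with the scatterplot given more room
combined <- plot_grid(p, vars_p, labels = c("A", "B"), rel_widths = c(1.4, 1))
combined
```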

A more flexible approach to combining plots would be to use the ‘ggdraw’ function. We can use this to embed the variable contribution plot into the scatterplot.
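As a sketch, ggdraw() sets up a drawing canvas and draw_plot() places each plot on it using coordinates between 0 and 1. The inset position and size below are guesses that would need adjusting to wherever the scatterplot has empty space, and the two plots are simplified versions rebuilt so the block runs on its own.

```r
library(ggplot2)
library(FactoMineR)
library(cowplot)

set.seed(123)
dat <- as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]  # assumed subset size
vars <- c("carat", "depth", "table", "price", "x", "y", "z",
          "cut", "color", "clarity")
pca1 <- PCA(dat[, vars], quali.sup = 8:10, graph = FALSE)
dat$pc1 <- pca1$ind$coord[, 1]
dat$pc2 <- pca1$ind$coord[, 2]
pca_vars <- as.data.frame(pca1$var$coord)
pca_vars$vars <- rownames(pca_vars)

p <- ggplot(dat, aes(x = pc1, y = pc2, colour = cut)) +
  geom_point(alpha = 0.8) +
  theme_minimal()

vars_p <- ggplot(pca_vars) +
  geom_segment(aes(x = 0, y = 0, xend = Dim.1, yend = Dim.2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(aes(x = Dim.1 * 1.1, y = Dim.2 * 1.1, label = vars), size = 2.5) +
  coord_equal() +
  theme_void()  # no axes, so the inset stays uncluttered

# Scatterplot fills the canvas; the variable plot sits in the top-left corner
inset <- ggdraw() +
  draw_plot(p, x = 0, y = 0, width = 1, height = 1) +
  draw_plot(vars_p, x = 0.08, y = 0.62, width = 0.32, height = 0.32)
inset
```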