Principal Component Analysis (PCA) is a technique for reducing the number of predictors by combining them into a smaller set of uncorrelated components. If a dataset has only three predictors, fitting all of them should not be much of a problem. But if a dataset has, say, 50 predictors, PCA comes in handy.

I used the famous Titanic dataset available here.

Next, we need to load our goodies.

Let’s exclude irrelevant predictors.

prcomp() doesn’t work with NA values, so we need to exclude them from the dataset.
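
Here is a minimal sketch of that step. The toy data frame is made up for illustration; its column names only mimic the Titanic file, and na.omit() is one common way to drop incomplete rows, not necessarily the one used in the original code.

```r
# Hypothetical toy data; column names mimic the Titanic file (assumption).
titanic <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Age      = c(22, NA, 26, 35, NA),
  Fare     = c(7.25, 71.28, 7.93, 8.05, 30.00)
)

# Count missing values per column before dropping anything.
na_counts <- colSums(is.na(titanic))
print(na_counts)

# na.omit() drops every row that contains at least one NA.
titanic_complete <- na.omit(titanic)
print(nrow(titanic_complete))
```

Checking the per-column counts first is worthwhile: if one column holds most of the missing values, dropping that column may cost less data than dropping all the affected rows.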

Another constraint is that prcomp() only works with numeric variables and doesn’t have one-hot encoding built in, so we need to do it ourselves.

We need to work on Sex and Embarked.
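
One way to sketch the encoding in base R is with model.matrix(). The toy data below is hypothetical, and the original post may well have used a different helper for this.

```r
# Hypothetical toy data with the two categorical columns (assumption).
titanic <- data.frame(
  Age      = c(22, 38, 26, 35),
  Sex      = factor(c("male", "female", "female", "male")),
  Embarked = factor(c("S", "C", "Q", "S"))
)

# "- 1" removes the intercept so every level gets its own 0/1 column.
# Each factor is encoded separately: a single call with both factors
# would drop one level of the second factor.
sex_dummies      <- model.matrix(~ Sex - 1, data = titanic)
embarked_dummies <- model.matrix(~ Embarked - 1, data = titanic)

# Combine the numeric column with the new indicator columns.
titanic_encoded <- cbind(titanic["Age"], sex_dummies, embarked_dummies)
print(titanic_encoded)
```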

Let’s check one last time if our dataset is ready for prcomp().
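
A quick check along these lines could look as follows. The data frame titanic_ready is a made-up stand-in for the cleaned dataset; the two conditions are the ones mentioned so far, all columns numeric and no NA.

```r
# Hypothetical stand-in for the cleaned dataset (assumption).
titanic_ready <- data.frame(
  Age     = c(22, 38, 26),
  Fare    = c(7.25, 71.28, 7.93),
  Sexmale = c(1, 0, 0)
)

# Inspect column types at a glance.
str(titanic_ready)

# Both conditions must hold before calling prcomp().
all_numeric <- all(vapply(titanic_ready, is.numeric, logical(1)))
no_missing  <- !anyNA(titanic_ready)
print(c(all_numeric = all_numeric, no_missing = no_missing))
```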

Okay. We are ready to use prcomp().

If we don’t set scale. = TRUE, the predictors with the largest variance will dominate the result. Since we are interested in how much of the variability in the data each principal component explains, we need to calculate it from pca$sdev: square the standard deviations to get the variances, then divide each by their total.
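
To make both points concrete, here is a small sketch on simulated data. The columns and their distributions are invented for illustration; only prcomp(), its scale. argument, and the sdev and rotation elements come from base R.

```r
set.seed(42)

# Simulated predictors on very different scales (assumption, not Titanic data).
toy <- data.frame(
  Age     = rnorm(200, mean = 30, sd = 14),
  Fare    = rnorm(200, mean = 32, sd = 50),
  Sexmale = rbinom(200, size = 1, prob = 0.65)
)

# Without scaling, the first component is driven almost entirely by Fare.
pca_raw <- prcomp(toy)
print(round(pca_raw$rotation[, 1], 3))

# With scaling, every predictor enters with unit variance.
pca_scaled <- prcomp(toy, scale. = TRUE)

# Proportion of variance explained: variance of each component over the total.
var_explained <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
print(round(var_explained, 3))
```

The same proportions appear in the "Proportion of Variance" row of summary(pca_scaled), which is a handy cross-check.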

Now we are ready for the plot. Let’s start by creating a new data frame.

Next is the explained variance of each principal component.
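
The data frame behind such a plot might be built like this. The sketch uses the built-in mtcars data as a stand-in so that it runs on its own, and the names pca_var, component and var_explained are my own choices.

```r
# mtcars is a stand-in for the prepared Titanic data (assumption).
pca <- prcomp(mtcars, scale. = TRUE)

# Variance explained by each component, as a proportion of the total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# One row per component; the factor levels keep PC1, PC2, ... in order
# instead of sorting them alphabetically (PC1, PC10, PC11, PC2, ...).
pc_labels <- paste0("PC", seq_along(var_explained))
pca_var <- data.frame(
  component     = factor(pc_labels, levels = pc_labels),
  var_explained = var_explained
)
print(pca_var)
```

A data frame in this shape can be handed to any plotting function, for example as a bar chart with component on the x-axis.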

So, the first principal component explains around 25% of the variance. A cumulative plot is probably more useful.
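
The cumulative version is just a running sum of the per-component proportions. Again a sketch on the built-in mtcars data as a stand-in, so the numbers will differ from the Titanic ones.

```r
# mtcars is a stand-in for the prepared Titanic data (assumption).
pca <- prcomp(mtcars, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Running total: entry k is the share of variance explained by PC1..PCk.
cum_var <- cumsum(var_explained)

pca_cum <- data.frame(
  n_components = seq_along(cum_var),
  cum_var      = cum_var
)
print(pca_cum)
```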

Now it is time to make a decision: how many components should we exclude? Maybe three, since ten components already explain over 95% of the variance.
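
If you prefer to let a threshold make the call, something like the following sketch works. The 95% cutoff mirrors the figure above, mtcars is once more a stand-in dataset, and the names n_keep and reduced are mine.

```r
# mtcars is a stand-in for the prepared Titanic data (assumption).
pca <- prcomp(mtcars, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2 / sum(pca$sdev^2))

# Smallest number of components whose cumulative share reaches the cutoff.
threshold <- 0.95
n_keep <- which(cum_var >= threshold)[1]
print(n_keep)

# pca$x holds the component scores; keep only the first n_keep columns.
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]
print(dim(reduced))
```

The reduced matrix can then replace the original predictors in a model. The cutoff itself is a judgment call, so it is worth trying a couple of values.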