Principal Component Analysis (PCA) is a technique for reducing the number of predictors. If a dataset has only three predictors, fitting all of them is not much of a problem. But if a dataset has, say, 50 predictors, PCA comes in handy.
I used the famous Titanic dataset available here.
Next, we need to load our goodies.
##### Load Goodies #####
library(tidyverse)
library(dummies)

##### Theme Moma #####
theme_moma <- function(base_size = 12, base_family = "Helvetica") {
  theme(
    plot.background = element_rect(fill = "#F7F6ED"),
    legend.key = element_rect(fill = "#F7F6ED"),
    legend.background = element_rect(fill = "#F7F6ED"),
    panel.background = element_rect(fill = "#F7F6ED"),
    panel.border = element_rect(colour = "black", fill = NA, linetype = "dashed"),
    panel.grid.minor = element_line(colour = "#7F7F7F", linetype = "dotted"),
    panel.grid.major = element_line(colour = "#7F7F7F", linetype = "dotted")
  )
}
##### Data #####
data <- read.csv('train.csv')
Let’s exclude irrelevant predictors.
##### Irrelevant Predictors #####
data2 <- select(data, -Name, -Ticket, -Cabin)
prcomp() doesn’t work with NA values, so we need to exclude them from the dataset.
##### Omission #####
data3 <- na.omit(data2)
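Since na.omit() performs listwise deletion, it is worth checking how many rows we lose along the way (a quick sanity check, not part of the original pipeline):

```r
##### Rows Lost to NA #####
# Difference in row counts before and after dropping incomplete cases
nrow(data2) - nrow(data3)
```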
Another constraint is that prcomp() only works with numeric variables and has no built-in one-hot encoding, so we need to encode the factor variables ourselves.
glimpse(data3)
> glimpse(data3)
Observations: 714
Variables: 9
$ PassengerId <int> 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25,...
$ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0...
$ Pclass <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1, 3, 3, 1, 1, 2, 1...
$ Sex <fctr> male, female, female, female, male, male, male, female, female, female, female...
$ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, 31, 35, 34, 15, 28...
$ SibSp <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0, 3, 1, 3, 0, 0, 1...
$ Parch <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0, 1, 5, 2, 0, 0, 0...
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 11.1333, 30.0708, 1...
$ Embarked <fctr> S, C, S, S, S, S, S, S, C, S, S, S, S, S, S, Q, S, S, S, Q, S, S, S, S, C, S, ...
We need to work on Sex and Embarked.
##### Dummy Variable #####
data4 <- dummy.data.frame(data3, names = c('Sex','Embarked'))
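The dummies package has since been archived on CRAN, so if dummy.data.frame() is not available, base R’s model.matrix() can produce the same one-hot encoding. A sketch (data4_alt is my own name, not from the original code):

```r
##### Dummy Variables with Base R (alternative sketch) #####
# model.matrix() expands a factor into indicator columns;
# the "+ 0" keeps every level instead of dropping one as a reference.
data4_alt <- cbind(
  data3[, !(names(data3) %in% c('Sex', 'Embarked'))],
  model.matrix(~ Sex + 0, data3),
  model.matrix(~ Embarked + 0, data3)
)
```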
Let’s check one last time whether our dataset is ready for prcomp().
##### NA #####
sum(is.na(data4))
> sum(is.na(data4))
[1] 0
##### Glimpse #####
glimpse(data4)
> glimpse(data4)
Observations: 714
Variables: 13
$ PassengerId <int> 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25,...
$ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0...
$ Pclass <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1, 3, 3, 1, 1, 2, 1...
$ Sexfemale <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0...
$ Sexmale <int> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1...
$ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, 31, 35, 34, 15, 28...
$ SibSp <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0, 3, 1, 3, 0, 0, 1...
$ Parch <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0, 1, 5, 2, 0, 0, 0...
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 11.1333, 30.0708, 1...
$ Embarked <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ EmbarkedC <int> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1...
$ EmbarkedQ <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
$ EmbarkedS <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0...
Okay, we are ready to use prcomp().
##### Fitting PCA #####
pca <- prcomp(data4, scale. = TRUE)
If we don’t set scale. = TRUE, predictors with large variances (such as Fare) will dominate the principal components. Since we are interested in how much of the variability in the data each principal component explains, we compute the proportion of variance explained (PVE) from pca$sdev.
##### Explained Variance #####
pca_sdev2 <- pca$sdev^2
pve <- pca_sdev2 / sum(pca_sdev2)
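As a cross-check, summary() on a prcomp object reports the same quantities for each component, so it should agree with the manual calculation:

```r
##### Cross-check #####
# The "Proportion of Variance" row of summary(pca) should match pve,
# and "Cumulative Proportion" should match cumsum(pve).
summary(pca)
```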
Now we are ready for the plot. Let’s start by creating a new data frame.
##### DF for Plotting #####
pca_ggplot <- data.frame(Principal_Component = 1:13,
                         pve = pve,
                         cumulative = cumsum(pve))
Next, let’s plot the explained variance of each principal component.
##### Explained Variance #####
pca_ggplot %>%
  ggplot(aes(x = Principal_Component, y = pve)) +
  geom_line() +
  geom_point() +
  theme_moma()
So the first principal component explains only around 25% of the variance. The cumulative plot is probably more informative.
##### Cumulative Explained Variance #####
pca_ggplot %>%
  ggplot(aes(x = Principal_Component, y = cumulative)) +
  geom_line() +
  geom_point() +
  theme_moma()
Now it is time to make a decision: how many components should we exclude? Maybe three, since the first ten components already explain over 95% of the variance.
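If we keep ten components, the reduced dataset is simply the first ten columns of the score matrix pca$x, which could replace the original predictors in a downstream model (a sketch of the next step, not part of the original post; data_reduced is my own name):

```r
##### Reduced Dataset #####
# Each column of pca$x holds the observations' scores on one principal component.
data_reduced <- as.data.frame(pca$x[, 1:10])
dim(data_reduced)  # 714 rows, 10 columns
```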