As usual, we need to load the goodies.
```r
##### Load Libraries #####
library(e1071)
library(tidyverse)
```
Instead of creating a new dataset, we will just use the diamonds data from ggplot2.
```r
##### Load Data #####
data <- diamonds
```
We will try to predict the quality of the cut using the other nine variables as predictors.
svm() needs the dependent variable to be a factor before it goes into the function. So, let's check.
```r
##### Glimpse #####
glimpse(data)
```
```
> glimpse(data)
Observations: 500
Variables: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.30, 0.23, 0.22, 0.3...
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Very Good, Fair, Very ...
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I, E, H, J, J, G, I, ...
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, SI1, SI2, SI2, I1, ...
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64.0, 62.8, 60.4, 62....
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58, 54, 54, 56, 59, 5...
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 342, 344, 345, 345, 3...
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.25, 3.93, 3.88, 4.3...
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.28, 3.90, 3.84, 4.3...
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.73, 2.46, 2.33, 2.7...
```
Nice! cut is already a factor variable, so that's one task we can skip.
As my laptop is not that powerful and svm() takes a long time to run, let's use only the first 500 observations.
```r
##### Subset #####
data <- data[1:500, ]
```
Next, we fit the data.
```r
##### Fit #####
svm_1 <- svm(cut ~ .,
             data = data
             # kernel = "radial",
             # cost = 1,
             # gamma = 1
             )

summary(svm_1)
```
```
> summary(svm_1)

Call:
svm(formula = cut ~ ., data = data)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.04761905

Number of Support Vectors:  427

 ( 107 124 51 124 21 )

Number of Classes:  5

Levels:
 Fair Good Very Good Premium Ideal
```
Kernel, cost, and gamma are three critical parameters in svm(). If we don't specify cost and gamma, the function falls back to a single default for each: cost = 1, and gamma = 1 divided by the number of columns in the expanded model matrix (hence the 0.04761905 above). As for the kernel, a simple EDA should give a glimpse into the data distribution; chances are the classes are not linearly separable, but it's worth checking. If we don't specify a kernel, the function defaults to the radial one.
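To see how much the kernel choice matters, here is a quick sketch (my addition, not from the original walkthrough) that fits the same model with the default radial kernel and with a linear one, then compares training accuracy; the exact numbers depend on the subset used:

```r
library(e1071)
library(ggplot2)  # for the diamonds dataset

data <- diamonds[1:500, ]

# Defaults when unspecified: kernel = "radial", cost = 1,
# gamma = 1 / (number of columns in the expanded model matrix)
svm_radial <- svm(cut ~ ., data = data)
svm_linear <- svm(cut ~ ., data = data, kernel = "linear")

# Training accuracy for each kernel (optimistic, but fine for a quick look)
mean(predict(svm_radial, data) == data$cut)
mean(predict(svm_linear, data) == data$cut)
```

Training accuracy alone won't pick the kernel for us, but a large gap is a hint about how non-linear the decision boundary is.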
With svm() itself we can try only one value per parameter at a time. But the package provides a handy function for searching over a whole grid: tune().
```r
##### Tune #####
set.seed(1)
svm_tune <- tune(svm, cut ~ ., data = data,
                 kernel = "radial",
                 ranges = list(cost = seq(0.1, 1, 0.1),
                               gamma = seq(0.1, 1, 0.1)))
```
We need to call set.seed() because tune() performs 10-fold cross-validation by default, and the folds are assigned at random; the seed makes the results reproducible.
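The resampling scheme can be changed through tune.control(). For example, a sketch (my addition, with a smaller grid than the post's so it runs quickly) using 5-fold instead of 10-fold CV:

```r
library(e1071)
library(ggplot2)  # for the diamonds dataset

data <- diamonds[1:500, ]

set.seed(1)  # folds are drawn at random, so seed for reproducibility
svm_tune_5 <- tune(svm, cut ~ ., data = data,
                   kernel = "radial",
                   ranges = list(cost = c(0.1, 1), gamma = c(0.1, 0.3)),
                   tunecontrol = tune.control(sampling = "cross", cross = 5))

svm_tune_5$best.performance  # CV error of the winning combination
```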
```r
summary(svm_tune)
```
```
> summary(svm_tune)

Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation

- best parameters:
 cost gamma
    1   0.3

- best performance: 0.418

- Detailed performance results:
   cost gamma error dispersion
1   0.1   0.1 0.498 0.05452828
2   0.2   0.1 0.464 0.04299871
3   0.3   0.1 0.434 0.04221637
4   0.4   0.1 0.436 0.04402020
```
Yep, the function evaluates every combination of cost and gamma in the grid. Eyeballing the full table for the best model is troublesome, so we can ask for it directly.
```r
svm_tune$best.model
```
```
> svm_tune$best.model

Call:
best.tune(method = svm, train.x = cut ~ ., data = data,
    ranges = list(cost = seq(0.1, 1, 0.1), gamma = seq(0.1, 1, 0.1)),
    kernel = "radial")

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.3

Number of Support Vectors:  434
```
When it comes to prediction, unfortunately, predict() cannot deal with an object of class "tune".
```r
##### Predict #####
predict(svm_tune, data)
```
```
> predict(svm_tune, data)
Error in UseMethod("predict") :
  no applicable method for 'predict' applied to an object of class "tune"
```
Therefore we refit an svm() model with the best values from the tune and plug that into predict(). (Alternatively, svm_tune$best.model is itself an svm object, so it can be passed to predict() directly.)
```r
##### Refit and Predict #####
svm_2 <- svm(cut ~ ., data = data, cost = 1, gamma = 0.3)
summary(predict(svm_2, data))
```
```
> summary(predict(svm_2, data))
     Fair      Good Very Good   Premium     Ideal
       24        31       119       157       169
```
Now it works!
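The class counts alone don't tell us how many predictions are actually right. A confusion matrix against the true labels is more informative; here's a sketch (my addition; training accuracy only, so it will be optimistic):

```r
library(e1071)
library(ggplot2)  # for the diamonds dataset

data <- diamonds[1:500, ]
svm_2 <- svm(cut ~ ., data = data, cost = 1, gamma = 0.3)

# Rows are predicted classes, columns the actual ones
conf <- table(predicted = predict(svm_2, data), actual = data$cut)
conf

sum(diag(conf)) / sum(conf)  # training accuracy
```

For an honest error estimate, hold out a test set or lean on the cross-validation error that tune() already reported.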