Let’s use a machine learning technique called Logistic Regression to forecast employee resignations.
We got some pretty good insights from the EDA section (here). Although we obtained a rough profile of the employees who were likely to resign, we can do better than that. Since we are dealing with a binary outcome variable, let’s try Logistic Regression.
I will use the “data2” dataset for the machine learning. Since there are a number of factor variables, we need to do one-hot encoding. As much as I love the “dummies” package for one-hot encoding, sometimes I just want to do it manually, as it is easier to see what went wrong if something does. So in this case, I’ll just use dplyr. In case you visited this page before the EDA, please visit (HR – EDA) to see the necessary prep work.
data3 <- data2 %>%
  mutate(proj3 = ifelse(number_project == 3, 1, 0)) %>%
  mutate(proj4 = ifelse(number_project == 4, 1, 0)) %>%
  mutate(proj5 = ifelse(number_project == 5, 1, 0)) %>%
  mutate(proj6 = ifelse(number_project == 6, 1, 0)) %>%
  mutate(proj7 = ifelse(number_project == 7, 1, 0)) %>%
  mutate(dep_hr = ifelse(department == "hr", 1, 0)) %>%
  mutate(dep_IT = ifelse(department == "IT", 1, 0)) %>%
  mutate(dep_mngt = ifelse(department == "management", 1, 0)) %>%
  mutate(dep_mkt = ifelse(department == "marketing", 1, 0)) %>%
  mutate(dep_prod = ifelse(department == "product_mng", 1, 0)) %>%
  mutate(dep_RandD = ifelse(department == "RandD", 1, 0)) %>%
  mutate(dep_sales = ifelse(department == "sales", 1, 0)) %>%
  mutate(dep_sup = ifelse(department == "support", 1, 0)) %>%
  mutate(dep_tech = ifelse(department == "technical", 1, 0)) %>%
  mutate(sal_med = ifelse(salary == "medium", 1, 0)) %>%
  mutate(sal_high = ifelse(salary == "high", 1, 0)) %>%
  # removing the original variables
  select(-number_project, -department, -salary)
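As an aside, glm() can handle factor variables directly; it one-hot encodes them internally through model.matrix(), so the manual encoding above is mainly a transparency choice. A minimal sketch of what R would build on its own, assuming department and salary are stored as factors in data2:

# Sketch only: the first level of each variable becomes the baseline
# (it gets no column), which is why glm() drops one dummy per factor.
head(model.matrix(~ factor(number_project) + department + salary,
                  data = data2), 3)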
Now I will split 75% of the data into a train set and the rest into a test set.
# Splitting
size <- floor(0.75 * nrow(data3))
set.seed(999)
train_index <- sample(seq_len(nrow(data3)), size = size)
train <- data3[train_index, ]
test <- data3[-train_index, ]
Okay, now it is time to create the model.
logistics <- glm(left ~ ., data=train, family = "binomial") |
Let’s see the result:
> logistics

Call:  glm(formula = left ~ ., family = "binomial", data = train)

Coefficients:
       (Intercept)  satisfaction_level     last_evaluation         avg_mth_hrs             tenure3
         -2.815573           -1.995981            2.256058            0.008536            2.476690
           tenure4             tenure5             tenure6             tenure7             tenure8
          2.607359            4.598674            3.390544          -14.324361          -14.236489
          tenure10           accident1         promotions1               proj3               proj4
        -14.188711           -1.635066           -1.132285           -5.232397           -3.984956
             proj5               proj6               proj7              dep_hr              dep_IT
         -3.295113           -2.466936           15.895408            0.188696           -0.190432
          dep_mngt             dep_mkt            dep_prod           dep_RandD           dep_sales
         -0.200763           -0.023563           -0.348506           -0.568599           -0.033864
           dep_sup            dep_tech             sal_med            sal_high
         -0.039155            0.059123           -0.464789           -1.951573

Degrees of Freedom: 11248 Total (i.e. Null);  11220 Residual
Null Deviance:     12370
Residual Deviance: 5990    AIC: 6048
Oh, that’s a whole lot of numbers. The formula for logistic regression is as follows:
\(P(x) = \frac{1}{1+e^{-q}}\)
where q is
$$q = \beta_0+\beta_1x_1+\dots+\beta_nx_n$$
In our case, \(P(x)\) is the probability that an employee would leave the company, and \(n\) is 28, since the model has 28 predictors (one for every coefficient above besides the intercept, which matches the difference between the null and residual degrees of freedom).
It is certainly doable to do the calculation manually. But why would we, when we have the predict() function! 😀
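Still, to convince ourselves that predict() is doing nothing magical, here is a minimal sketch (using the first test row as an example) that computes q on the link scale and pushes it through the logistic function by hand:

# q = b0 + b1*x1 + ... + bn*xn for one test row (the linear predictor)
q <- predict(logistics, newdata = test[1, ], type = "link")

# Apply the logistic function P(x) = 1 / (1 + e^-q) ourselves
p_manual <- 1 / (1 + exp(-q))

# predict() with type = "response" does both steps in one go
p_auto <- predict(logistics, newdata = test[1, ], type = "response")

all.equal(unname(p_manual), unname(p_auto))  # should be TRUE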
So, let’s see the predictive power of our model.
# Predict!
test$predicted <- predict(logistics, newdata = test, type = "response")
As we are interested in seeing how good our model is, I will round the numbers so that each is either 1 or 0 for easy comparison.
# Round the predicted value
test$round_predicted <- round(test$predicted)
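Note that rounding is simply a 0.5 probability cutoff in disguise. If missing a resignation were costlier than a false alarm, we could pick a different threshold; a hypothetical sketch:

# Essentially equivalent to round(): classify as 1 when P(leave) > 0.5
test$round_predicted <- ifelse(test$predicted > 0.5, 1, 0)

# A stricter, hypothetical cutoff that flags more potential leavers
# test$flagged <- ifelse(test$predicted > 0.3, 1, 0)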
Let’s see the result:
ggplot(test, aes(x = round_predicted, y = left, col = left)) +
  geom_jitter(alpha = 0.3) +
  theme_moma()
That’s not too bad. The bottom right and upper left clusters are the incorrect predictions. Judging from the clusters, we didn’t miss by much. But let’s look at the actual numbers.
test %>%
  mutate(incorrect = ifelse(left == round_predicted, "N", "Y")) %>%
  count(incorrect) %>%
  mutate(Percentage = percent(n / sum(n)))
# A tibble: 2 × 3
  incorrect     n Percentage
  <chr>     <int> <chr>
1 N          3386 90.3%
2 Y           364 9.7%
Not bad. We missed only 9.7%. Pretty good for such a plain, simple method. 🙂
The 90.3% figure is, in fact, derived from something with an official name: the Confusion Matrix. There are several metrics based on the confusion matrix that are used to evaluate the quality of a model.
One of the popular metrics is called Accuracy, which is defined as
\(\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}\)
where
$$\begin{align*}&TP = \text{True Positive}\\&TN = \text{True Negative}\\&FN = \text{False Negative}\\&FP = \text{False Positive}\end{align*}$$
We can calculate Accuracy in this formal form with the following code.
accuracy <- table(test$left, test$round_predicted)
percent(sum(diag(accuracy)) / sum(accuracy))
> accuracy <- table(test$left, test$round_predicted)
> percent(sum(diag(accuracy)) / sum(accuracy))
[1] "90.3%"
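To make the TP/TN/FP/FN mapping explicit, here is a minimal sketch that pulls each cell out of the table and recomputes the same number (assuming left is coded 0/1, with actual values on the rows and predictions on the columns):

cm <- table(actual = test$left, predicted = test$round_predicted)
TN <- cm["0", "0"]  # predicted stay,  actually stayed
FP <- cm["0", "1"]  # predicted leave, actually stayed
FN <- cm["1", "0"]  # predicted stay,  actually left
TP <- cm["1", "1"]  # predicted leave, actually left
percent((TP + TN) / (TP + TN + FP + FN))  # the same 90.3%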
Yep, it is exactly the same 90.3% as from the dplyr code. Now, let’s focus on what we missed. Among the 364 incorrect predictions, where did we miss the most?
test %>%
  mutate(incorrect = ifelse(left == round_predicted, "N", "Y")) %>%
  filter(incorrect == "Y") %>%
  count(round_predicted) %>%
  mutate(Percentage = percent(n / sum(n)))
# A tibble: 2 × 3
  round_predicted     n Percentage
            <dbl> <int> <chr>
1               0   175 48.08%
2               1   189 51.92%
It is about 50% in each category, with just a bit more on the “Yes” side: we predicted 189 employees would leave but they didn’t (false positives), while 175 employees left although we predicted they would stay (false negatives).
TL;DR… Logistic Regression correctly predicted 90.3% of the 25% test set. The incorrect predictions were spread evenly between the two groups. The result is not bad considering the costs (the time it took to prepare the data and to implement the model).