Let’s use a machine learning technique called Logistic Regression to forecast employee resignations.
We got some pretty good insights from the EDA section (here). Although we obtained a rough profile of the employees who were likely to resign, we can do better than that. Since we are dealing with a binary outcome variable, let’s try Logistic Regression.
I will use the “data2” dataset for the machine learning. Since there are a number of factor variables, we need to do one-hot encoding. As much as I love the “dummies” package for one-hot encoding, sometimes I just want to do it manually, as it is easier to see what went wrong if something does. So in this case, I’ll just use dplyr. In case you visited this page before the EDA, please visit (HR – EDA) to see the necessary prep work.
data3 <- data2 %>%
  mutate(proj3 = ifelse(number_project == 3, 1, 0)) %>%
  mutate(proj4 = ifelse(number_project == 4, 1, 0)) %>%
  mutate(proj5 = ifelse(number_project == 5, 1, 0)) %>%
  mutate(proj6 = ifelse(number_project == 6, 1, 0)) %>%
  mutate(proj7 = ifelse(number_project == 7, 1, 0)) %>%
  mutate(dep_hr = ifelse(department == "hr", 1, 0)) %>%
  mutate(dep_IT = ifelse(department == "IT", 1, 0)) %>%
  mutate(dep_mngt = ifelse(department == "management", 1, 0)) %>%
  mutate(dep_mkt = ifelse(department == "marketing", 1, 0)) %>%
  mutate(dep_prod = ifelse(department == "product_mng", 1, 0)) %>%
  mutate(dep_RandD = ifelse(department == "RandD", 1, 0)) %>%
  mutate(dep_sales = ifelse(department == "sales", 1, 0)) %>%
  mutate(dep_sup = ifelse(department == "support", 1, 0)) %>%
  mutate(dep_tech = ifelse(department == "technical", 1, 0)) %>%
  mutate(sal_med = ifelse(salary == "medium", 1, 0)) %>%
  mutate(sal_high = ifelse(salary == "high", 1, 0)) %>%
  # removing the original variables
  select(-number_project, -department, -salary)
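As an aside, glm() can handle factor variables directly; it one-hot encodes them internally through model.matrix(), so the manual encoding above is mainly a transparency choice. A minimal sketch of what R would build on its own, assuming department and salary are stored as factors in data2:

# Sketch only: the first level of each variable becomes the baseline
# (it gets no column), which is why glm() drops one dummy per factor.
head(model.matrix(~ factor(number_project) + department + salary,
                  data = data2), 3)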
Now I will split 75% of the data into a train set and the rest into a test set.
# Splitting
size <- floor(0.75 * nrow(data3))
set.seed(999)
train_index <- sample(seq_len(nrow(data3)), size = size)
train <- data3[train_index, ]
test <- data3[-train_index, ]
Okay, now it is time to create the model.
logistics <- glm(left ~ ., data=train, family = "binomial") |
Let’s see the result:
> logistics

Call:  glm(formula = left ~ ., family = "binomial", data = train)

Coefficients:
       (Intercept)  satisfaction_level     last_evaluation         avg_mth_hrs             tenure3
         -2.815573           -1.995981            2.256058            0.008536            2.476690
           tenure4             tenure5             tenure6             tenure7             tenure8
          2.607359            4.598674            3.390544          -14.324361          -14.236489
          tenure10           accident1         promotions1               proj3               proj4
        -14.188711           -1.635066           -1.132285           -5.232397           -3.984956
             proj5               proj6               proj7              dep_hr              dep_IT
         -3.295113           -2.466936           15.895408            0.188696           -0.190432
          dep_mngt             dep_mkt            dep_prod           dep_RandD           dep_sales
         -0.200763           -0.023563           -0.348506           -0.568599           -0.033864
           dep_sup            dep_tech             sal_med            sal_high
         -0.039155            0.059123           -0.464789           -1.951573

Degrees of Freedom: 11248 Total (i.e. Null);  11220 Residual
Null Deviance:     12370
Residual Deviance: 5990    AIC: 6048
Oh, that’s a whole lot of numbers. The formula for logistic regression is as follows:
\(P(x) = \frac{1}{1+e^{-q}}\)
where q is
$$q = \beta_0+\beta_1x_1+\dots+\beta_nx_n$$
In our case, \(P(x)\) is the probability that an employee would leave the company, and \(n\) is 28, since the model has 28 predictors (one for every coefficient above besides the intercept, which matches the difference between the null and residual degrees of freedom).
It is certainly doable to do the calculation manually. But why would we, when we have the predict() function! 😀
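Still, to convince ourselves that predict() is doing nothing magical, here is a minimal sketch (using the first test row as an example) that computes q on the link scale and pushes it through the logistic function by hand:

# q = b0 + b1*x1 + ... + bn*xn for one test row (the linear predictor)
q <- predict(logistics, newdata = test[1, ], type = "link")

# Apply the logistic function P(x) = 1 / (1 + e^-q) ourselves
p_manual <- 1 / (1 + exp(-q))

# predict() with type = "response" does both steps in one go
p_auto <- predict(logistics, newdata = test[1, ], type = "response")

all.equal(unname(p_manual), unname(p_auto))  # should be TRUE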
So, let’s see the predictive power of our model.
# Predict!
test$predicted <- predict(logistics, newdata = test, type = "response")
As we are interested in seeing how good our model is, I will round the numbers so that each is either 1 or 0 for easy comparison.
# Round the predicted value
test$round_predicted <- round(test$predicted)
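Note that rounding is simply a 0.5 probability cutoff in disguise. If missing a resignation were costlier than a false alarm, we could pick a different threshold; a hypothetical sketch:

# Essentially equivalent to round(): classify as 1 when P(leave) > 0.5
test$round_predicted <- ifelse(test$predicted > 0.5, 1, 0)

# A stricter, hypothetical cutoff that flags more potential leavers
# test$flagged <- ifelse(test$predicted > 0.3, 1, 0)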
Let’s see the result:
ggplot(test, aes(x = round_predicted, y = left, col = left)) +
  geom_jitter(alpha = 0.3) +
  theme_moma()
That’s not too bad. The bottom right and upper left clusters are the incorrect predictions. Judging from the clusters, we didn’t miss by much. But let’s look at the actual numbers.
test %>%
  mutate(incorrect = ifelse(left == round_predicted, "N", "Y")) %>%
  count(incorrect) %>%
  mutate(Percentage = percent(n / sum(n)))
# A tibble: 2 × 3
  incorrect     n Percentage
  <chr>     <int> <chr>
1 N          3386 90.3%
2 Y           364 9.7%
Not bad. We missed only 9.7%. Pretty good for such a plain, simple method. 🙂
The 90.3% figure is, in fact, derived from something with an official name: the Confusion Matrix. There are several metrics based on the confusion matrix that are used to evaluate the quality of a model.
One of the popular metrics is called Accuracy, which is defined as
\(\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}\)
where
$$\begin{align*}&TP = \text{True Positive}\\&TN = \text{True Negative}\\&FN = \text{False Negative}\\&FP = \text{False Positive}\end{align*}$$
We can calculate Accuracy in this formal form with the following code.
accuracy <- table(test$left, test$round_predicted)
percent(sum(diag(accuracy)) / sum(accuracy))
> accuracy <- table(test$left, test$round_predicted)
> percent(sum(diag(accuracy)) / sum(accuracy))
[1] "90.3%"
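To make the TP/TN/FP/FN mapping explicit, here is a minimal sketch that pulls each cell out of the table and recomputes the same number (assuming left is coded 0/1, with actual values on the rows and predictions on the columns):

cm <- table(actual = test$left, predicted = test$round_predicted)
TN <- cm["0", "0"]  # predicted stay,  actually stayed
FP <- cm["0", "1"]  # predicted leave, actually stayed
FN <- cm["1", "0"]  # predicted stay,  actually left
TP <- cm["1", "1"]  # predicted leave, actually left
percent((TP + TN) / (TP + TN + FP + FN))  # the same 90.3%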
Yep, it is exactly the same 90.3% as from the dplyr code. Now, let’s focus on what we missed. Among the 364 incorrect predictions, where did we miss the most?
test %>%
  mutate(incorrect = ifelse(left == round_predicted, "N", "Y")) %>%
  filter(incorrect == "Y") %>%
  count(round_predicted) %>%
  mutate(Percentage = percent(n / sum(n)))
# A tibble: 2 × 3
  round_predicted     n Percentage
            <dbl> <int> <chr>
1               0   175 48.08%
2               1   189 51.92%
It is about 50% in each category, with just a bit more on the “Yes” side: we predicted 189 employees would leave but they didn’t (false positives), while 175 employees left although we predicted they would stay (false negatives).
TL;DR… Logistic Regression correctly predicted 90.3% of the 25% test set. The incorrect predictions were spread evenly between the two groups. The result is not bad considering the costs (the time it took to prepare the data and to implement the model).