
Let’s use a Machine Learning technique called Logistic Regression to forecast employee resignations.

We got pretty good insights from the EDA section (here). Although we have a rough profile of employees who were likely to resign, we can do better than that. As we are dealing with a binary outcome variable, let’s try Logistic Regression.

I will use the “data2” dataset for the Machine Learning. Since there are a number of factor variables, we need to do one-hot encoding. As much as I love the “dummies” package for one-hot encoding, sometimes I just want to do it manually, as it is easier to see what went wrong if something does. So in this case, I’ll just use dplyr. In case you visited this page before the EDA, please visit (HR – EDA) to see the necessary prep work.
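A minimal sketch of what that manual encoding could look like; the column name `salary` and its levels are just placeholders for illustration, and the same pattern applies to whichever factor columns data2 actually has.

```r
library(dplyr)

# Manual one-hot encoding with dplyr.
# "salary" and its levels ("low", "medium", "high") are assumed here purely
# for illustration; repeat the pattern for each factor column in data2.
data2 <- data2 %>%
  mutate(
    salary_low    = ifelse(salary == "low",    1, 0),
    salary_medium = ifelse(salary == "medium", 1, 0),
    salary_high   = ifelse(salary == "high",   1, 0)
  ) %>%
  select(-salary)
```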

Now I will split 75% of the data into a train set and the rest into a test set.
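Something along these lines does the split; it is a plain base-R random sample, and the seed is only there for reproducibility.

```r
# 75/25 random split of data2 into train and test sets.
set.seed(123)
train_idx <- sample(seq_len(nrow(data2)), size = 0.75 * nrow(data2))
train <- data2[train_idx, ]
test  <- data2[-train_idx, ]
```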

Okay, now it is time to create the model.
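A sketch of the model call, assuming the 0/1 outcome column is named `Attrition` (adjust to whatever the actual column is called):

```r
# Logistic regression on all predictors in the training set.
# "Attrition" is an assumed name for the 0/1 outcome column.
model <- glm(Attrition ~ ., family = binomial(link = "logit"), data = train)
```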

Let’s see the result
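For a glm object, `summary()` prints the coefficient table:

```r
# Coefficients, standard errors, and significance of each predictor.
summary(model)
```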

Oh, that’s a whole lot of numbers. The formula for logistic regression is as follows:

\(P(x) = \frac{1}{1+e^{-q}}\)
where \(q\) is

$$q = \beta_0+\beta_1x_1+\dots+\beta_nx_n$$

In our case, \(P(x)\) is the probability that an employee would leave the company, and \(n\) is 23 as we have 23 predictors.

It is certainly doable to compute this manually. But why would we do that when we have the predict() function!   😀

So, let’s see the predictive power of our model.
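Assuming the model and test set from above, the predicted probabilities come from `predict()` with `type = "response"`:

```r
# Predicted probability of leaving for each employee in the test set.
pred_prob <- predict(model, newdata = test, type = "response")
```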

As we are interested in seeing how good our model is, I will round the numbers so that each prediction is either 1 or 0 for easy comparison.
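A sketch of the rounding step, keeping the predictions next to the actual outcome for comparison (again assuming the 0/1 outcome column is `Attrition`):

```r
# Turn probabilities into hard 0/1 predictions and keep them alongside the truth.
results <- test %>%
  mutate(prob      = pred_prob,
         predicted = round(prob))
```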

Let’s see the result
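One way to visualise the predictions against the actual values is a jittered scatter plot, for example:

```r
library(ggplot2)

# Jitter both axes so the four actual-vs-predicted clusters are visible.
# Assumes "Attrition" is the 0/1 actual outcome in the results data frame.
ggplot(results, aes(x = Attrition, y = predicted)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.3) +
  labs(x = "Actual", y = "Predicted")
```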

That’s not too bad. The bottom-right and upper-left clusters are the incorrect predictions. Judging from the clusters, we didn’t miss that much. But let’s see the numbers.
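For example, with dplyr (assuming the `results` data frame built above):

```r
# Share of correct vs. incorrect predictions.
results %>%
  mutate(correct = (predicted == Attrition)) %>%
  count(correct) %>%
  mutate(pct = n / sum(n))
```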

Not bad. We missed only 9.7%. Pretty good for such a plain and simple method. 🙂

The calculation behind the 90.3% is, in fact, based on something with an official name: the Confusion Matrix. The confusion matrix yields several metrics that are used to evaluate the quality of a model.

One of the popular metrics is called Accuracy, which is defined as

\(Accuracy = \frac{TP+TN}{TP+TN+FP+FN}\)

where

$$\begin{align*}&TP = \text{True Positive}\\&TN = \text{True Negative}\\&FN = \text{False Negative}\\&FP = \text{False Positive}\end{align*}$$

We can calculate \(Accuracy\) formally with the following code.
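For example, building the confusion matrix with `table()` and summing its diagonal (again assuming the `results` data frame from above):

```r
# Confusion matrix: rows are predictions, columns are actual outcomes.
cm <- table(Predicted = results$predicted, Actual = results$Attrition)
cm

# Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. diagonal over total.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```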

Yep, it is exactly the same 90.3% as from the dplyr code. Now, let’s focus on what we missed. Among the 364 incorrect predictions, where did we miss the most?
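A quick dplyr breakdown of the misses by predicted class answers that (same assumed `results` data frame as above):

```r
# Count the incorrect predictions by what the model predicted.
results %>%
  filter(predicted != Attrition) %>%
  count(predicted)
```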

It is about 50% in each category, with just a bit more on the “Yes” predictions: we predicted 189 employees would leave, but they didn’t.

TL;DR… Logistic Regression correctly predicted 90.3% of the 25% test set. The incorrect predictions were spread evenly between the two groups. The result is not bad compared to the cost (the time it took to prepare the data and implement the model).