
Model to Predict the Attrition of McCurr Consultancy Data | Machine Learning Assignment Help

Background:

McCurr Consultancy is an MNC that has thousands of employees spread across the globe. The company believes in hiring the best talent available and retaining them for as long as possible. A huge amount of resources is spent on retaining existing employees through various initiatives. The Head of People Operations wants to bring down the cost of retaining employees. For this, he proposes limiting the incentives to only those employees who are at risk of attrition. As a recently hired Data Scientist in the People Operations Department, you have been asked to identify patterns in characteristics of employees who leave the organization. Also, you have to use this information to predict if an employee is at risk of attrition. This information will be used to target them with incentives.

Reference: Great Learning


Objective:

  • To identify the different factors that drive attrition.

  • To build a model to predict attrition and determine which algorithm gives the best performance.

Dataset:

The data contains demographic details, work-related metrics, and an attrition flag.

  • EmployeeNumber - Employee Identifier

  • Attrition - Did the employee attrite?

  • Age - Age of the employee

  • BusinessTravel - Travel commitments for the job

  • DailyRate - Data description not available**

  • Department - Employee Department

  • DistanceFromHome - Distance from work to home (in km)

  • Education - 1-Below College, 2-College, 3-Bachelor, 4-Master, 5-Doctor

  • EducationField - Field of Education

  • EmployeeCount - Employee Count in a row

  • EnvironmentSatisfaction - 1-Low, 2-Medium, 3-High, 4-Very High

  • Gender - Employee's gender

  • HourlyRate - Data description not available**

  • JobInvolvement - 1-Low, 2-Medium, 3-High, 4-Very High

  • JobLevel - Level of job (1 to 5)

  • JobRole - Job Roles

  • JobSatisfaction - 1-Low, 2-Medium, 3-High, 4-Very High

  • MaritalStatus - Marital Status

  • MonthlyIncome - Monthly Salary

  • MonthlyRate - Data description not available**

  • NumCompaniesWorked - Number of companies worked at

  • Over18 - Over 18 years of age?

  • OverTime - Overtime?

  • PercentSalaryHike - The percentage increase in salary last year

  • PerformanceRating - 1-Low, 2-Good, 3-Excellent, 4-Outstanding

  • RelationshipSatisfaction - 1-Low, 2-Medium, 3-High, 4-Very High

  • StandardHours - Standard Hours

  • StockOptionLevel - Stock Option Level

  • TotalWorkingYears - Total years worked

  • TrainingTimesLastYear - Number of trainings attended last year

  • WorkLifeBalance - 1-Low, 2-Good, 3-Excellent, 4-Outstanding

  • YearsAtCompany - Years at Company

  • YearsInCurrentRole - Years in the current role

  • YearsSinceLastPromotion - Years since the last promotion

  • YearsWithCurrManager - Years with the current manager

** In the real world, you will not find definitions for some of your variables. It is a part of the analysis to figure out what they might mean.



Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
import scipy.stats as stats
from sklearn import metrics
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')


Read the dataset

hr=pd.read_csv("HR_Employee_Attrition-1.csv")
# copying data to another variable to avoid any changes to the original data
data=hr.copy()


View the first five rows of the dataset.

data.head()

output:


Understand the shape of the dataset.

data.shape

Check the data types of the columns for the dataset

data.info()

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   EmployeeNumber            2940 non-null   int64 
 1   Attrition                 2940 non-null   object
 2   Age                       2940 non-null   int64 
 3   BusinessTravel            2940 non-null   object
 4   DailyRate                 2940 non-null   int64 
 5   Department                2940 non-null   object
 6   DistanceFromHome          2940 non-null   int64 
 7   Education                 2940 non-null   int64 
 8   EducationField            2940 non-null   object
 9   EmployeeCount             2940 non-null   int64 
 10  EnvironmentSatisfaction   2940 non-null   int64 
 11  Gender                    2940 non-null   object
 12  HourlyRate                2940 non-null   int64 
 13  JobInvolvement            2940 non-null   int64 
 14  JobLevel                  2940 non-null   int64 
 15  JobRole                   2940 non-null   object
 16  JobSatisfaction           2940 non-null   int64 
 17  MaritalStatus             2940 non-null   object
 18  MonthlyIncome             2940 non-null   int64 
 19  MonthlyRate               2940 non-null   int64 
 20  NumCompaniesWorked        2940 non-null   int64 
 21  Over18                    2940 non-null   object
 22  OverTime                  2940 non-null   object
 23  PercentSalaryHike         2940 non-null   int64 
 24  PerformanceRating         2940 non-null   int64 
 25  RelationshipSatisfaction  2940 non-null   int64 
 26  StandardHours             2940 non-null   int64 
 27  StockOptionLevel          2940 non-null   int64 
 28  TotalWorkingYears         2940 non-null   int64 
 29  TrainingTimesLastYear     2940 non-null   int64 
 30  WorkLifeBalance           2940 non-null   int64 
 31  YearsAtCompany            2940 non-null   int64 
 32  YearsInCurrentRole        2940 non-null   int64 
 33  YearsSinceLastPromotion   2940 non-null   int64 
 34  YearsWithCurrManager      2940 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 804.0+ KB

Observations -

  • There are no null values in the dataset.

  • We can convert the object type columns to categories.

converting "objects" to "category" reduces the data space required to store the dataframe


Fixing the data types

cols = data.select_dtypes(['object'])
cols.columns

output:

Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'Over18', 'OverTime'],
      dtype='object')

for i in cols.columns:
    data[i] = data[i].astype('category')

data.info()

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   EmployeeNumber            2940 non-null   int64   
 1   Attrition                 2940 non-null   category
 2   Age                       2940 non-null   int64   
 3   BusinessTravel            2940 non-null   category
 4   DailyRate                 2940 non-null   int64   
 5   Department                2940 non-null   category
 6   DistanceFromHome          2940 non-null   int64   
 7   Education                 2940 non-null   int64   
 8   EducationField            2940 non-null   category
 9   EmployeeCount             2940 non-null   int64   
 10  EnvironmentSatisfaction   2940 non-null   int64   
 11  Gender                    2940 non-null   category
 12  HourlyRate                2940 non-null   int64   
 13  JobInvolvement            2940 non-null   int64   
 14  JobLevel                  2940 non-null   int64   
 15  JobRole                   2940 non-null   category
 16  JobSatisfaction           2940 non-null   int64   
 17  MaritalStatus             2940 non-null   category
 18  MonthlyIncome             2940 non-null   int64   
 19  MonthlyRate               2940 non-null   int64   
 20  NumCompaniesWorked        2940 non-null   int64   
 21  Over18                    2940 non-null   category
 22  OverTime                  2940 non-null   category
 23  PercentSalaryHike         2940 non-null   int64   
 24  PerformanceRating         2940 non-null   int64   
 25  RelationshipSatisfaction  2940 non-null   int64   
 26  StandardHours             2940 non-null   int64   
 27  StockOptionLevel          2940 non-null   int64   
 28  TotalWorkingYears         2940 non-null   int64   
 29  TrainingTimesLastYear     2940 non-null   int64   
 30  WorkLifeBalance           2940 non-null   int64   
 31  YearsAtCompany            2940 non-null   int64   
 32  YearsInCurrentRole        2940 non-null   int64   
 33  YearsSinceLastPromotion   2940 non-null   int64   
 34  YearsWithCurrManager      2940 non-null   int64   
dtypes: category(9), int64(26)
memory usage: 624.6 KB

We can see that the memory usage has decreased from 804.0 KB to 624.6 KB. This technique is generally useful for bigger datasets.
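
To see where the savings come from, we can compare per-column memory before and after the conversion. This is a minimal sketch (not part of the original notebook) using pandas' memory_usage; deep=True accounts for the actual size of the strings in object columns, and hr and data are the dataframes defined above:

# per-column memory (in bytes) before and after the dtype conversion
before = hr.memory_usage(deep=True)
after = data.memory_usage(deep=True)
savings = (before - after).sort_values(ascending=False)
print(savings.head(10))  # columns with the largest savings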


Summary of the dataset.

data.describe().T

output:

  • EmployeeNumber is an ID variable and not useful for predictive modelling.

  • Age of the employees ranges from 18 to 60 years, and the average age is around 36 years.

  • EmployeeCount has only 1 as the value in all rows and can be dropped as it adds no information to our analysis.

  • StandardHours has only 80 as the value in all rows and can be dropped as it adds no information to our analysis.

  • HourlyRate has a huge range, but we do not yet know what this variable stands for. The same goes for DailyRate and MonthlyRate.

  • MonthlyIncome has a high range, and the difference between the mean and median indicates the presence of outliers (see the quick check below).
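
As a quick sanity check on those outliers, here is a minimal sketch (the 1.5×IQR fence is a common convention, not something from the original notebook):

# count MonthlyIncome values beyond the usual 1.5*IQR upper fence
q1, q3 = data['MonthlyIncome'].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print((data['MonthlyIncome'] > upper_fence).sum(), "values above", upper_fence)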


data.describe(include=['category']).T

output:

  • Attrition is our target variable; about 84% of the records are 'No', i.e., the employee did not attrite.

  • The majority of employees have low business travel requirements.

  • The majority of employees are from the Research & Development department.

  • All employees are over 18 years of age, so we can drop this variable as it adds no information to our analysis.

  • There are more male employees than female employees.


Dropping columns which are not adding any information.

data.drop(['EmployeeNumber','EmployeeCount','StandardHours','Over18'],axis=1,inplace=True)

Let's look at the unique values of all the categories.

cols_cat= data.select_dtypes(['category'])
for i in cols_cat.columns:
    print('Unique values in',i, 'are :')
    print(cols_cat[i].value_counts())
    print('*'*50)

output:

EDA (Exploratory Data Analysis)

Univariate Analysis

# function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Observations on Age

histogram_boxplot(data,'Age')

output:

Bivariate Analysis

plt.figure(figsize=(20,10))
sns.heatmap(data.corr(numeric_only=True),annot=True,vmin=-1,vmax=1,fmt='.2f',cmap="Spectral")  # numeric_only=True skips the category columns (required in newer pandas)
plt.show()

output:

  • There are a few variables that are correlated with each other, but there are no surprises here.

  • Unsurprisingly, TotalWorkingYears is highly correlated with JobLevel (i.e., the longer you work, the higher the job level you achieve).

  • HourlyRate, DailyRate, and MonthlyRate are completely uncorrelated with each other, which makes it harder to understand what these variables might represent.

  • MonthlyIncome is highly correlated with JobLevel.

  • Age is positively correlated with JobLevel and Education (i.e., the older an employee is, the more educated and the higher the job level they hold).

  • WorkLifeBalance is not correlated with any of the numeric variables. (See the sketch after this list for a quick way to rank the strongest pairs.)
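
To back these observations with numbers, here is a small sketch (not from the original notebook) that ranks the most correlated numeric pairs from the matrix above:

# rank numeric variable pairs by absolute correlation
corr = data.corr(numeric_only=True)
pairs = corr.abs().unstack().sort_values(ascending=False)
pairs = pairs[pairs < 1.0]  # drop self-correlations
print(pairs.drop_duplicates().head(10))  # mirrored pairs share a value, so duplicates collapse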


sns.pairplot(data,hue='Attrition')
plt.show()

output:

  • We can see varying distributions in several variables with respect to Attrition; we should investigate this further.


Attrition vs Earnings of Employees

cols = data[['DailyRate','HourlyRate','MonthlyRate','MonthlyIncome','PercentSalaryHike']].columns.tolist()
plt.figure(figsize=(10,10))

for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    # keyword arguments are required by newer versions of seaborn
    sns.boxplot(x=data["Attrition"], y=data[variable], palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()

output:

  • Employees with a lower daily rate and a lower monthly income are more likely to attrite.

  • Monthly rate and hourly rate don't seem to have any effect on attrition.

  • A smaller salary hike also contributes to attrition.


Model Building - Approach

  1. Data preparation

  2. Partition the data into train and test set.

  3. Build model on the train data.

  4. Tune the model if required.

  5. Test the model on the test set.

Split Data

  • When classification problems exhibit a significant imbalance in the distribution of the target classes, it is good to use stratified sampling to ensure that relative class frequencies are approximately preserved in train and test sets.

  • This is done using the stratify parameter in the train_test_split function.


X = data.drop(['Attrition'],axis=1)
X = pd.get_dummies(X,drop_first=True)
y = data['Attrition'].apply(lambda x : 1 if x=='Yes' else 0)
# Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1,stratify=y)
print(X_train.shape, X_test.shape)
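
A quick check (not in the original notebook) that stratification preserved the class ratio across the splits:

# class proportions should be nearly identical in the full data and both splits
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))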

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. Predicting an employee will attrite when the employee doesn't (a False Positive).

  2. Predicting an employee will not attrite when the employee does (a False Negative).

Which case is more important?

  • Predicting that an employee will not attrite when the employee actually does, i.e., losing a valuable employee or asset.

How to reduce this loss, i.e., how do we reduce false negatives?

  • The company wants Recall to be maximized: the greater the Recall, the higher the chance of minimizing false negatives. Hence, the focus should be on increasing Recall, i.e., correctly identifying the true positives (Class 1), so that the company can target incentives at employees who are actually at risk, especially top performers, thereby optimizing the overall cost of retaining the best talent. (A toy example follows below.)
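
For reference, Recall is computed as TP / (TP + FN). A toy example with made-up labels, using the recall_score already imported above:

# three employees actually attrited (1); the model catches two and misses one
y_true = [1, 1, 1, 0, 0]
y_hat  = [1, 0, 1, 0, 0]
print(recall_score(y_true, y_hat))  # 2 / (2 + 1) = 0.67 (approx.)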

Let's define a function to provide metric scores (accuracy, recall, precision, F1, and ROC-AUC) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.


# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    roc_auc = roc_auc_score(target, pred)  # to compute ROC-AUC on the predicted labels (matches the tables below)

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
            "ROC-AUC": roc_auc,
        },
        index=[0],
    )

    return df_perf

def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Build Decision Tree Model

  • We will build our model using the DecisionTreeClassifier class, with the default 'gini' criterion for splitting.

  • If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will be biased toward it.

  • In this case, we can pass a dictionary {0:0.17,1:0.83} to the model to specify the weight of each class, and the decision tree will give more weight to class 1 (see the illustration after this list).

  • class_weight is a hyperparameter for the decision tree classifier.
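
As an aside, sklearn can derive such weights automatically with class_weight='balanced', which uses the formula n_samples / (n_classes * np.bincount(y)). A quick illustration (not part of the original notebook):

# what 'balanced' weights would look like for our training labels
counts = np.bincount(y_train)          # [n_class0, n_class1]
weights = len(y_train) / (2 * counts)  # two classes here
print(dict(enumerate(weights)))        # roughly {0: 0.6, 1: 3.1} for a ~16% positive class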


dtree = DecisionTreeClassifier(criterion='gini',class_weight={0:0.17,1:0.83},random_state=1)
dtree.fit(X_train, y_train)
confusion_matrix_sklearn(dtree, X_test, y_test)

output:

Confusion Matrix -

  • Employee left and the model correctly predicted that the employee would attrite: True Positive (observed=1, predicted=1)

  • Employee didn't leave but the model predicted the employee would attrite: False Positive (observed=0, predicted=1)

  • Employee didn't leave and the model predicted the employee would not attrite: True Negative (observed=0, predicted=0)

  • Employee left but the model predicted the employee wouldn't: False Negative (observed=1, predicted=0)


dtree_model_train_perf=model_performance_classification_sklearn(dtree, X_train, y_train)
print("Training performance \n",dtree_model_train_perf)

output:

Training performance 
    Accuracy  Recall  Precision   F1  ROC-AUC
0       1.0     1.0        1.0  1.0      1.0

dtree_model_test_perf=model_performance_classification_sklearn(dtree, X_test, y_test)
print("Testing performance \n",dtree_model_test_perf)

output:

Testing performance 
    Accuracy   Recall  Precision        F1   ROC-AUC
0  0.941043  0.84507        0.8  0.821918  0.902265
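
The training scores are all 1.0 while the test scores are lower, which suggests the default tree is overfitting. Step 4 of the approach calls for tuning; since GridSearchCV is already imported, here is a minimal sketch of how the tree could be tuned for recall (the grid values are illustrative assumptions, not from the original notebook):

# tune tree depth and leaf size, optimizing for recall
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_leaf': [1, 5, 10, 25],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1),
    param_grid,
    scoring='recall',  # the metric we chose to maximize
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)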