Objective
The objective is to build a model that predicts whether a person will default on a loan. In this dataset, the target variable is 'Risk'.
Dataset Description
Age (Numeric: Age in years)
Sex (Categories: male, female)
Job (Categories: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
Housing (Categories: own, rent, or free)
Saving accounts (Categories: little, moderate, quite rich, rich)
Checking account (Categories: little, moderate, rich)
Credit amount (Numeric: Amount of credit in DM - Deutsche Mark)
Duration (Numeric: Duration for which the credit is given in months)
Purpose (Categories: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
Risk (0 - person is not at risk, 1 - person is at risk, i.e. a defaulter)
Importing libraries
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
    ConfusionMatrixDisplay,  # replaces plot_confusion_matrix, which was removed in newer scikit-learn releases
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
Loading Data
# Loading the dataset
german = pd.read_csv("German_Credit.csv")
# Checking the number of rows and columns in the data
german.shape
output:
(1000, 10)
Data Overview
data = german.copy()
# let's view the first 5 rows of the data
data.head()
output:
# let's view the last 5 rows of the data
data.tail()
Output:
# let's check the data types of the columns in the dataset
data.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1000 non-null int64
1 Sex 1000 non-null object
2 Job 1000 non-null int64
3 Housing 1000 non-null object
4 Saving accounts 817 non-null object
5 Checking account 606 non-null object
6 Credit amount 1000 non-null int64
7 Duration 1000 non-null int64
8 Purpose 1000 non-null object
9 Risk 1000 non-null int64
dtypes: int64(5), object(5)
memory usage: 78.2+ KB
There are a total of 10 columns and 1,000 observations in the dataset.
We can see that 2 columns have fewer than 1,000 non-null values, i.e. they have missing values.
# let's check for duplicate values in the data
data.duplicated().sum()
# let's check for missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)
Output:
Age 0.000
Sex 0.000
Job 0.000
Housing 0.000
Saving accounts 18.300
Checking account 39.400
Credit amount 0.000
Duration 0.000
Purpose 0.000
Risk 0.000
dtype: float64
The Saving accounts column has 18.3% missing values.
The Checking account column has 39.4% missing values.
We will impute these values after splitting the data into train, validation, and test sets.
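Before imputing, it can be worth checking whether the missingness itself is related to the target. The snippet below is an optional sketch (not part of the main flow) that computes the share of missing values per Risk class.
# Optional sketch: share of missing values per Risk class, to see whether
# missingness itself carries any signal about default
missing_by_risk = data.groupby("Risk")[["Saving accounts", "Checking account"]].apply(
    lambda g: g.isnull().mean().round(3)
)
print(missing_by_risk)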
Checking NULL values
# Checking for the null value in the dataset
data.isna().sum()
output:
Age 0
Sex 0
Job 0
Housing 0
Saving accounts 183
Checking account 394
Credit amount 0
Duration 0
Purpose 0
Risk 0
dtype: int64
Let's check the number of unique values in each column
data.nunique()
Output:
Age 53
Sex 2
Job 4
Housing 3
Saving accounts 4
Checking account 3
Credit amount 921
Duration 33
Purpose 8
Risk 2
dtype: int64
Age has only 53 unique values, i.e. many customers share the same ages.
We have only three continuous variables - Age, Credit amount, and Duration.
All other variables are categorical.
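As an optional aside (a sketch, not assigned back to the working dataframe), Job is stored as an integer but is really an ordered category; casting it makes summaries respect the skill ordering.
# Sketch only (not applied to `data`): treat Job as an ordered categorical
# so that counts and plots follow the skill ordering 0 < 1 < 2 < 3
job_ordered = pd.Categorical(data["Job"], categories=[0, 1, 2, 3], ordered=True)
print(pd.Series(job_ordered).value_counts(sort=False))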
# let's view the statistical summary of the numerical columns in the data
data.describe().T
Output:
The mean age is approximately 35 and the median is 33, so at least half of the customers are 33 or younger; the mean being higher than the median suggests a right-skewed age distribution.
The mean credit amount is approximately 3,271, but it has a wide range of 250 to 18,424. We will explore this further in univariate analysis.
The mean duration for which credit is given is approximately 21 months.
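describe().T above covers only the numeric columns. As a quick complementary sketch, skewness confirms the long right tail of Credit amount, and describe(include="object") summarizes the categorical columns.
# Skewness of the numeric columns; a value well above 0 for Credit amount
# confirms the right skew suggested by mean > median
print(data[["Age", "Credit amount", "Duration"]].skew())
# Count, unique, top and freq for the object (categorical) columns
print(data.describe(include="object").T)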
Checking the value counts for each category of the categorical variables
# Making a list of all categorical variables
cat_col = [
"Sex",
"Job",
"Housing",
"Saving accounts",
"Checking account",
"Purpose",
"Risk",
]
# Printing number of count of each unique value in each column
for column in cat_col:
print(data[column].value_counts())
print("-" * 40)
Output:
male 690
female 310
Name: Sex, dtype: int64
----------------------------------------
2 630
1 200
3 148
0 22
Name: Job, dtype: int64
----------------------------------------
own 713
rent 179
free 108
Name: Housing, dtype: int64
----------------------------------------
little 603
moderate 103
quite rich 63
rich 48
Name: Saving accounts, dtype: int64
----------------------------------------
little 274
moderate 269
rich 63
Name: Checking account, dtype: int64
----------------------------------------
car 337
radio/TV 280
furniture/equipment 181
business 97
education 59
repairs 22
domestic appliances 12
vacation/others 12
Name: Purpose, dtype: int64
----------------------------------------
0 700
1 300
Name: Risk, dtype: int64
----------------------------------------
We have more male customers than female customers.
There are very few observations (only 22) for customers in the job category 'unskilled and non-resident'.
The distribution of classes in the target variable is imbalanced, i.e. only 30% of observations are defaulters.
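The 30% figure can be confirmed directly with normalized value counts (a quick sketch):
# Class balance of the target: ~70% non-defaulters (0) vs ~30% defaulters (1)
print(data["Risk"].value_counts(normalize=True).round(2))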
Univariate analysis
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    # For the histogram; palette is dropped here since seaborn warns when
    # palette is passed without a hue variable
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
Observations on Age
# Observations on Age
histogram_boxplot(data, "Age")
output:
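The labeled_barplot helper called in the next cell is not defined in this excerpt; below is a minimal sketch of such a helper (a count plot annotated with percentage shares), assuming that is roughly what the original notebook defined.
# Minimal sketch of a labeled_barplot helper (not the original definition):
# a count plot with each bar annotated with its percentage share
def labeled_barplot(data, feature, perc=True):
    total = len(data[feature])
    plt.figure(figsize=(10, 5))
    ax = sns.countplot(data=data, x=feature, order=data[feature].value_counts().index)
    for p in ax.patches:
        label = (
            "{:.1f}%".format(100 * p.get_height() / total) if perc else p.get_height()
        )
        ax.annotate(
            label,
            (p.get_x() + p.get_width() / 2, p.get_height()),
            ha="center",
            va="bottom",
        )
    plt.show()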
Observations on Job
# observations on Job
labeled_barplot(data, "Job")
Output:
The majority of customers (63%) fall into the skilled category.
Only around 15% of customers lie in the highly skilled category, which makes sense as these are likely people with higher education or extensive experience.
Around 22% of observations fall into the unskilled categories (Job 0 or 1).
Bivariate Analysis
sns.pairplot(data, hue="Risk")
output:
The distributions overlap, i.e. there is no clear separation between the variable distributions of customers who defaulted and those who did not.
Let's explore this further with the help of other plots.
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Age", data=data, orient="v")
Output:
We can see that the median age of defaulters is less than the median age of non-defaulters.
This shows that younger customers are more likely to default.
There are outliers in the boxplots of both classes.
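To quantify the outliers visible in the boxplots, a quick IQR-based count per class can be computed (a sketch, using the same 1.5*IQR whisker rule the boxplot applies):
# Number of Age values outside the 1.5*IQR whiskers within each Risk class
def iqr_outlier_count(s):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()

print(data.groupby("Risk")["Age"].apply(iqr_outlier_count))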
Data Preparation for Modeling
Split data
X = data.drop(["Risk"], axis=1)
y = data["Risk"]
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
output:
(600, 9) (200, 9) (200, 9)
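Since stratify was used in both splits, the defaulter ratio should be close to the overall 30% in each set; a quick sanity check (sketch):
# Proportion of defaulters (Risk = 1) in each split; stratification should
# keep these close to the overall 30%
for split_name, target in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(split_name, round(target.mean(), 3))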
Missing-Value Treatment and Dummy Encoding
We will use the mode to impute the missing values in the Saving accounts and Checking account columns.
# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Saving accounts", "Checking account"]
# fit and transform the imputer on train data
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])
# Transform on validation data
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])
# Transform on test data
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
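One caveat of calling pd.get_dummies separately on each split: if a rare category happens to be absent from the validation or test set, their dummy columns will not match the training columns. A defensive alignment step (a sketch, not part of the original flow) is shown below.
# Align validation/test dummy columns with the training columns:
# dummies missing in a split are added as 0, unseen ones are dropped
X_val = X_val.reindex(columns=X_train.columns, fill_value=0)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)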
Model evaluation criterion
We will be using Recall as the metric for model performance because the company could face two types of losses:
Giving a loan to a defaulter - loss of money
Not giving a loan to a non-defaulter - loss of opportunity
Which loss is greater?
Giving a loan to a defaulter, i.e. predicting that a person is not at risk while the person is actually at risk of defaulting.
How do we reduce this loss, i.e. reduce false negatives?
The company wants recall to be maximized, i.e. we need to reduce the number of false negatives.
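Recall is TP / (TP + FN), the share of actual defaulters the model catches. The string scorer name "recall" used in the cross-validation below is equivalent to the explicit scorer sketched here:
# "recall" as a scoring string is recall_score on the positive class (Risk = 1):
# recall = TP / (TP + FN), the fraction of actual defaulters correctly flagged
from sklearn.metrics import make_scorer, recall_score

recall_scorer = make_scorer(recall_score, pos_label=1)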
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
score.append(scores)
print("{}: {}".format(name, scores))
output:
Cross-Validation Performance:
Bagging: 24.444444444444446
Random forest: 24.444444444444446
GBM: 25.0
Adaboost: 25.0
Xgboost: 27.222222222222225
dtree: 43.33333333333333
Validation Performance:
Bagging: 0.2833333333333333
Random forest: 0.31666666666666665
GBM: 0.31666666666666665
Adaboost: 0.26666666666666666
Xgboost: 0.26666666666666666
dtree: 0.31666666666666665
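As an optional sketch, the cross-validation and validation recalls collected above can also be summarized in a single table to complement the boxplot in the next section.
# Sketch: tabular view of CV recall (mean and std) next to validation recall
cv_summary = pd.DataFrame(
    {
        "model": names,
        "cv_recall_mean": [r.mean() for r in results],
        "cv_recall_std": [r.std() for r in results],
        "val_recall": score,
    }
)
print(cv_summary)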
Result Comparison
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Output:
Comments