Objective
The objective is to build a model that predicts whether a person will default on a loan. In this dataset, the target variable is 'Risk'.
Dataset Description
Age (Numeric: Age in years)
Sex (Categories: male, female)
Job (Categories: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
Housing (Categories: own, rent, or free)
Saving accounts (Categories: little, moderate, quite rich, rich)
Checking account (Categories: little, moderate, rich)
Credit amount (Numeric: Amount of credit in DM - Deutsche Mark)
Duration (Numeric: Duration for which the credit is given in months)
Purpose (Categories: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
Risk (0 - person is not at risk, 1 - person is at risk, i.e. a defaulter)
Importing libraries
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
    ConfusionMatrixDisplay,  # replaces plot_confusion_matrix, which was removed in newer scikit-learn releases
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
Loading Data
# Loading the dataset
german = pd.read_csv("German_Credit.csv")
# Checking the number of rows and columns in the data
german.shape
output:
(1000, 10)
Data Overview
data = german.copy()
# let's view the first 5 rows of the data
data.head()
output:
# let's view the last 5 rows of the data
data.tail()
Output:
# let's check the data types of the columns in the dataset
data.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1000 non-null int64
1 Sex 1000 non-null object
2 Job 1000 non-null int64
3 Housing 1000 non-null object
4 Saving accounts 817 non-null object
5 Checking account 606 non-null object
6 Credit amount 1000 non-null int64
7 Duration 1000 non-null int64
8 Purpose 1000 non-null object
9 Risk 1000 non-null int64
dtypes: int64(5), object(5)
memory usage: 78.2+ KB
There are a total of 10 columns and 1,000 observations in the dataset.
We can see that 2 columns have fewer than 1,000 non-null values, i.e. they have missing values.
# let's check for duplicate values in the data
data.duplicated().sum()
# let's check for missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)
Output:
Age 0.000
Sex 0.000
Job 0.000
Housing 0.000
Saving accounts 18.300
Checking account 39.400
Credit amount 0.000
Duration 0.000
Purpose 0.000
Risk 0.000
dtype: float64
The Saving accounts column has 18.3% missing values.
The Checking account column has 39.4% missing values.
We will impute these values after splitting the data into train, validation, and test sets.
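Before imputing, it can be worth checking whether the missingness itself is related to the target. The snippet below is an optional sketch (not part of the main flow) that computes the share of missing values per Risk class.
# Optional sketch: share of missing values per Risk class, to see whether
# missingness itself carries any signal about default
missing_by_risk = data.groupby("Risk")[["Saving accounts", "Checking account"]].apply(
    lambda g: g.isnull().mean().round(3)
)
print(missing_by_risk)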
Checking NULL values
# Checking for the null value in the dataset
data.isna().sum()
output:
Age 0
Sex 0
Job 0
Housing 0
Saving accounts 183
Checking account 394
Credit amount 0
Duration 0
Purpose 0
Risk 0
dtype: int64
Let's check the number of unique values in each column
data.nunique()
Output:
Age 53
Sex 2
Job 4
Housing 3
Saving accounts 4
Checking account 3
Credit amount 921
Duration 33
Purpose 8
Risk 2
dtype: int64
Age has only 53 unique values, i.e. many customers share the same ages.
We have only three continuous variables - Age, Credit amount, and Duration.
All other variables are categorical.
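As an optional aside (a sketch, not assigned back to the working dataframe), Job is stored as an integer but is really an ordered category; casting it makes summaries respect the skill ordering.
# Sketch only (not applied to `data`): treat Job as an ordered categorical
# so that counts and plots follow the skill ordering 0 < 1 < 2 < 3
job_ordered = pd.Categorical(data["Job"], categories=[0, 1, 2, 3], ordered=True)
print(pd.Series(job_ordered).value_counts(sort=False))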
# let's view the statistical summary of the numerical columns in the data
data.describe().T
Output:
The mean age is approximately 35 and the median is 33, so at least half of the customers are 33 or younger; the mean being higher than the median suggests a right-skewed age distribution.
The mean credit amount is approximately 3,271, but it has a wide range of 250 to 18,424. We will explore this further in univariate analysis.
The mean duration for which credit is given is approximately 21 months.
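describe().T above covers only the numeric columns. As a quick complementary sketch, skewness confirms the long right tail of Credit amount, and describe(include="object") summarizes the categorical columns.
# Skewness of the numeric columns; a value well above 0 for Credit amount
# confirms the right skew suggested by mean > median
print(data[["Age", "Credit amount", "Duration"]].skew())
# Count, unique, top and freq for the object (categorical) columns
print(data.describe(include="object").T)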
Checking the value counts for each category of the categorical variables
# Making a list of all categorical variables
cat_col = [
"Sex",
"Job",
"Housing",
"Saving accounts",
"Checking account",
"Purpose",
"Risk",
]
# Printing number of count of each unique value in each column
for column in cat_col:
print(data[column].value_counts())
print("-" * 40)
Output:
male 690
female 310
Name: Sex, dtype: int64
----------------------------------------
2 630
1 200
3 148
0 22
Name: Job, dtype: int64
----------------------------------------
own 713
rent 179
free 108
Name: Housing, dtype: int64
----------------------------------------
little 603
moderate 103
quite rich 63
rich 48
Name: Saving accounts, dtype: int64
----------------------------------------
little 274
moderate 269
rich 63
Name: Checking account, dtype: int64
----------------------------------------
car 337
radio/TV 280
furniture/equipment 181
business 97
education 59
repairs 22
domestic appliances 12
vacation/others 12
Name: Purpose, dtype: int64
----------------------------------------
0 700
1 300
Name: Risk, dtype: int64
----------------------------------------
We have more male customers than female customers.
There are very few observations (only 22) for customers in the job category 'unskilled and non-resident'.
The distribution of classes in the target variable is imbalanced, i.e. only 30% of observations are defaulters.
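The 30% figure can be confirmed directly with normalized value counts (a quick sketch):
# Class balance of the target: ~70% non-defaulters (0) vs ~30% defaulters (1)
print(data["Risk"].value_counts(normalize=True).round(2))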
Univariate analysis
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    # For the histogram; palette is dropped here since seaborn warns when
    # palette is passed without a hue variable
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
Observations on Age
# Observations on Age
histogram_boxplot(data, "Age")
output:
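The labeled_barplot helper called in the next cell is not defined in this excerpt; below is a minimal sketch of such a helper (a count plot annotated with percentage shares), assuming that is roughly what the original notebook defined.
# Minimal sketch of a labeled_barplot helper (not the original definition):
# a count plot with each bar annotated with its percentage share
def labeled_barplot(data, feature, perc=True):
    total = len(data[feature])
    plt.figure(figsize=(10, 5))
    ax = sns.countplot(data=data, x=feature, order=data[feature].value_counts().index)
    for p in ax.patches:
        label = (
            "{:.1f}%".format(100 * p.get_height() / total) if perc else p.get_height()
        )
        ax.annotate(
            label,
            (p.get_x() + p.get_width() / 2, p.get_height()),
            ha="center",
            va="bottom",
        )
    plt.show()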
Observations on Job
# observations on Job
labeled_barplot(data, "Job")
Output:
The majority of customers (63%) fall into the skilled category.
Only around 15% of customers lie in the highly skilled category, which makes sense as these are likely people with higher education or extensive experience.
Around 22% of observations fall into the unskilled categories (Job 0 or 1).
Bivariate Analysis
sns.pairplot(data, hue="Risk")
output:
The distributions overlap, i.e. there is no clear separation between the variable distributions of customers who defaulted and those who did not.
Let's explore this further with the help of other plots.
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Age", data=data, orient="v")
Output:
We can see that the median age of defaulters is less than the median age of non-defaulters.
This shows that younger customers are more likely to default.
There are outliers in the boxplots of both classes.
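To quantify the outliers visible in the boxplots, a quick IQR-based count per class can be computed (a sketch, using the same 1.5*IQR whisker rule the boxplot applies):
# Number of Age values outside the 1.5*IQR whiskers within each Risk class
def iqr_outlier_count(s):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()

print(data.groupby("Risk")["Age"].apply(iqr_outlier_count))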
Data Preparation for Modeling
Split data
X = data.drop(["Risk"], axis=1)
y = data["Risk"]
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
output:
(600, 9) (200, 9) (200, 9)
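Since stratify was used in both splits, the defaulter ratio should be close to the overall 30% in each set; a quick sanity check (sketch):
# Proportion of defaulters (Risk = 1) in each split; stratification should
# keep these close to the overall 30%
for split_name, target in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(split_name, round(target.mean(), 3))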
Missing-Value Treatment and Dummy Encoding
We will use the mode to impute the missing values in the Saving accounts and Checking account columns.
# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Saving accounts", "Checking account"]
# fit and transform the imputer on train data
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])
# Transform on validation data
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])
# Transform on test data
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
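One caveat of calling pd.get_dummies separately on each split: if a rare category happens to be absent from the validation or test set, their dummy columns will not match the training columns. A defensive alignment step (a sketch, not part of the original flow) is shown below.
# Align validation/test dummy columns with the training columns:
# dummies missing in a split are added as 0, unseen ones are dropped
X_val = X_val.reindex(columns=X_train.columns, fill_value=0)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)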
Model evaluation criterion
We will be using Recall as the metric for model performance because the company could face two types of losses:
Giving a loan to a defaulter - loss of money
Not giving a loan to a non-defaulter - loss of opportunity
Which loss is greater?
Giving a loan to a defaulter, i.e. predicting that a person is not at risk while the person is actually at risk of defaulting.
How do we reduce this loss, i.e. reduce false negatives?
The company wants recall to be maximized, i.e. we need to reduce the number of false negatives.
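Recall is TP / (TP + FN), the share of actual defaulters the model catches. The string scorer name "recall" used in the cross-validation below is equivalent to the explicit scorer sketched here:
# "recall" as a scoring string is recall_score on the positive class (Risk = 1):
# recall = TP / (TP + FN), the fraction of actual defaulters correctly flagged
from sklearn.metrics import make_scorer, recall_score

recall_scorer = make_scorer(recall_score, pos_label=1)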
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
score.append(scores)
print("{}: {}".format(name, scores))
output:
Cross-Validation Performance:
Bagging: 24.444444444444446
Random forest: 24.444444444444446
GBM: 25.0
Adaboost: 25.0
Xgboost: 27.222222222222225
dtree: 43.33333333333333
Validation Performance:
Bagging: 0.2833333333333333
Random forest: 0.31666666666666665
GBM: 0.31666666666666665
Adaboost: 0.26666666666666666
Xgboost: 0.26666666666666666
dtree: 0.31666666666666665
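As an optional sketch, the cross-validation and validation recalls collected above can also be summarized in a single table to complement the boxplot in the next section.
# Sketch: tabular view of CV recall (mean and std) next to validation recall
cv_summary = pd.DataFrame(
    {
        "model": names,
        "cv_recall_mean": [r.mean() for r in results],
        "cv_recall_std": [r.std() for r in results],
        "val_recall": score,
    }
)
print(cv_summary)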
Result Comparison
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Output:
Comments