top of page

German Credit Analysis Using Python Data Science

Objective

The objective is to build a model to predict whether a person would default or not. In this dataset, the target variable is 'Risk'.


Dataset Description

  • Age (Numeric: Age in years)

  • Sex (Categories: male, female)

  • Job (Categories : 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)

  • Housing (Categories: own, rent, or free)

  • Saving accounts (Categories: little, moderate, quite rich, rich)

  • Checking account (Categories: little, moderate, rich)

  • Credit amount (Numeric: Amount of credit in DM - Deutsche Mark)

  • Duration (Numeric: Duration for which the credit is given in months)

  • Purpose (Categories: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)

  • Risk (0 - Person is not at risk, 1 - Person is at risk(defaulter))


Importing libraries

# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To supress warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

Loading Data

# Loading the dataset
german = pd.read_csv("German_Credit.csv")
# Checking the number of rows and columns in the data
german.shape

output:

(1000, 10)

Data Overview

data = german.copy()
# let's view the first 5 rows of the data
data.head()

output:






# let's view the last 5 rows of the data
data.tail()

Output:






# let's check the data types of the columns in the dataset
data.info()

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   817 non-null    object
 5   Checking account  606 non-null    object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   int64 
dtypes: int64(5), object(5)
memory usage: 78.2+ KB

  • There are a total of 10 columns and 1,000 observations in the dataset

  • We can see that 2 columns have less than 1,000 non-null values i.e. columns have missing values.

# let's check for duplicate values in the data
data.duplicated().sum()
# let's check for missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)

Output:

Age                 0.000
Sex                 0.000
Job                 0.000
Housing             0.000
Saving accounts    18.300
Checking account   39.400
Credit amount       0.000
Duration            0.000
Purpose             0.000
Risk                0.000
dtype: float64

  • Saving accounts column has 18.3% missing values out of the total observations.

  • Checking account column has 39.4% missing values out of the total observations.

  • We will impute these values after splitting the data into train,validation and test sets.

Checking NULL values

# Checking for the null value in the dataset
data.isna().sum()

output:

Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64

Let's check the number of unique values in each column

data.nunique()

Output:

Age                  53
Sex                   2
Job                   4
Housing               3
Saving accounts       4
Checking account      3
Credit amount       921
Duration             33
Purpose               8
Risk                  2
dtype: int64
  • Age has only 53 unique values i.e. most of the customers are of similar age

  • We have only three continuous variables - Age, Credit Amount and Duration.

  • All other variables are categorical


# let's view the statistical summary of the numerical columns in the data
data.describe().T

Output:







  • Mean value for the age column is approx 35 and the median is 33. This shows that majority of the customers are under 35 years of age.

  • Mean amount of credit is approx 3,271 but it has a wide range of 250 to 18,424. We will explore this further in univariate analysis.

  • Mean duration for which the credit is given is approx 21 months.


Checking the value count for each category of categorical variables

# Making a list of all catrgorical variables
cat_col = [
    "Sex",
    "Job",
    "Housing",
    "Saving accounts",
    "Checking account",
    "Purpose",
    "Risk",
]

# Printing number of count of each unique value in each column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 40)

Output:

male      690
female    310
Name: Sex, dtype: int64
----------------------------------------
2    630
1    200
3    148
0     22
Name: Job, dtype: int64
----------------------------------------
own     713
rent    179
free    108
Name: Housing, dtype: int64
----------------------------------------
little        603
moderate      103
quite rich     63
rich           48
Name: Saving accounts, dtype: int64
----------------------------------------
little      274
moderate    269
rich         63
Name: Checking account, dtype: int64
----------------------------------------
car                    337
radio/TV               280
furniture/equipment    181
business                97
education               59
repairs                 22
domestic appliances     12
vacation/others         12
Name: Purpose, dtype: int64
----------------------------------------
0    700
1    300
Name: Risk, dtype: int64
----------------------------------------

  • We have more male customers as compared to female customers

  • There are very few observations i.e. only 22 for customers with job category - unskilled and non-resident

  • We can see that the distribution of classes in the target variable is imbalanced i.e. only 30% observations with defaulters.


Univariate analysis

# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Observation on Age

# Observations on Customer_age
histogram_boxplot(data, "Age")

output:












Observations on Job

# observations on Job
labeled_barplot(data, "Job")

Output:

















  • Majority of the customers i.e. 63% fall into the skilled category.

  • There are only approx 15% of customers that lie in the highly skilled category which makes sense as these may be the persons with high education or highly experienced.

  • There are very few observations, approx 22%, with 0 or 1 job category.



Bivariate Analysis

sns.pairplot(data, hue="Risk")

output:

























  • There are overlaps i.e. no clear distinction in the distribution of variables for people who have defaulted and did not default.

  • Let's explore this further with the help of other plots.


sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Age", data=data, orient="vertical")

Output:











  • We can see that the median age of defaulters is less than the median age of non-defaulters.

  • This shows that younger customers are more likely to default.

  • There are outliers in boxplots of both class distributions



Data Preparation for Modeling

Split data

X = df.drop(["Risk"], axis=1)
y = df["Risk"]
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)

output:

(600, 9) (200, 9) (200, 9)

Missing-Value Treatment Using Dummies

  • We will use mode to impute missing values in Saving accounts and Checking account column.

# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Saving accounts", "Checking account"]

# fit and transform the imputer on train data
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])

# Transform on validation and test data
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])

# fit and transform the imputer on test data
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)

Model evaluation criterion

We will be using Recall as a metric for our model performance because here company could face 2 types of losses

  1. Could Give loan to defaulters - Loss of money

  2. Not give Loan to non-defaulters - Loss of opportunity

Which Loss is greater?

  • Giving loan to defaulters i.e Predicting a person not at risk, while actually person is at risk of making a default.

How to reduce this loss i.e need to reduce False Negatives?

  • Company wants recall to be maximized i.e. we need to reduce the number of false negatives.


models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    score.append(scores)
    print("{}: {}".format(name, scores))

output:

Cross-Validation Performance:

Bagging: 24.444444444444446
Random forest: 24.444444444444446
GBM: 25.0
Adaboost: 25.0
Xgboost: 27.222222222222225
dtree: 43.33333333333333

Validation Performance:

Bagging: 0.2833333333333333
Random forest: 0.31666666666666665
GBM: 0.31666666666666665
Adaboost: 0.26666666666666666
Xgboost: 0.26666666666666666
dtree: 0.31666666666666665

Result Coparison

# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()

Output:




Comments


bottom of page