
How to Build Logistic Regression Prediction Model Using Given Dataset | Realcode4you

Building a Logistic Regression Model

Libraries Required

import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.subplots as sp
import warnings
warnings.filterwarnings('ignore')
print('Libraries imported')

Data Description

The variables are described as follows:

plcy_id	Auto Policy ID
sample	Identifies whether the record is available for training or is part of the holdout data
curnt_bi_low	Bodily Injury Coverage Individual Limit on the State Farm policy for which they are applying
curnt_bi_upp	Bodily Injury Coverage Occurrence Limit on the State Farm policy for which they are applying
curnt_pd_lmt	Property Damage Coverage Limit on the State Farm policy for which they are applying
curnt_coll_ded	Collision Coverage Deductible on the State Farm policy for which they are applying (missing means no coverage)
curnt_comp_ded	Comprehensive Coverage Deductible on the State Farm policy for which they are applying (missing means no coverage)
hh_veh_cnt	Number of vehicles in the household
hh_cnt_auto	Number of automobiles in the household
hh_cnt_mtrcyc	Number of motorcycles in the household
hh_veh_w_coll_cnt	Number of vehicles in the household with Collision Coverage
hh_veh_w_comp_cnt	Number of vehicles in the household with Comprehensive Coverage
hh_veh_lien_cnt	Number of vehicles in the household owned but with an outstanding loan
hh_veh_lease_cnt	Number of vehicles in the household being leased
hh_veh_own_cnt	Number of vehicles in the household owned outright by the named insured
veh_ownership	Whether the vehicle on the policy is leased (Lease), owned outright (Own), or has an outstanding loan (Lien)
annual_mileage	Annual mileage the vehicle is driven
veh_make	Make of the vehicle on the policy
veh_model	Model of the vehicle on the policy
veh_age	Age of the vehicle on the policy
min_hh_veh_age	Minimum age of vehicles in the household
max_hh_veh_age	Maximum age of vehicles in the household
avg_hh_veh_age	Average age of vehicles in the household
hh_drvr_cnt	Number of drivers in the household
hh_min_age	Minimum age of drivers in the household
hh_max_age	Maximum age of drivers in the household
hh_avg_age	Average age of drivers in the household
hh_min_mon_lic	Minimum number of months licensed among drivers in the household
hh_max_mon_lic	Maximum number of months licensed among drivers in the household
hh_avg_mon_lic	Average number of months licensed among drivers in the household
hh_cnt_yth	Number of youthful drivers in the household (less than 18 years old)
hh_cnt_female	Number of female drivers in the household
hh_cnt_male	Number of male drivers in the household
hoh_married	Indicator of whether the head of household is married
hh_cnt_majr_viol	Total number of major violations for drivers in the household
hh_cnt_minr_viol	Total number of minor violations for drivers in the household
hh_cnt_lic_susp	Number of drivers with a suspended license in the household
prior_insurer	Name of the insurer immediately prior to coming to State Farm
time_w_carr	Number of months insured with the prior insurer
inforce_ind	Indicator of whether the household was insured immediately prior to coming to State Farm
multiline_ind	Indicator of whether the household is also applying for a homeowners or renters policy with State Farm
homeowner_ind	Indicator of whether the household owns their home
monthly_pay_ind	Indicator of whether the household will pay the policy premium on a monthly payment plan
credit_score	Credit score for the named insured driver (values of 0 indicate a credit report was not found)
hh_atf_clm_cnt_py1	Number of prior at-fault claims in the household within the last 1 year
hh_atf_clm_cnt_py2	Number of prior at-fault claims in the household within the last 2 years
hh_atf_clm_cnt_py3	Number of prior at-fault claims in the household within the last 3 years
hh_atf_clm_cnt_py4	Number of prior at-fault claims in the household within the last 4 years
hh_atf_clm_cnt_py5	Number of prior at-fault claims in the household within the last 5 years
hh_naf_clm_cnt_py1	Number of prior not-at-fault claims in the household within the last 1 year
hh_naf_clm_cnt_py2	Number of prior not-at-fault claims in the household within the last 2 years
hh_naf_clm_cnt_py3	Number of prior not-at-fault claims in the household within the last 3 years
hh_naf_clm_cnt_py4	Number of prior not-at-fault claims in the household within the last 4 years
hh_naf_clm_cnt_py5	Number of prior not-at-fault claims in the household within the last 5 years
future_clm_ind	Indicator of whether the household had a claim within the first year of being insured with State Farm

Read Dataset

path = 'D:/Freelancing/3-11-2023'
df = pd.read_csv(path+'/'+'DS_Work_Sample_Data.csv')
df.head()

output: (first five rows of df)

df.shape

out: (60000, 56)


df[df['future_clm_ind'].isna()].shape

out: (20000, 56)


df_test = df[df['future_clm_ind'].isna()].copy()
df_test.head()

output: (first five rows of df_test)

df[df['future_clm_ind']==0.0].shape

output:

(37532, 56)

df[df['future_clm_ind']==1.0].shape

output:

(2468, 56)

df_train = df[(df['future_clm_ind']==0.0) | (df['future_clm_ind']==1.0)]
df_test.shape

out: (20000, 56)

df_train.shape

out: (40000, 56)



Visualization with Training Data

Model with Feature Engineering

# create fontdicts for formatting figure text
axtitle_dict = {'family': 'serif', 'color': 'red', 'weight': 'bold', 'size': 16}
axlab_dict = {'family': 'serif', 'color': 'black', 'size': 14}
# one colour per subplot, cycled through the plot grids below
colours = ['forestgreen','dodgerblue','goldenrod','coral','silver','gold','dodgerblue',
           'forestgreen','dodgerblue','goldenrod','coral','silver',
           'forestgreen','dodgerblue','goldenrod','coral','silver','gold','dodgerblue']

Handling Missing Values

Missing Data - Initial Intuition

Here we have missing data. Some general rules of thumb for handling missing values:

For features with few missing values, we can either predict the missing entries with a regression or fill them with the mean of the observed values, depending on the feature.

For features with a very high share of missing values, it is better to drop the column, as it contributes very little to the analysis. There is no fixed criterion for when to drop a column, but more than 30-40% missing values is a common cutoff; a short sketch of this rule follows.
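
As a quick sketch of the 30-40% rule (using df_train as created above, before any cleaning), the share of missing values per column can be computed and the columns crossing the 30% boundary flagged:

# percent of missing values per column, highest first
missing_pct = df_train.isna().mean().sort_values(ascending=False) * 100
# columns above the 30% rule-of-thumb boundary are candidates for dropping
heavy_missing = missing_pct[missing_pct > 30].index.tolist()
print(missing_pct.head(10))
print('Columns above the 30% threshold:', heavy_missing)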


# print number of rows of each attributes for which the value is NULL.
print(df_train.isna().sum().sort_values(ascending = False))
print('Number of Duplicate Values in df : ' ,df_train.duplicated().sum())

output: (per-column missing-value counts and the duplicate count)

In this case, 30% is used as the boundary. Since curnt_comp_ded has more than 30% of its values missing, that feature is removed. For the remaining features:

Rows with missing numerical values are removed.


df_train.drop(columns=['curnt_comp_ded'], inplace=True)
## Remove identifier/bookkeeping features not useful for modelling
df_train.drop(columns=['Unnamed: 0','plcy_id','sample'], inplace=True)
# Missing collision deductible means no coverage
df_train['curnt_coll_ded'].fillna(0, inplace=True)
# Remove rows with any remaining missing values
df_train = df_train.dropna(subset=list(df_train.columns))
df_train=df_train[df_train['credit_score']!=0].copy()
# mode_hh_veh_own_cnt = df_train['hh_veh_own_cnt'].mode().values[0]
# df_train['hh_veh_own_cnt'].fillna(mode_hh_veh_own_cnt, inplace=True)
# mode_hh_veh_lease_cnt = df_train['hh_veh_lease_cnt'].mode().values[0]
# df_train['hh_veh_lease_cnt'].fillna(mode_hh_veh_lease_cnt, inplace=True)
# mode_hh_veh_lien_cnt = df_train['hh_veh_lien_cnt'].mode().values[0]
# df_train['hh_veh_lien_cnt'].fillna(mode_hh_veh_lien_cnt, inplace=True)
# mode_veh_ownership = df_train['veh_ownership'].mode().values[0]
# df_train['veh_ownership'].fillna(mode_veh_ownership, inplace=True)
# mean_annual_mileage = df_train['annual_mileage'].mean()
# df_train['annual_mileage'].fillna(mean_annual_mileage, inplace=True)

# print number of rows of each attributes for which the value is NULL.
print(df_train.isna().sum().sort_values(ascending = False))
print('Number of Duplicate Values in df : ' ,df_train.duplicated().sum())

output:

curnt_bi_low          0
curnt_bi_upp          0
hh_cnt_female         0
hh_cnt_male           0
hoh_married           0
hh_cnt_majr_viol      0
hh_cnt_minr_viol      0
hh_cnt_lic_susp       0
prior_insurer         0
time_w_carr           0
inforce_ind           0
multiline_ind         0
homeowner_ind         0
monthly_pay_ind       0
credit_score          0
hh_atf_clm_cnt_py1    0
hh_atf_clm_cnt_py2    0
hh_atf_clm_cnt_py3    0
hh_atf_clm_cnt_py4    0
hh_atf_clm_cnt_py5    0
hh_naf_clm_cnt_py1    0
hh_naf_clm_cnt_py2    0
hh_naf_clm_cnt_py3    0
hh_naf_clm_cnt_py4    0
hh_naf_clm_cnt_py5    0
hh_cnt_yth            0
hh_avg_mon_lic        0
hh_max_mon_lic        0
veh_ownership         0
curnt_pd_lmt          0
curnt_coll_ded        0
hh_veh_cnt            0
hh_cnt_auto           0
hh_cnt_mtrcyc         0
hh_veh_w_coll_cnt     0
hh_veh_w_comp_cnt     0
hh_veh_lien_cnt       0
hh_veh_lease_cnt      0
hh_veh_own_cnt        0
annual_mileage        0
hh_min_mon_lic        0
veh_make              0
veh_model             0
veh_age               0
min_hh_veh_age        0
max_hh_veh_age        0
avg_hh_veh_age        0
hh_drvr_cnt           0
hh_min_age            0
hh_max_age            0
hh_avg_age            0
future_clm_ind        0
dtype: int64
Number of Duplicate Values in df :  0

df_train.columns

out:

Index(['curnt_bi_low', 'curnt_bi_upp', 'curnt_pd_lmt', 'curnt_coll_ded',
       'hh_veh_cnt', 'hh_cnt_auto', 'hh_cnt_mtrcyc', 'hh_veh_w_coll_cnt',
       'hh_veh_w_comp_cnt', 'hh_veh_lien_cnt', 'hh_veh_lease_cnt',
       'hh_veh_own_cnt', 'veh_ownership', 'annual_mileage', 'veh_make',
       'veh_model', 'veh_age', 'min_hh_veh_age', 'max_hh_veh_age',
       'avg_hh_veh_age', 'hh_drvr_cnt', 'hh_min_age', 'hh_max_age',
       'hh_avg_age', 'hh_min_mon_lic', 'hh_max_mon_lic', 'hh_avg_mon_lic',
       'hh_cnt_yth', 'hh_cnt_female', 'hh_cnt_male', 'hoh_married',
       'hh_cnt_majr_viol', 'hh_cnt_minr_viol', 'hh_cnt_lic_susp',
       'prior_insurer', 'time_w_carr', 'inforce_ind', 'multiline_ind',
       'homeowner_ind', 'monthly_pay_ind', 'credit_score',
       'hh_atf_clm_cnt_py1', 'hh_atf_clm_cnt_py2', 'hh_atf_clm_cnt_py3',
       'hh_atf_clm_cnt_py4', 'hh_atf_clm_cnt_py5', 'hh_naf_clm_cnt_py1',
       'hh_naf_clm_cnt_py2', 'hh_naf_clm_cnt_py3', 'hh_naf_clm_cnt_py4',
       'hh_naf_clm_cnt_py5', 'future_clm_ind'],
      dtype='object')

Encoding

df_train['veh_ownership'].value_counts()

output:

Own      19480
Lien      9549
Lease      722
Name: veh_ownership, dtype: int64

df_train['veh_make'].value_counts()

out:

A    2015
D    1952
B    1899
C    1893
F    1694
E    1670
G    1608
H    1490
J    1433
I    1407
K    1359
L    1334
N    1294
M    1259
P    1199
O    1195
Q    1112
R    1105
S    1008
T     579
U     336
V     293
W     233
X     200
Y     109
Z      75
Name: veh_make, dtype: int64

# Label encoding (not used; kept for reference):
# from sklearn.preprocessing import LabelEncoder
# label_encoder = LabelEncoder()
# df_train['veh_make'] = label_encoder.fit_transform(df_train['veh_make'])
# df_train['veh_model'] = label_encoder.fit_transform(df_train['veh_model'])
# df_train['prior_insurer'] = label_encoder.fit_transform(df_train['prior_insurer'])
# df_train['veh_ownership'] = label_encoder.fit_transform(df_train['veh_ownership'])
# One-hot encode the nominal categorical features
df_train = pd.get_dummies(df_train, columns=['veh_make','veh_model','prior_insurer','veh_ownership'])
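
One caveat worth noting: for a linear model such as logistic regression, one-hot encoding all k levels of a categorical introduces a redundant column (the "dummy variable trap"). The pipeline above keeps all k columns; here is a small sketch of the drop_first variant on a toy frame, in case you want k-1 indicators instead:

toy = pd.DataFrame({'veh_ownership': ['Own', 'Lien', 'Lease', 'Own']})
# drop_first=True keeps k-1 indicator columns per categorical,
# dropping the level a linear model can infer from the others
encoded = pd.get_dummies(toy, columns=['veh_ownership'], drop_first=True)
print(encoded)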

### Convert the target feature from float to int
df_train['future_clm_ind']=df_train['future_clm_ind'].astype(int)
df_train['future_clm_ind'].unique()

out:

array([0, 1])


df_train['future_clm_ind'].value_counts()

out:

0    27891
1     1860
Name: future_clm_ind, dtype: int64

Resampling The Data

from sklearn.utils import resample
# Separate the majority and minority classes
majority_class = df_train[df_train['future_clm_ind'] == 0]
minority_class = df_train[df_train['future_clm_ind'] == 1]
# Upsample the minority class to match the size of the majority class
minority_upsampled = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)
# Combine the upsampled minority class with the majority class
df_train= pd.concat([majority_class, minority_upsampled])

df_train['future_clm_ind'].value_counts()

out:

0    27891
1    27891
Name: future_clm_ind, dtype: int64
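
A caveat on this step: because the upsampling happens before the train/test split below, duplicated minority rows can land in both the training and test folds, which tends to inflate test metrics. Here is a sketch of a leakage-safe alternative (df_imbalanced is a hypothetical name for the training frame before any upsampling):

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
# df_imbalanced stands for the training frame *before* any upsampling
train_part, test_part = train_test_split(
    df_imbalanced, test_size=0.33, random_state=42,
    stratify=df_imbalanced['future_clm_ind'])
majority = train_part[train_part['future_clm_ind'] == 0]
minority = train_part[train_part['future_clm_ind'] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])
# evaluate on test_part, which contains no duplicated minority rows

Setting class_weight='balanced' on LogisticRegression is another common way to handle the imbalance without duplicating any rows.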

Feature Engineering

df_train.columns

out:

Index(['curnt_bi_low', 'curnt_bi_upp', 'curnt_pd_lmt', 'curnt_coll_ded',
       'hh_veh_cnt', 'hh_cnt_auto', 'hh_cnt_mtrcyc', 'hh_veh_w_coll_cnt',
       'hh_veh_w_comp_cnt', 'hh_veh_lien_cnt',
       ...
       'prior_insurer_Rho', 'prior_insurer_Sigma', 'prior_insurer_Tau',
       'prior_insurer_Theta', 'prior_insurer_Upsilon', 'prior_insurer_Xi',
       'prior_insurer_Zeta', 'veh_ownership_Lease', 'veh_ownership_Lien',
       'veh_ownership_Own'],
      dtype='object', length=334)

num_cols = ['curnt_bi_low', 'curnt_bi_upp', 'curnt_pd_lmt', 'curnt_coll_ded',
           'annual_mileage', 'veh_age', 'min_hh_veh_age', 'max_hh_veh_age',
           'avg_hh_veh_age', 'hh_min_age', 'hh_max_age',
           'hh_avg_age', 'hh_min_mon_lic', 'hh_max_mon_lic', 'hh_avg_mon_lic',
          'time_w_carr','credit_score']
df_cor = df_train[num_cols].copy()

Box Plot / Outlier Analysis in Numerical Variables

# create a figure with a 6 x 3 grid of subplots
fig = plt.figure(figsize=[26,22])
fig.suptitle('DISTPLOT OF DATA', fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
fig.subplots_adjust(hspace=0.5, wspace=0.4)
for i, col in enumerate(num_cols):
    ax = fig.add_subplot(6, 3, i+1)
    ax = sns.distplot(df_train[col], color='dodgerblue')  # sns.histplot(..., kde=True) in newer seaborn
    ax.axvline(df_train[col].quantile(q=0.25), color='green', linestyle='--', label='25% Quartile')
    ax.axvline(df_train[col].mean(), color='red', linestyle='--', label='Mean')
    ax.axvline(df_train[col].median(), color='black', linestyle='--', label='Median')
    ax.axvline(df_train[col].quantile(q=0.75), color='blue', linestyle='--', label='75% Quartile')
    ax.set_xlabel(f'{col}', fontdict=axlab_dict)
    ax.set_title(f'{col.upper()}    skewness {round(df_train[col].skew(),3)}', fontdict=axtitle_dict)
    ax.legend(fontsize=10)

out: (grid of distribution plots with quartile, mean, and median markers for each numerical variable)


# create a figure with a 6 x 3 grid of subplots
fig = plt.figure(figsize=[26,22])
fig.suptitle('BOXPLOT OF DATA', fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
fig.subplots_adjust(hspace=0.5, wspace=0.4)
for i, col in enumerate(num_cols):
    ax1 = fig.add_subplot(6, 3, i+1)
    ax1 = sns.boxplot(data=df_train, x=col, color=colours[i])
    ax1.set_title(f'{col}', fontdict=axtitle_dict)
    ax1.set_xlabel(f'{col}', fontdict=axlab_dict)

out: (grid of box plots, one per numerical variable)


# plot correlation matrix heatmap
fig, ax = plt.subplots(figsize=[13,5])
sns.heatmap(df_cor.corr(), ax=ax,  annot=True, linewidths=0.05, fmt= '.2f',cmap='RdBu')
ax.tick_params(axis='both', which='major', labelsize=14)
ax.set_title('Dataset Correlation Matrix', fontdict=axtitle_dict)
plt.show()

out: (correlation matrix heatmap of the numerical variables)


import pandas as pd
import statsmodels.api as sm
# Loop through each feature and calculate its VIF by regressing it on the others
vif_rows = []
for feature in df_cor.columns:
    X = df_cor.drop(columns=[feature])
    y = df_cor[feature]
    model = sm.OLS(y, sm.add_constant(X)).fit()
    vif_rows.append({'Feature': feature, 'VIF': 1 / (1 - model.rsquared)})
vif_data = pd.DataFrame(vif_rows)
# Keep features with VIF below 25
selected_features = vif_data[vif_data['VIF'] < 25]['Feature'].tolist()
print("Features with VIF < 25:", selected_features)

out:

Features with VIF < 25: ['curnt_bi_low', 'curnt_bi_upp', 'curnt_pd_lmt', 'curnt_coll_ded', 'annual_mileage', 'veh_age', 'time_w_carr', 'credit_score']
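
statsmodels also ships a ready-made helper that computes the same quantity, which makes a handy cross-check of the loop above:

from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = sm.add_constant(df_cor)
# index 0 is the added constant; the remaining indices map to df_cor columns
vifs = {col: variance_inflation_factor(X_vif.values, i)
        for i, col in enumerate(X_vif.columns) if col != 'const'}
print(vifs)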

difference = list(set(num_cols) - set(selected_features))
final_features = list(set(list(df_train.columns))-set(difference))
difference

out:

['hh_avg_age',
 'hh_min_mon_lic',
 'max_hh_veh_age',
 'min_hh_veh_age',
 'hh_max_age',
 'avg_hh_veh_age',
 'hh_max_mon_lic',
 'hh_avg_mon_lic',
 'hh_min_age']

df_train['hh_veh_age_new']=(df_train['min_hh_veh_age']+df_train['max_hh_veh_age'])/2
df_train['hh_new_mon_lic']=(df_train['hh_max_mon_lic']+df_train['hh_min_mon_lic'])/2
df_train['hh_veh_age_new1']=(df_train['min_hh_veh_age']+1)*(1+df_train['max_hh_veh_age'])
df_train['hh_veh_age_new2'] = np.log(df_train['hh_veh_age_new1'])
df_train['hh_new_mon_lic1']=(df_train['hh_max_mon_lic']+1)*(1+df_train['hh_min_mon_lic'])
df_train['hh_new_mon_lic2'] = np.log(df_train['hh_new_mon_lic1'])
df_train['hh_new_age']=(df_train['hh_min_age']+1)*(df_train['hh_max_age']+1)
df_train['hh_new_age1'] = np.log(df_train['hh_new_age'])
df_train['annual_mileage1'] = np.log(df_train['annual_mileage'])
df_train['NOD']=df_train['hh_cnt_yth']+df_train['hh_cnt_female']+df_train['hh_cnt_male']
df_train['DV']=df_train['hh_cnt_majr_viol']+df_train['hh_cnt_minr_viol']+df_train['hh_cnt_lic_susp']
df_train['NOV']=df_train['hh_veh_cnt']+df_train['hh_cnt_auto']+df_train['hh_cnt_mtrcyc']+df_train['hh_veh_w_coll_cnt']+df_train['hh_veh_w_comp_cnt']+df_train['hh_veh_lien_cnt']+df_train['hh_veh_lease_cnt']+df_train['hh_veh_own_cnt']
# Total at-fault claims across the last five years
df_train['ATF']=df_train['hh_atf_clm_cnt_py1']+df_train['hh_atf_clm_cnt_py2']+df_train['hh_atf_clm_cnt_py3']+df_train['hh_atf_clm_cnt_py4']+df_train['hh_atf_clm_cnt_py5']
# Total not-at-fault claims across the last five years, kept in its own column
df_train['NAF']=df_train['hh_naf_clm_cnt_py1']+df_train['hh_naf_clm_cnt_py2']+df_train['hh_naf_clm_cnt_py3']+df_train['hh_naf_clm_cnt_py4']+df_train['hh_naf_clm_cnt_py5']
# df_train.drop(columns={'min_hh_veh_age','max_hh_veh_age','hh_max_mon_lic','hh_min_mon_lic','hh_min_age','hh_max_age'},inplace=True)
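
One robustness note on the log transforms above: np.log returns -inf when its input is 0, which is why the shifted products add 1 before taking the log. np.log1p does the same shift in a single call; a sketch, in case annual_mileage can contain zeros:

# log1p(x) == log(1 + x); safe at x == 0, unlike np.log(x)
df_train['annual_mileage1'] = np.log1p(df_train['annual_mileage'])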

Modelling

X = df_train.drop(columns='future_clm_ind')
y = df_train['future_clm_ind']
from sklearn.preprocessing import MinMaxScaler
# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fit the scaler on the combined data and transform it
X = scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X , y  ,test_size = 0.33 , random_state = 42)
model_list = []
accuracy_list = []
recall_list = []
precision_list = []
f1_score_list= [] 
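
Note that the scaler above is fitted on all of X before the split, so the test fold's minima and maxima influence the transform. Here is a sketch of the leakage-free alternative, fitting the scaler on the training fold only:

X_raw = df_train.drop(columns='future_clm_ind')
y_all = df_train['future_clm_ind']
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_all, test_size=0.33, random_state=42)
scaler_safe = MinMaxScaler().fit(X_tr)   # fit on training data only
X_tr_s = scaler_safe.transform(X_tr)
X_te_s = scaler_safe.transform(X_te)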

def Model_features(X_train, y_train, X_test, y_test, y_pred, classifier, model_name):
    # micro-averaged precision; recall and f1 use the default binary average
    accuracy = round(accuracy_score(y_test, y_pred), 3)
    precision = round(precision_score(y_test, y_pred, average="micro"), 3)
    recall = round(recall_score(y_test, y_pred), 3)
    f1_s = round(f1_score(y_test, y_pred), 3)
    print(f'Accuracy Score is :{accuracy}')
    print(f'Precision Score is :{precision}')
    print(f'Recall Score is :{recall}')
    print(f'f1  Score is :{f1_s}')
    model_list.append(model_name)
    accuracy_list.append(accuracy)
    recall_list.append(recall)
    precision_list.append(precision)
    f1_score_list.append(f1_s)
    print(metrics.classification_report(y_test, y_pred))

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import recall_score , classification_report , confusion_matrix  ,roc_curve , roc_auc_score , accuracy_score
from sklearn.metrics import precision_recall_curve , auc ,f1_score , precision_score , recall_score
from sklearn.model_selection import cross_val_score

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Create an LDA object
lda = LinearDiscriminantAnalysis()

from sklearn.model_selection import GridSearchCV
# Define hyperparameters and their possible values
param_grid = {
    'C': [0.001, 0.01, 0.1,0.15],  # Regularization parameter
    'penalty': ['l1', 'l2'],              # Regularization type
    'max_iter': [100,180,200,210],        # Maximum number of iterations
}
# Create a logistic regression model
model_lr = LogisticRegression()
# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(model_lr, param_grid, cv=5, scoring='accuracy')
# Fit the grid search to your data
grid_search.fit(X_train, y_train)
# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Parameters: ",best_params)
# Train the final model with the best hyperparameters
best_model = LogisticRegression(**best_params)
# Create a pipeline combining LDA and Logistic Regression
best_model = Pipeline([('lda', lda), ('logistic', best_model)])
best_model.fit(X_train, y_train)
# Use cross-validation to assess the model's performance
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='accuracy')
# Calculate the mean accuracy score from cross-validation
mean_accuracy = cv_scores.mean()
print("Cross-Validation Mean Accuracy:", mean_accuracy)
# Predict on the test data
y_pred = best_model.predict(X_test)
# Evaluate the model on the test data
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", test_accuracy)

out:

Best Parameters:  {'C': 0.15, 'max_iter': 100, 'penalty': 'l2'}
Cross-Validation Mean Accuracy: 0.6822302940047236
Test Accuracy: 0.682600901732848
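
One caveat with the grid above: scikit-learn's default lbfgs solver does not support the 'l1' penalty, so in recent versions those grid cells are scored as NaN (with a FitFailedWarning) rather than fitted. A sketch of a solver-aware grid that actually exercises both penalties:

param_grid_l1l2 = {
    'C': [0.001, 0.01, 0.1, 0.15],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],   # liblinear supports both l1 and l2
    'max_iter': [100, 200],
}
search = GridSearchCV(LogisticRegression(), param_grid_l1l2, cv=5, scoring='accuracy')
# search.fit(X_train, y_train)   # same call pattern as above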

Without Hyperparameter Tuning

# Create a logistic regression model
model_lr = LogisticRegression(random_state=0)
# Fit the model on the training data
model_lr.fit(X_train, y_train)
# Use cross-validation to assess the model's performance
cv_scores = cross_val_score(model_lr, X_train, y_train, cv=5, scoring='accuracy')
# Calculate the mean accuracy score from cross-validation
mean_accuracy = cv_scores.mean()
print("Cross-Validation Mean Accuracy:", mean_accuracy)
# Predict on the test data
y_pred = model_lr.predict(X_test)
# Evaluate the model on the test data
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", test_accuracy)

out:

Cross-Validation Mean Accuracy: 0.6820430495729678
Test Accuracy: 0.6835786843391819

Model Evaluation

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Predict probabilities of the positive class
y_prob = best_model.predict_proba(X_test)[:, 1]
# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
# Calculate the AUC (Area Under the Curve)
roc_auc = roc_auc_score(y_test, y_prob)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()
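
Because the real class distribution is heavily imbalanced (the 50/50 balance here is artificial, from upsampling), a precision-recall curve is a useful complement to ROC. A sketch reusing y_prob from the cell above:

precision_vals, recall_vals, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall_vals, precision_vals)
plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, color='darkorange', lw=2,
         label=f'PR Curve (AUC = {pr_auc:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.show()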

Model_features(X_train , y_train , X_test , y_test  , y_pred , model_lr , "Logistic Regression")   

out:

Accuracy Score is :0.684
Precision Score is :0.684
Recall Score is :0.692
f1  Score is :0.686
              precision    recall  f1-score   support

           0       0.69      0.68      0.68      9205
           1       0.68      0.69      0.69      9204

    accuracy                           0.68     18409
   macro avg       0.68      0.68      0.68     18409
weighted avg       0.68      0.68      0.68     18409

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=best_model.classes_)
disp.plot(cmap='Blues', values_format='d')
plt.title('Confusion Matrix')
plt.show()

out: (confusion matrix plot)

# Get coefficients and their magnitudes
coefficients = model_lr.coef_
# Feature names must match the design matrix, i.e. exclude the target column
feature_names = df_train.drop(columns='future_clm_ind').columns
# Calculate the magnitude of coefficients
coeff_magnitudes = abs(coefficients)
# Sort features by magnitude
sorted_features = sorted(zip(feature_names, coeff_magnitudes[0]), key=lambda x: x[1], reverse=True)
# Print the features and their magnitudes
for feature, magnitude in sorted_features:
    print(f"Feature: {feature}, Magnitude: {magnitude}")

out:

Feature: hh_veh_age_new2, Magnitude: 4.987714406869197
Feature: hh_veh_w_comp_cnt, Magnitude: 4.247272756209435
Feature: hh_new_mon_lic2, Magnitude: 3.3822582934243792
Feature: veh_model_U9, Magnitude: 2.9466963024771693
Feature: hh_new_age1, Magnitude: 2.8501033829418536
Feature: hh_veh_lien_cnt, Magnitude: 2.177153370671304
Feature: hh_cnt_lic_susp, Magnitude: 2.091976999078532
Feature: veh_model_L7, Magnitude: 1.8821370904843262
Feature: hh_atf_clm_cnt_py3, Magnitude: 1.8798106446510203
Feature: veh_model_M7, Magnitude: 1.8390724014369997
Feature: veh_model_Q8, Magnitude: 1.752096811969229
Feature: veh_model_J8, Magnitude: 1.7508455842446162
Feature: veh_model_T2, Magnitude: 1.6533504619092458
Feature: veh_model_U8, Magnitude: 1.6458480789281105
...
...
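
Since the magnitudes above are for min-max scaled features, an easier read is the odds ratio: exp(coefficient) is the multiplicative change in the predicted odds of a claim per one-unit increase of the scaled feature. A sketch building on the fitted model_lr:

feat_names = df_train.drop(columns='future_clm_ind').columns
# exp(coef) > 1 raises the predicted odds of a claim; < 1 lowers them
odds_ratios = pd.Series(np.exp(model_lr.coef_[0]), index=feat_names)
print(odds_ratios.sort_values(ascending=False).head(10))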

Generate Prediction

df.shape

out:

(60000, 56)

# print number of rows of each attributes for which the value is NULL.
print(df.isna().sum().sort_values(ascending = False))
print('Number of Duplicate Values in df : ' ,df.duplicated().sum())

out:

future_clm_ind        20000
curnt_comp_ded        18564
curnt_coll_ded        14616
hh_veh_lease_cnt       7746
hh_veh_lien_cnt        7746
hh_veh_own_cnt         7746
veh_ownership          6850
annual_mileage         1928
hh_naf_clm_cnt_py2        0
monthly_pay_ind           0
hoh_married               0
hh_cnt_majr_viol          0
hh_cnt_minr_viol          0
hh_cnt_lic_susp           0
prior_insurer             0
time_w_carr               0
inforce_ind               0
multiline_ind             0
homeowner_ind             0
hh_atf_clm_cnt_py1        0
credit_score              0
hh_naf_clm_cnt_py3        0
hh_cnt_female             0
hh_naf_clm_cnt_py5        0
hh_naf_clm_cnt_py4        0
hh_atf_clm_cnt_py2        0
hh_atf_clm_cnt_py3        0
hh_atf_clm_cnt_py4        0
hh_atf_clm_cnt_py5        0
hh_naf_clm_cnt_py1        0
hh_cnt_male               0
Unnamed: 0                0
hh_cnt_yth                0
hh_avg_mon_lic            0
sample                    0
curnt_bi_low              0
curnt_bi_upp              0
curnt_pd_lmt              0
hh_veh_cnt                0
hh_cnt_auto               0
hh_cnt_mtrcyc             0
hh_veh_w_coll_cnt         0
hh_veh_w_comp_cnt         0
veh_make                  0
veh_model                 0
veh_age                   0
min_hh_veh_age            0
max_hh_veh_age            0
avg_hh_veh_age            0
hh_drvr_cnt               0
hh_min_age                0
hh_max_age                0
hh_avg_age                0
plcy_id                   0
hh_max_mon_lic            0
hh_min_mon_lic            0
dtype: int64
Number of Duplicate Values in df :  0

df.drop(columns=['curnt_comp_ded'], inplace=True)
## Remove identifier/bookkeeping features not useful for modelling
df.drop(columns=['Unnamed: 0','plcy_id','sample'], inplace=True)
# Missing collision deductible means no coverage
df['curnt_coll_ded'].fillna(0, inplace=True)
# Unlike training, we do not drop rows here: every policy must receive a score,
# so the remaining missing values are imputed below instead
mode_hh_veh_own_cnt = df['hh_veh_own_cnt'].mode().values[0]
df['hh_veh_own_cnt'].fillna(mode_hh_veh_own_cnt, inplace=True)
mode_hh_veh_lease_cnt = df['hh_veh_lease_cnt'].mode().values[0]
df['hh_veh_lease_cnt'].fillna(mode_hh_veh_lease_cnt, inplace=True)
mode_hh_veh_lien_cnt = df['hh_veh_lien_cnt'].mode().values[0]
df['hh_veh_lien_cnt'].fillna(mode_hh_veh_lien_cnt, inplace=True)
mode_veh_ownership = df['veh_ownership'].mode().values[0]
df['veh_ownership'].fillna(mode_veh_ownership, inplace=True)
mean_annual_mileage = df['annual_mileage'].mean()
df['annual_mileage'].fillna(mean_annual_mileage, inplace=True)

df = pd.get_dummies(df, columns=['veh_make','veh_model','prior_insurer','veh_ownership'])
df['hh_veh_age_new']=(df['min_hh_veh_age']+df['max_hh_veh_age'])/2
df['hh_new_mon_lic']=(df['hh_max_mon_lic']+df['hh_min_mon_lic'])/2
df['hh_veh_age_new1']=(df['min_hh_veh_age']+1)*(1+df['max_hh_veh_age'])
df['hh_veh_age_new2'] = np.log(df['hh_veh_age_new1'])
df['hh_new_mon_lic1']=(df['hh_max_mon_lic']+1)*(1+df['hh_min_mon_lic'])
df['hh_new_mon_lic2'] = np.log(df['hh_new_mon_lic1'])
df['hh_new_age']=(df['hh_min_age']+1)*(df['hh_max_age']+1)
df['hh_new_age1'] = np.log(df['hh_new_age'])
df['annual_mileage1'] = np.log(df['annual_mileage'])
df['NOD']=df['hh_cnt_yth']+df['hh_cnt_female']+df['hh_cnt_male']
df['DV']=df['hh_cnt_majr_viol']+df['hh_cnt_minr_viol']+df['hh_cnt_lic_susp']
df['NOV']=df['hh_veh_cnt']+df['hh_cnt_auto']+df['hh_cnt_mtrcyc']+df['hh_veh_w_coll_cnt']+df['hh_veh_w_comp_cnt']+df['hh_veh_lien_cnt']+df['hh_veh_lease_cnt']+df['hh_veh_own_cnt']
# Total at-fault claims across the last five years
df['ATF']=df['hh_atf_clm_cnt_py1']+df['hh_atf_clm_cnt_py2']+df['hh_atf_clm_cnt_py3']+df['hh_atf_clm_cnt_py4']+df['hh_atf_clm_cnt_py5']
# Total not-at-fault claims across the last five years, kept in its own column
df['NAF']=df['hh_naf_clm_cnt_py1']+df['hh_naf_clm_cnt_py2']+df['hh_naf_clm_cnt_py3']+df['hh_naf_clm_cnt_py4']+df['hh_naf_clm_cnt_py5']
# df_train.drop(columns={'min_hh_veh_age','max_hh_veh_age','hh_max_mon_lic','hh_min_mon_lic','hh_min_age','hh_max_age'},inplace=True)

df.columns

out:

Index(['curnt_bi_low', 'curnt_bi_upp', 'curnt_pd_lmt', 'curnt_coll_ded',
       'hh_veh_cnt', 'hh_cnt_auto', 'hh_cnt_mtrcyc', 'hh_veh_w_coll_cnt',
       'hh_veh_w_comp_cnt', 'hh_veh_lien_cnt',
       ...
       'hh_new_mon_lic1', 'hh_new_mon_lic2', 'hh_new_age',
       'hh_new_age1', 'annual_mileage1', 'NOD', 'DV', 'NOV', 'ATF', 'NAF'],
      dtype='object', length=348)

df.head()

out: (first five rows of the engineered scoring dataframe)

X = df.drop(columns='future_clm_ind')
from sklearn.preprocessing import MinMaxScaler
# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fit the scaler on the scoring data and transform it
X = scaler.fit_transform(X)
# Score every policy with the fitted pipeline
y_test_pred = best_model.predict(X)
df_predictions = pd.DataFrame({'glm_pred': y_test_pred})
# Rebuild plcy_id from the row order (rows are assumed to be in policy-ID order)
df_predictions['plcy_id'] = df_predictions.index + 1
df_predictions.sample(10)

out: (random sample of 10 rows of df_predictions)

sel_cols = ['plcy_id','glm_pred']
df_pred = df_predictions[sel_cols].copy()
# Keep only the policies predicted to have a claim in the first year
df_pred = df_pred[df_pred['glm_pred']==1].copy()
df_pred.to_csv(path+'/'+'DS_Worked_Sample_Scored.csv', index=False)
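
Two caveats in the scoring cells above: the MinMaxScaler is refit on the scoring data (so features are rescaled differently than at training time), and get_dummies on the full file can yield a column set that differs from the training design matrix. A sketch of the safer pattern, where train_columns and train_scaler are hypothetical objects saved at training time (the column list of the training X and the scaler fitted on it):

# align scoring columns to the training columns; dummies missing from the
# scoring frame become 0, and unseen dummy columns are dropped
X_score = df.drop(columns='future_clm_ind')
X_score = X_score.reindex(columns=train_columns, fill_value=0)
X_score_scaled = train_scaler.transform(X_score)   # reuse, do not refit
y_score_pred = best_model.predict(X_score_scaled)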
