
Integrating ML and DL to Create a Hybrid Model for Fraud Detection in Credit Cards | Realcode4you

Project Aim

I aim to construct an advanced model geared towards identifying cyber fraud in credit card transactions using a blend of Machine Learning (ML) and Deep Learning (DL) techniques.


Requirements

Credit card fraud detection using a new hybrid model (ML + DL)

My initial model for detecting fraud in credit card transactions is outlined below:



Datasets:

Try the following two datasets:

Dataset 1: Credit card transactions dataset (creditcardfull.csv, used in the implementation below)

Dataset 2: Synthetic data from a financial payment system (kaggle.com)


Requirements:

  1. Update and improve the above initial model according to your methodology.

  2. Examine the datasets thoroughly, including creating visual representations to understand how different features relate to non-fraudulent and fraudulent cases.

  3. Clean and prepare the data for analysis.

  4. Choose the most relevant features from the dataset.

  5. Divide the data into two parts: one for training the model and one for testing its performance.

  6. Use machine learning models such as Random Forest (RF), Decision Trees (DT), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), XGBClassifier, CatBoostClassifier, and any other effective algorithms (you do not need to use all of them).

  7. Use deep learning models such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (BLSTM), DCN, ANN, GRU, GNN, and any other efficient algorithms (you do not need to use all of them).

  8. Tune the hyperparameters of the models.

  9. Implement effective sampling techniques to address the class imbalance, such as SMOTE, ENN, Tomek links, random sampling, or a combination of these. Justify why you chose that technique and not the others.

  10. Show results before and after using sampling.

  11. Collect the predictions from each individual algorithm and combine them into a new model using techniques such as stacking, voting, or any other efficient approach (see the stacking sketch after this list).

  12. Utilize ensemble techniques to enhance model performance.

  13. Evaluate the model's performance using metrics like accuracy, precision, recall, F1 score, ROC AUC score, and create a confusion matrix. Also, employ cross-validation for robust evaluation.

  14. Generate charts and visualizations to present the results of the study.

  15. Supply comprehensive instructions along with detailed explanations of each step and the corresponding code.

  16. Justify every single step: why you use this approach and not the others.

  17. Provide a discussion chapter (in a separate document) after obtaining the results, explaining the results and the effectiveness of the new model. This chapter should cover the new model itself, the innovation it introduces, the algorithms and methodology used, the results and findings, and an analysis of those results. It should also compare the new model with related work from the literature (choose some models that used the same techniques) to show that the new model achieves higher performance. Clearly explain the innovative methodology (algorithm), what is new in the model, and how it differs from the models in the related work.

  18. Use Jupyter to write and run the code.

  19. The requirements are based on what I know; do not rely on my provided code. You're welcome to create an efficient, high-level model. This is a Ph.D.-level project, so aim to develop a new and creative model that has not been used before in the state of the art. You do not need to use all the ML and DL algorithms mentioned above; exclude some if that leads to higher performance and better results. Go deep into these ML and DL techniques (modify their structures) and create an innovative model rather than relying on the classic structures of the algorithms.

  20. My primary focus is creating an advanced, innovative model for detecting credit card fraud. It's crucial that this model is novel, inventive, and has not been previously documented in existing literature. You're encouraged to innovate, introducing fresh methodologies and developing a unique model specifically for this purpose.
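
As a point of reference for item 11, the sketch below shows one way the individual classifiers imported in the implementation could be combined with scikit-learn's StackingClassifier. The choice of base estimators and the logistic-regression meta-learner are illustrative assumptions, not the final hybrid model.

# Minimal stacking sketch (illustrative only): a few base learners feed class
# probabilities to a logistic-regression meta-learner via out-of-fold predictions.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
]

stacked_model = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method='predict_proba',  # pass predicted probabilities to the meta-learner
    cv=5,                          # out-of-fold predictions build the meta-features
)

# After a train/test split (as in the implementation below):
# stacked_model.fit(X_train, y_train)
# y_pred = stacked_model.predict(X_test)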


Implementation

Import Libraries

import os                      # accessing directory structure
import datetime as dt
import pickle                  # saving/loading fitted models and transformers
import numpy as np
import pandas as pd            # data processing, CSV file I/O (e.g. pd.read_csv)

# Preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# ML models
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVC

# Model evaluation
from sklearn import metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, roc_curve, auc, confusion_matrix,
                             classification_report)
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

# Resampling techniques for the class imbalance
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Deep learning (TensorFlow / Keras)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import (Dense, Flatten, Conv1D, BatchNormalization, Dropout,
                                     GlobalMaxPooling2D, MaxPooling2D,
                                     LSTM, GRU, Bidirectional)
from tensorflow.keras.optimizers import Adam, SGD, RMSprop

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
plt.style.use('ggplot')

# Disable warnings
import warnings
warnings.filterwarnings('ignore')

1- Examine the data thoroughly, including creating visual representations to understand how different features relate to non-fraudulent and fraudulent cases.


data = pd.read_csv('creditcardfull.csv')
pd.options.display.float_format = '{:,.2f}'.format
data.head()

len(data)

out:

284807

%%time
plt.figure(figsize=(15,5))
plt.title('Time Distribution')
# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE gives the same view
sns.histplot(data['Time'], kde=True, stat='density')

out:

CPU times: total: 2.86 s
Wall time: 3.08 s

<Axes: title={'center': 'Time Distribution'}, xlabel='Time', ylabel='Density'>



# 'Time' is too random a variable to contribute meaningfully to the analysis, so drop the column
data = data.drop(['Time'], axis=1)

# The main features are
features = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11','V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21','V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']

### Display the class counts; the dataset is highly imbalanced
data['Class'].value_counts().plot(kind='bar')
plt.title('Fraud vs Non-Fraud Histogram')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()
print("The split between fraud and non-fraud cases:", data['Class'].value_counts())
# Higher values of Amount distort the graph, so we plot only the 0-99th percentile of Amount

out:


The split between fraud and non-fraud cases: Class
0    284315
1       492
Name: count, dtype: int64
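
For a quick quantitative check of the imbalance, a single line (illustrative, not part of the original notebook) prints the fraud rate directly:

# Fraction of transactions that are fraudulent (about 0.17% here: 492 of 284,807)
print(f"Fraud rate: {data['Class'].mean():.4%}")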


# Create subplots for visualizing features for each cases fraud and non fraud
fig, axes = plt.subplots(nrows=9, ncols=3, figsize=(15, 18))
fig.suptitle('Features vs fraud class\n', size=18)
# Create boxplots for each feature
for i, feature in enumerate(features[:-2]):
    row, col = i // 3, i % 3  # Calculate the row and column for the subplot
    # Create a boxplot for the feature grouped by 'Class' using the viridis palette
    sns.boxplot(ax=axes[row, col], data=data, x='Class', y=feature, palette='viridis')
    axes[row, col].set_title(f"{feature} Distribution")

out:


# Distribution of the anonymised features, fraud vs non-fraud
anomalous_features = data.iloc[:, 0:28].columns  # V1-V28 (after dropping 'Time')
plt.figure(figsize=(12, 28*4))
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(data[anomalous_features]):
    ax = plt.subplot(gs[i])
    # sns.distplot is deprecated; overlay the per-class distributions instead
    sns.histplot(data[cn][data.Class == 1], bins=50, kde=True, stat='density', ax=ax, label='Fraud')
    sns.histplot(data[cn][data.Class == 0], bins=50, kde=True, stat='density', ax=ax, label='Non-fraud')
    ax.set_xlabel('')
    ax.set_title('Histogram of feature: ' + str(cn))
    ax.legend()
plt.show()

out:



2- Clean and prepare the data for analysis

plt.figure(figsize = (14,14))
plt.title('Credit Card Transactions features correlation plot (Pearson)')
corr = data.corr()
sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns,linewidths=.1,cmap="Reds")
plt.show()

out:


plt.figure(figsize=(20, 15))
sns.heatmap(data.corr(), cmap='PiYG', annot=True, linewidths=1, fmt='0.2f')
plt.show()

out:
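
The "clean and prepare" step also calls for a basic data-quality check. A minimal sketch, assuming nothing about the dataset's actual missing-value or duplicate counts:

# Basic data-quality checks for the cleaning step (illustrative)
print("Missing values per column:\n", data.isnull().sum())
print("Duplicate rows:", data.duplicated().sum())
# data = data.drop_duplicates()  # uncomment to remove exact duplicate transactions before modelling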


3- Divide the data into two parts: one for training the model and one for testing its performance. An 80% training / 20% testing split is the industry standard, and this split is used for all the algorithms.


# Preprocess the data
X = data.drop('Class', axis=1)
y = data['Class']
# Split the data into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Note: this dataset has no missing values; dropping NaNs from X_test and y_test
# separately after the split could misalign them, so no row dropping is done here.
# Standardize the features (optional but often recommended)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

fraudulent_count_full = data['Class'].sum()  # Total fraudulent cases in the full dataset
fraudulent_count_train = y_train.sum()  # Total fraudulent cases in the training set
fraudulent_count_test = y_test.sum()  # Total fraudulent cases in the test set
print("Fraudulent Count for Full data:", fraudulent_count_full)
print("Fraudulent Count for Train data:", fraudulent_count_train)
print("Fraudulent Count for Test data:", fraudulent_count_test)

out:

Fraudulent Count for Full data: 492
Fraudulent Count for Train data: 394
Fraudulent Count for Test data: 98
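
Requirement 9 asks for a resampling strategy before training. A minimal sketch of how SMOTE could be applied to the training split only (the test set keeps its natural imbalance); the resampled variable names are illustrative:

# Oversample the minority class in the training data only (SMOTE is imported above)
from collections import Counter

print("Class counts before SMOTE:", Counter(y_train))
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print("Class counts after SMOTE: ", Counter(y_train_res))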


def model_metrics(y_test, model, model_name, X_train, y_train):
  # Perform cross-validation (e.g., with 5 folds); accuracy alone can be misleading
  # on such an imbalanced dataset, so the per-class metrics below matter more
  cross_val_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
  # Print cross-validation scores
  print("Cross-validation scores:", cross_val_scores)
  print("Mean accuracy:", np.mean(cross_val_scores))
  # Evaluating the classifier
  y_pred = model.predict(X_test)
  print(f"The model used is {model_name}")
  prec = precision_score(y_test, y_pred)
  print(f"The precision score is {prec}")
  rec = recall_score(y_test, y_pred)
  print(f"The recall score is {rec}")
  f1 = f1_score(y_test, y_pred)
  print(f"The f1 score is {f1}")
  roc_auc = roc_auc_score(y_test, y_pred)
  print(f"The roc_auc_score score is {roc_auc}")
  confusion_mat = confusion_matrix(y_test, y_pred)
  plt.figure(figsize=(8, 8))
  sns.heatmap(confusion_mat, annot=True, fmt="d", cmap="Blues", xticklabels=["Not Fraud", "Fraud"], yticklabels=["Not Fraud", "Fraud"])
  plt.xlabel('Predicted')
  plt.ylabel('Actual')
  plt.title('Confusion Matrix')
  plt.show()
  # Predicted probability of the positive (fraud) class; the model must support
  # predict_proba (e.g. SVC needs probability=True)
  y_pred_proba = model.predict_proba(X_test)[:, 1]
  # Calculate ROC curve and AUC
  fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
  roc_auc = auc(fpr, tpr)
  # Plot ROC curve
  plt.figure(figsize=(8, 6))
  plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = {:.2f})'.format(roc_auc))
  plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
  plt.xlim([0.0, 1.0])
  plt.ylim([0.0, 1.05])
  plt.xlabel('False Positive Rate')
  plt.ylabel('True Positive Rate')
  plt.title('Receiver Operating Characteristic (ROC) Curve')
  plt.legend(loc='lower right')
  plt.show()
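
As a usage illustration (not part of the original notebook), a baseline classifier could be fitted and passed to model_metrics as follows; the Random Forest settings are assumptions:

# Example: evaluate a baseline Random Forest with the helper above (illustrative settings)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
model_metrics(y_test, rf_model, "Random Forest", X_train, y_train)

For the deep learning side (requirement 7), a minimal Keras sketch built from the layers imported above; the architecture and training settings are illustrative assumptions, not the final hybrid design:

# Minimal dense network on the standardized features (illustrative architecture)
ann_model = Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid'),   # probability of fraud
])
ann_model.compile(optimizer=Adam(learning_rate=1e-3),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
ann_model.fit(X_train, y_train, epochs=10, batch_size=2048,
              validation_split=0.1, verbose=1)
# Threshold the predicted probabilities at 0.5 to get class labels
y_pred_dl = (ann_model.predict(X_test) > 0.5).astype(int).ravel()
print(classification_report(y_test, y_pred_dl))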




