Classification & the Typical Model Development Process in Machine Learning

In the previous blog we learned how to read data, do exploratory data analysis (EDA), prepare data for training, and train an ML model. However, we did not specifically discuss the typical ML pipeline. In this blog we will go through a typical ML model development process using a classification task as an example.


The lab can be executed either on your own machine (with an Anaconda installation) or on Google Colab.


Objective

  • Continue to familiarise yourself with Python and other ML packages

  • Learn to train a model for a classification problem

  • Practice the typical ML model development process.

Dataset

In this lab, we will be using the Cardiotocography Data Set from the UCI Machine Learning Repository. The dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians. 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label was assigned to each of them. The classification describes the fetal state (normal, suspect or pathologic). The version used in this lab is a modified version of the original dataset; the columns of the original dataset are:

  1. LB - FHR baseline (beats per minute)

  2. AC - # of accelerations per second

  3. FM - # of fetal movements per second

  4. UC - # of uterine contractions per second

  5. DL - # of light decelerations per second

  6. DS - # of severe decelerations per second

  7. DP - # of prolonged decelerations per second

  8. ASTV - percentage of time with abnormal short term variability

  9. MSTV - mean value of short term variability

  10. ALTV - percentage of time with abnormal long term variability

  11. MLTV - mean value of long term variability

  12. Width - width of FHR histogram

  13. Min - minimum of FHR histogram

  14. Max - maximum of FHR histogram

  15. Nmax - # of histogram peaks

  16. Nzeros - # of histogram zeros

  17. Mode - histogram mode

  18. Mean - histogram mean

  19. Median - histogram median

  20. Variance - histogram variance

  21. Tendency - histogram tendency

  22. TARGET: NSP - fetal state class code (1=normal; 2=suspect; 3=pathologic in the original file; remapped to 0, 1, 2 below)

The task for this lab is to predict whether a new fetal measurement is normal (0), suspect (1) or pathologic (2).

First, ensure the data file is located within the Jupyter workspace.

  • If you are working on your local machine, copy the file (Cardiotocography_Data_Set_subset.csv) to your current folder.

  • If you are on Google Colab, you can upload the data to the notebook session by clicking the upload-files icon in the left sidebar.


Load the dataset and do some cleaning

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('./Cardiotocography_Data_Set_subset.csv', delimiter=',')
data.head()


Output:

(table: the first five rows of the dataset)
Transform the target to match the task

print(set(data['NSP'].astype(int)))

Output:

{1, 2, 3}


data['NSP'] = data['NSP'].astype(int) - 1
print(set(data['NSP']))

Output:

{0, 1, 2}


Usually we need to check if there are missing values and decide how to handle them. Let's check if the data has any missing values. You can use the pandas describe() method to see if any columns have fewer items than the others.


Describe the dataset

data.describe()

Output:

(table: summary statistics for each column of the dataset)
Find the missing values

❓ What observations did you make?

✔ Observations:

  • We can see that the Mode column has only 2113 items while all other columns have 2126 items.

If there are missing values in the dataset, they are generally represented as NaN values. Let's check for NaN values.


pd.isna(data).sum()

Output:

LB          0
FM          0
ASTV        0
ALTV        0
Width       0
Nmax        0
Mode       13
Mean        0
Median      0
Variance    0
NSP         0
dtype: int64


The Mode column has 13 NaN values. We can find which instances/rows these correspond to:

data[pd.isna(data).any(axis=1)]

Output:

(table: the 13 rows where the Mode value is missing)

❓ What are the possible actions we can take?

Actions:

  • We can remove the rows above from the dataset. This leads to some loss of information, as we also lose the other attribute values in those rows (see the sketch after this list).

  • We can replace the missing values with zero, or with the mean of the column containing the missing values (see the sketch after this list). We need to check whether this is reasonable for the given attribute.

  • We can use one or more other features to predict the missing values and use those predictions.
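For reference, the first two options could look like this in pandas (a minimal sketch on copies of the DataFrame; the names data_dropped and data_filled are only for illustration and are not used below):

# Option 1: drop every row that contains a NaN value (loses those instances)
data_dropped = data.dropna()

# Option 2: fill the missing 'Mode' values with the column mean
data_filled = data.copy()
data_filled['Mode'] = data_filled['Mode'].fillna(data_filled['Mode'].mean())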


For this problem we can observe that Mode and Median (or Mean) have a very strong correlation (see the EDA results that appear later). Therefore we can use the value of Median to replace the missing values of Mode. In general we might have to train an ML model to predict the missing attribute (x: Median, y: Mode); however, for this problem we can directly replace the missing Mode values without building a model.


data.loc[pd.isna(data['Mode']), 'Mode'] = data.loc[pd.isna(data['Mode']), 'Median']
pd.isna(data).sum()

Output:

LB          0
FM          0
ASTV        0
ALTV        0
Width       0
Nmax        0
Mode        0
Mean        0
Median      0
Variance    0
NSP         0
dtype: int64
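Had the direct replacement not been appropriate, the model-based imputation mentioned above could be sketched as follows (an illustrative sketch only, assuming a simple linear regression from Median to Mode; since we have already filled the values directly, the prediction step would not run here):

from sklearn.linear_model import LinearRegression

# Fit a regressor on the rows where 'Mode' is known (x: Median, y: Mode)
known = data[data['Mode'].notna()]
reg = LinearRegression().fit(known[['Median']], known['Mode'])

# Predict 'Mode' only for the rows where it is still missing
missing_mask = data['Mode'].isna()
if missing_mask.any():
    data.loc[missing_mask, 'Mode'] = reg.predict(data.loc[missing_mask, ['Median']])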


EDA

In the following, I show several techniques that can be used to analyse this dataset. However, this is not an exhaustive set of techniques.


☞ Task: We have learned about doing an EDA in lab 02, and you are left to explore the possible techniques for this problem. Do not limit yourself to the techniques presented in class.


Let's first see if there are patterns in scatter plots of two variables at a time.


import seaborn as sns

g = sns.PairGrid(data, vars=['LB','FM' , 'ASTV' , 'ALTV', 'Width', 
                             'Nmax', 'Mode', 'Mean', 'Median', 'Variance'], hue="NSP")
g.map(sns.scatterplot)
plt.show()

Output:

(figure: pair-wise scatter plots of the selected features, coloured by NSP)
❓ What observations did you make?

Observations:

  • Some plots show that a non-linear decision boundary might be able to separate the classes, e.g. ASTV vs ALTV.

  • Some plots show that a linear decision boundary might be able to separate the classes, e.g. Median vs Variance.


Let's also observe the correlation plot.

f, ax = plt.subplots(figsize=(11, 9))
corr = data.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=90,
    horizontalalignment='right'
);

Output:

(figure: correlation heatmap of the dataset columns)
❓ What observations did you make?

Since we have discussed this in class, I will leave this as an exercise for you. Discuss with the lab demonstrator in class.
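As a numeric complement to the heatmap, we can also print the correlations of the Mode column directly; this is the observation that justified replacing the missing Mode values with Median earlier (a small sketch reusing the corr matrix computed above):

# Correlation of 'Mode' with every other column, strongest first
print(corr['Mode'].sort_values(ascending=False))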

Another interesting thing to look at is the class distribution.

data['NSP'].hist(figsize=(5,5))
plt.xlabel('NSP')
plt.ylabel('frequency')
plt.show()

Output:

(figure: histogram of the NSP class labels)
❓ What observations did you make? Since we have discussed this in class, I will leave this as an exercise for you. Discuss with the lab demonstrator in class.
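If you want to quantify what the histogram suggests, a quick value_counts() gives the exact class frequencies (a minimal sketch):

# Absolute and relative class frequencies
print(data['NSP'].value_counts())
print(data['NSP'].value_counts(normalize=True))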

Typical model development process

As discussed in the lecture, the typical ML model development process consists of 4 steps. Let's go through each step and see how it is done.

  1. Determine your goals: Performance metric and target value. Problem dependent.

  2. Set up the experiment: set up the test/validation data, visualisers and debuggers needed to determine bottlenecks in performance (overfitting/under-fitting, feature importance).

  3. Default baseline model: identify the components of the end-to-end pipeline, including baseline models, cost functions and optimisation.

  4. Make incremental changes: Repeatedly make incremental changes such as gathering new data, adjusting hyper-parameters, or changing algorithms, based on specific findings from your instrumentation.

Setting up the performance (evaluation) metric

There are many performance metrics that apply to this problem, such as accuracy_score, f1_score, etc. More information on the performance metrics available in sklearn can be found at: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

The insights gained in the EDA become vital in determining the performance metric. Try to identify, from the EDA results, the characteristics that are important in making this decision. Use your judgment to pick the best performance measure, and discuss with the lab demonstrator to see if the performance measure you came up with is appropriate.

In this task I want to give equal importance to all three classes, therefore I will select the macro-averaged f1_score as my performance measure, and I wish to achieve a target value of 85% f1_score. The F1-score is NOT the only performance measure that can be used for this problem.

☞ Task: Read this article on f1_score.
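To see why macro averaging gives equal weight to every class, here is a toy illustration with made-up labels (not the lab data): a classifier that ignores a rare class can still have high accuracy, while its macro-averaged f1_score drops.

from sklearn.metrics import accuracy_score, f1_score

# Toy labels: class 2 is rare and this classifier never predicts it
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

print(accuracy_score(y_true, y_pred))             # 0.9 - looks good
print(f1_score(y_true, y_pred, average='macro'))  # ~0.62 - the ignored class drags it down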

Setup the experiment - data splits

Next, what data should we use to evaluate the performance? We can generate "simulated" unseen data using several methods:

  1. Hold-Out validation

  2. Cross-Validation

Usually you will select the technique that is most appropriate for the dataset given to you. However, as we are interested in learning about the techniques, let's look at both.

Hold-out validation

In hold-out validation we divide the data into 3 subsets:

  1. Training: to obtain the parameters or the weights of the hypothesis

  2. Validation: for tuning hyper-parameters and model selection.

  3. Test: to evaluate the performance of the developed model. DO NOT use this split to set or tune any element of the model.

For this example let's divide the data 60/20/20 (train/validation/test): we first hold out 20% for testing, then take 25% of the remaining 80% (i.e. 20% of the original data) for validation.


from sklearn.model_selection import train_test_split

with pd.option_context('mode.chained_assignment', None):
    train_data_, test_data = train_test_split(data, test_size=0.2, 
                                              shuffle=True,random_state=0)
    
with pd.option_context('mode.chained_assignment', None):
    train_data, val_data = train_test_split(train_data_, test_size=0.25, 
                                            shuffle=True,random_state=0)
    
print(train_data.shape[0], val_data.shape[0], test_data.shape[0])

Output:

1275 425 426



Let's convert the data to NumPy arrays.

train_X = train_data.drop(['NSP',], axis=1).to_numpy()
train_y = train_data[['NSP']].to_numpy()

test_X = test_data.drop(['NSP',], axis=1).to_numpy()
test_y = test_data[['NSP']].to_numpy()

val_X = val_data.drop(['NSP',], axis=1).to_numpy()
val_y = val_data[['NSP']].to_numpy()


Let's set up a helper function to compute the performance.


from sklearn.metrics import f1_score

def get_f1_scores(clf, train_X, train_y, val_X, val_y):
    train_pred = clf.predict(train_X)
    val_pred = clf.predict(val_X)
    
    train_f1 = f1_score(train_y, train_pred, average='macro')
    val_f1 = f1_score(val_y, val_pred, average='macro')
    
    return train_f1, val_f1

Baseline model

We need to select a baseline model for this task. I am going to select regularised polynomial logistic regression for this example.


There are better models than this; however, logistic regression is the only technique we know so far that can be used for this problem, so our choices are limited and the decision is simple. If we had other options, we would need to use our knowledge of those techniques and the EDA to select the best base model.


The polynomial model is justified because in the EDA we saw that a non-linear decision boundary may be needed to separate the classes. Regularisation is justified because we have correlated attributes, and because in the EDA some features already appeared separable with a linear decision boundary (so the polynomial model may be more flexible than necessary).



from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
poly.fit(train_X)
train_X = poly.transform(train_X)
test_X = poly.transform(test_X)
val_X = poly.transform(val_X)
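As a quick sanity check (assuming the 10 feature columns used above): a degree-3 polynomial expansion, including the bias column, produces C(10+3, 3) = 286 output features, so the transformed matrices should now have 286 columns.

# The fitted transformer reports how many features it generates
print(poly.n_output_features_)   # 286
print(train_X.shape)             # (1275, 286)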

When using polynomial features it is very important to scale them. Let's do a min-max normalisation. Again, you can leverage the EDA to select the appropriate scaling mechanism. Note that the scaler is fitted on the training split only and then applied to the validation and test splits, to avoid leaking information from the held-out data.


from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(train_X)

train_X = scaler.transform(train_X)
val_X = scaler.transform(val_X)
test_X = scaler.transform(test_X)
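A quick check that the transform behaved as expected (the training features should now lie in [0, 1]; the validation and test features may fall slightly outside that range, since the scaler was fitted on the training split only):

print(train_X.min(), train_X.max())   # 0.0 1.0 for the training split
print(val_X.min(), val_X.max())       # may fall slightly outside [0, 1]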

Let's first check the un-regularised model, just to see if everything is in order. You will notice a warning saying that max_iter was reached.


Ideally we would increase the maximum number of iterations and see if that solves the problem. For now, let's ignore the warning.


from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, penalty='none', solver='saga', 
                         max_iter=1000, 
                         class_weight='balanced').fit(train_X, train_y.ravel())

train_f1, val_f1 = get_f1_scores(clf, train_X, train_y, val_X, val_y)
print("Train F1-Score score: {:.3f}".format(train_f1))
print("Validation F1-Score score: {:.3f}".format(val_f1))


Train F1-Score: 0.858
Validation F1-Score: 0.779

C:\Users\Farid\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:329: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge warnings.warn("The max_iter was reached which means "



For this task the baseline model achieved good training performance. However, we can see a gap between the training F1-score and the validation F1-score (the generalisation gap). What can we do when there is a gap between train and validation performance?

  • We can apply regularisation. The process is important: we start with a base model and then improve it based on our observations.

Apply regularisation

When applying regularisation we need to select the lambda value. For this we can use:

  1. Grid search

  2. Random search

We will do a grid search in this example. In a grid search we establish a set of lambda values on a grid. Selecting the range of lambda values is mostly done by trial and error. Once we select a set of lambda values, we train a classifier for each of those values and evaluate its performance.


lambda_paras = np.logspace(-5, 1, num=25) # establish the lambda values to test (grid)



# Then search
train_performace = list()
valid_performace = list()

for lambda_para in lambda_paras:
    clf = LogisticRegression(penalty='l2', C = 1.0/lambda_para, 
                             random_state=0, solver='liblinear', max_iter=1000 , 
                             class_weight='balanced').fit(train_X, train_y.ravel())
    
    train_f1, val_f1 = get_f1_scores(clf, train_X, train_y, val_X, val_y)
    
    train_performace.append(train_f1)
    valid_performace.append(val_f1)
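For comparison, the random-search alternative mentioned above could be sketched as follows (a sketch only, sampling lambda values log-uniformly over the same range instead of using a fixed grid; it is not used in the rest of this lab):

# Random search: sample 25 lambda values log-uniformly between 1e-5 and 1e1
rng = np.random.default_rng(0)
random_lambdas = 10.0 ** rng.uniform(-5, 1, size=25)

random_valid_f1 = []
for lambda_para in random_lambdas:
    clf = LogisticRegression(penalty='l2', C=1.0/lambda_para,
                             random_state=0, solver='liblinear', max_iter=1000,
                             class_weight='balanced').fit(train_X, train_y.ravel())
    _, val_f1 = get_f1_scores(clf, train_X, train_y, val_X, val_y)
    random_valid_f1.append(val_f1)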



Now let's plot the training and validation performance for each lambda value in our set and see which lambda value is best. You might have to repeat the process of selecting lambda values if the results are not as expected.


plt.plot([1.0/lambda_para for lambda_para in lambda_paras], 
         [tp for tp in train_performace], 'r-')
plt.plot([1.0/lambda_para for l