Important Machine Learning Facts You Need to Know to Become a Machine Learning Expert

realcode4you
Aug 2, 2022

In this blog we provide some practice tasks that will help you deepen your understanding of machine learning concepts.

Fact 1) What if the data is imbalanced?

1. As part of this task you will observe how linear models behave when the data is imbalanced.

2. Observe how the hyperplane changes as you change the regularization strength C.

3. Below we create 4 random datasets that are linearly separable and have class imbalance.

4. In the first dataset the ratio of positive to negative points is 100:2, in the second it is 100:20, in the third 100:40 and in the fourth 100:80.

Import Necessary Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.svm import SVC
import warnings
warnings.filterwarnings("ignore")

def draw_line(coef,intercept, mi, ma):
    # for the separating hyperplane ax+by+c=0, the weights are [a, b] and the intercept is c
    # to draw the hyperplane we create two points by solving for x: ax+by+c=0 ==> ax = (-by-c) ==> x = (-by-c)/a
    # 1. ((-b*min-c)/a, min), i.e. y is set to its minimum value
    # 2. ((-b*max-c)/a, max), i.e. y is set to its maximum value
    # note: this assumes a (coef[0]) is non-zero
    points=np.array([[((-coef[1]*mi - intercept)/coef[0]), mi],[((-coef[1]*ma - intercept)/coef[0]), ma]])
    plt.plot(points[:,0], points[:,1])

# here we are creating 2d imbalanced data points 
ratios = [(100,2), (100, 20), (100, 40), (100, 80)]
plt.figure(figsize=(20,5))
for j,i in enumerate(ratios):
    plt.subplot(1, 4, j+1)
    X_p=np.random.normal(0,0.05,size=(i[0],2))
    X_n=np.random.normal(0.13,0.02,size=(i[1],2))
    y_p=np.array([1]*i[0]).reshape(-1,1)
    y_n=np.array([0]*i[1]).reshape(-1,1)
    X=np.vstack((X_p,X_n))
    y=np.vstack((y_p,y_n))
    plt.scatter(X_p[:,0],X_p[:,1])
    plt.scatter(X_n[:,0],X_n[:,1],color='red')
plt.show()

Output:

Your task is to apply SVM (sklearn.svm.SVC) and LR (sklearn.linear_model.LogisticRegression) with different regularization strengths C = [0.001, 1, 100].

Task 1: Applying SVM

1. You need to create a grid of plots like this:

In each cell[i][j] you will draw the hyperplane that you get after applying SVM to the i-th dataset with the j-th regularization strength C, i.e.

Plane(SVM().fit(D1, C=0.001))   Plane(SVM().fit(D1, C=1))   Plane(SVM().fit(D1, C=100))
Plane(SVM().fit(D2, C=0.001))   Plane(SVM().fit(D2, C=1))   Plane(SVM().fit(D2, C=100))
Plane(SVM().fit(D3, C=0.001))   Plane(SVM().fit(D3, C=1))   Plane(SVM().fit(D3, C=100))
Plane(SVM().fit(D4, C=0.001))   Plane(SVM().fit(D4, C=1))   Plane(SVM().fit(D4, C=100))

If you can, draw the support vectors in a different color, which will help in understanding the position of the hyperplane.

Write down, in your own words, your observations from the above plots and what you think about the position of the hyperplane.

Check the optimization problem here: https://scikit-learn.org/stable/modules/svm.html#mathematical-formulation

If you can, describe your understanding by writing it on paper and attaching a picture, or record a video and upload it with the assignment.
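The grid described above could be computed roughly as follows. This is only a minimal sketch: the helper name make_dataset, the fixed random seed and the dictionary svm_planes are assumptions made for illustration, and a linear kernel is assumed so that the fitted model exposes coef_ and intercept_.

```python
import numpy as np
from sklearn.svm import SVC


def make_dataset(n_pos, n_neg, rng):
    """Hypothetical helper: 2-D imbalanced data drawn like the generator above."""
    X_p = rng.normal(0, 0.05, size=(n_pos, 2))
    X_n = rng.normal(0.13, 0.02, size=(n_neg, 2))
    X = np.vstack((X_p, X_n))
    y = np.concatenate((np.ones(n_pos), np.zeros(n_neg)))
    return X, y


rng = np.random.default_rng(0)  # the seed is an assumption, used for repeatability
ratios = [(100, 2), (100, 20), (100, 40), (100, 80)]
C_values = [0.001, 1, 100]

# (dataset index, C) -> (weights [a, b], intercept c, support vectors)
svm_planes = {}
for i, (n_pos, n_neg) in enumerate(ratios):
    X, y = make_dataset(n_pos, n_neg, rng)
    for C in C_values:
        clf = SVC(kernel="linear", C=C).fit(X, y)
        svm_planes[(i, C)] = (clf.coef_[0], clf.intercept_[0], clf.support_vectors_)

for (i, C), (w, b, sv) in svm_planes.items():
    print(f"D{i + 1}, C={C}: w={w}, b={b:.4f}, support vectors={len(sv)}")
```

Each stored (weights, intercept) pair can be passed to the draw_line helper defined earlier inside plt.subplot(4, 3, ...), and the support vectors can be scattered on top in a separate color.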

Task 2: Applying LR

Do the same thing you did in Task 1, but apply logistic regression instead of SVM.

These are the results we got when experimenting with one of the models:
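For the logistic regression variant, one possible sketch is shown below. The data generation repeats the recipe from the dataset cell above; the seed, the dictionary lr_planes and the extra minority-class recall column are assumptions added for illustration, not part of the original task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # the seed is an assumption
ratios = [(100, 2), (100, 20), (100, 40), (100, 80)]
C_values = [0.001, 1, 100]

# (dataset index, C) -> (weights, intercept, recall on the minority class)
lr_planes = {}
for i, (n_pos, n_neg) in enumerate(ratios):
    X = np.vstack((rng.normal(0, 0.05, size=(n_pos, 2)),
                   rng.normal(0.13, 0.02, size=(n_neg, 2))))
    y = np.concatenate((np.ones(n_pos), np.zeros(n_neg)))
    for C in C_values:
        clf = LogisticRegression(C=C).fit(X, y)
        # fraction of minority (label 0) points that are classified correctly
        minority_recall = float(np.mean(clf.predict(X[y == 0]) == 0))
        lr_planes[(i, C)] = (clf.coef_[0], clf.intercept_[0], minority_recall)

for (i, C), (w, b, rec) in lr_planes.items():
    print(f"D{i + 1}, C={C}: w={w}, b={b:.4f}, minority recall={rec:.2f}")
```

Printing the minority recall next to the weights makes it easier to relate the position of each hyperplane to how many of the rare points end up on the correct side.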

Fact 2) What if our features have different variances?

* As part of this task you will observe how linear models behave when the features have very different variances.

* From the standard deviations of the features (data.std()) you can observe that var(F2) >> var(F1) >> var(F3).

> Task 1:

1. Apply logistic regression (SGDClassifier with log loss) on 'data' and check the feature importance.

2. Apply SVM (SGDClassifier with hinge loss) on 'data' and check the feature importance.

> Task 2:

1. Apply logistic regression (SGDClassifier with log loss) on 'data' after standardization, i.e. standardization(data, column-wise): (column - mean(column)) / std(column), and check the feature importance.

2. Apply SVM (SGDClassifier with hinge loss) on 'data' after standardization, i.e. standardization(data, column-wise): (column - mean(column)) / std(column), and check the feature importance.

Import Necessary Packages

import numpy as np
import pandas as pd
import plotly
import plotly.figure_factory as ff
import plotly.graph_objs as go
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

Read Data

data = pd.read_csv('task_b.csv')
data=data.iloc[:,1:]
data.head()

output:

data.corr()['y']

output:

f1    0.067172
f2   -0.017944
f3    0.839060
y     1.000000
Name: y, dtype: float64

data.std()

output:

f1      488.195035
f2    10403.417325
f3        2.926662
y         0.501255
dtype: float64

X=data[['f1','f2','f3']].values
Y=data['y'].values
print(X.shape)
print(Y.shape)

output:

(200, 3)
(200,)

Fact 3) Collinear features and their effect on linear models

Import Necessary Packages

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt

Read Data

data = pd.read_csv('task_d.csv')
data.head()

output:

X = data.drop(['target'], axis=1).values
Y = data['target'].values

Doing a perturbation test to check for the presence of collinearity

Task 1: Logistic Regression

1. Finding the correlation between the features

a. check the correlation between the features

b. plot heat map of correlation matrix using seaborn heatmap

2. Finding the best model for the given data

a. Train Logistic regression on data(X,Y) that we have created in the above cell

b. Find the best hyperparameter alpha with hyperparameter tuning using k-fold cross validation (grid search CV or random search CV; make sure you choose alpha in log space)

c. Create a new logistic regression with the best alpha (search for how to get the best hyperparameter value) and name it 'best_model'

3. Getting the weights with the original data

a. train the 'best_model' with X, Y

b. Check the accuracy of the model 'best_model_accuracy'

c. Get the weights W using best_model.coef_

4. Modifying original data

a. Add noise (of order 10^-2) to each element of X to get the new dataset X' (X' = X + e)

b. Train the same 'best_model' with data (X', Y)

c. Check the accuracy of the model 'best_model_accuracy_edited'

d. Get the weights W' using best_model.coef_

5. Checking deviations in metric and weights

a. Find the difference between 'best_model_accuracy_edited' and 'best_model_accuracy'

b. Find the absolute change between each value of W and W' ==> |(W-W')|

c. Print the top 4 features with the highest % change in weights compared to the other features

Task 2: Linear SVM

1. Repeat steps 2, 3, 4 and 5 from Task 1 above, using a linear SVM instead of logistic regression.

Write down your observations based on the deviations of the weights in both logistic regression and linear SVM.

Fact 4) The effect of outliers on regression

Objective: Visualize the best-fit linear regression line for different scenarios

# you should not import any other packages
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.linear_model import SGDRegressor

import scipy as sp
import scipy.optimize
import scipy.special  # provides ellipeinc, which is used below

def angles_in_ellipse(num,a,b):
    assert(num > 0)
    assert(a < b)
    angles = 2 * np.pi * np.arange(num) / num
    if a != b:
        e = (1.0 - a ** 2.0 / b ** 2.0) ** 0.5
        tot_size = sp.special.ellipeinc(2.0 * np.pi, e)
        arc_size = tot_size / num
        arcs = np.arange(num) * arc_size
        res = sp.optimize.root(
            lambda x: (sp.special.ellipeinc(x, e) - arcs), angles)
        angles = res.x 
    return angles

a = 2
b = 9
n = 50

phi = angles_in_ellipse(n, a, b)
e = (1.0 - a ** 2.0 / b ** 2.0) ** 0.5
arcs = sp.special.ellipeinc(phi, e)

fig = plt.figure()
ax = fig.gca()
ax.axes.set_aspect('equal')
ax.scatter(b * np.sin(phi), a * np.cos(phi))
plt.show()

output:

X= b * np.sin(phi)
Y= a * np.cos(phi)

1. As part of this assignment you will work on a regression problem and see how regularization helps reduce the effect of outliers.

2. Use the X, Y created above for this experiment.

3. To do this task you can either implement your own SGD regression (preferred), exactly as in the "SGD assignment", with mean squared error, or use sklearn's SGDRegressor, for example "SGDRegressor(alpha=0.001, eta0=0.001, learning_rate='constant',random_state=0)". Note that you have to use a constant learning rate with the learning rate eta0 initialized.

4. As part of this experiment you will train your linear regression on the data (X, Y) with different regularizations alpha=[0.0001, 1, 100] and observe how the prediction hyperplane moves with respect to the outliers.

5. These are the results of one of the experiments we did (the title of the plot was intentionally not mentioned). In each iteration we added a single outlier and observed the movement of the hyperplane.

6. Please consider this list of outliers: [(0,2),(21, 13), (-23, -15), (22,14), (23, 14)]. In each tuple the first element is the input feature (X) and the second element is the output (Y).

7. For each regularizer, you need to add these outliers to the data one at a time and then train your model again on the updated data.

8. You should plot a 3x5 grid of subplots, where each row corresponds to the results of the model with a single regularizer.

9. Algorithm:

for each regularizer:
    for each outlier:
        # add the outlier to the data
        # fit the linear regression to the updated data
        # get the hyperplane
        # plot the hyperplane along with the data points

10. MAKE SURE YOU WRITE DETAILED OBSERVATIONS, AND PLEASE CHECK THE LOSS FUNCTION IN THE SKLEARN DOCUMENTATION (please do search for it).
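The algorithm in step 9 can be sketched as below, using sklearn's SGDRegressor with the constant learning rate from step 3. To keep the example self-contained, X and Y are a stand-in built from evenly spaced angles on the same ellipse (a=2, b=9) rather than the equal-arc-length points created earlier; that substitution, the cumulative way the outliers are added, and the dictionary name fits are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Stand-in for the X, Y above: same ellipse, but evenly spaced angles (assumption)
phi = 2 * np.pi * np.arange(50) / 50
X = 9 * np.sin(phi)
Y = 2 * np.cos(phi)

alphas = [0.0001, 1, 100]
outliers = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]

# (alpha, number of outliers added) -> (slope, intercept)
fits = {}
for alpha in alphas:
    X_cur, Y_cur = X.copy(), Y.copy()
    for k, (x_out, y_out) in enumerate(outliers, start=1):
        # add the outlier to the data (outliers accumulate within a row)
        X_cur = np.append(X_cur, x_out)
        Y_cur = np.append(Y_cur, y_out)
        # fit the linear regression to the updated data
        model = SGDRegressor(alpha=alpha, eta0=0.001,
                             learning_rate="constant", random_state=0)
        model.fit(X_cur.reshape(-1, 1), Y_cur)
        # get the hyperplane y = slope * x + intercept
        fits[(alpha, k)] = (model.coef_[0], model.intercept_[0])

for (alpha, k), (slope, intercept) in fits.items():
    print(f"alpha={alpha}, outliers={k}: slope={slope:.4f}, intercept={intercept:.4f}")
```

To produce the 3x5 grid, each entry of fits can be drawn in plt.subplot(3, 5, ...) as the line slope * x + intercept over a scatter of the points used for that fit.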