Random Sampling, Stratified Sampling, K-Means (Elbow), MDS in Machine Learning | ML Assignment Help

realcode4you
Apr 19, 2020
2 min read

House-price-Prediction

Import Libraries

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.manifold import MDS

Random Sampling

The random sampling approach uses train_test_split with a test size of 30% of the data and a random_state of 42, so the same split is produced on every run.

# x --> feature set, y --> target variable
x = df1.drop(['id', 'price'], axis=1)
y = df1['price']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)
print('shapes of training and test set')
x_train.shape, x_test.shape
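Since df1 is not shown being loaded here, the split can be tried on a small stand-in table. The following is only a sketch: the DataFrame demo and its columns (sqft_living, bedrooms) are made up for illustration and are not taken from the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df1: 100 synthetic rows with made-up columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": np.arange(100),
    "sqft_living": rng.integers(500, 4000, size=100),
    "bedrooms": rng.integers(1, 6, size=100),
    "price": rng.integers(100_000, 900_000, size=100),
})

features = demo.drop(["id", "price"], axis=1)
labels = demo["price"]

# test_size=0.30 holds out 30% of the rows; random_state fixes the shuffle.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.30, random_state=42
)
print(f_train.shape, f_test.shape)  # (70, 2) (30, 2)

# Repeating the call with the same random_state selects the same rows.
f_train_again, _, _, _ = train_test_split(
    features, labels, test_size=0.30, random_state=42
)
print(f_train.index.equals(f_train_again.index))  # True
```

Changing random_state gives a different but equally valid 70/30 split.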

Stratified Sampling

target = 'price'
X = df1.drop(target, axis='columns', inplace=False)
Y = df1[target]

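# Method 2: keep only the rows whose price occurs more than twice,
# so that every stratum has enough samples to be split.
price_counts = df1[target].value_counts()
df2 = df1[df1[target].isin(price_counts[price_counts > 2].index)]
y2 = df2[target]
# Drop the target from the features so that price does not leak into X2.
X2 = df2.drop(columns=target).fillna(0)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.33, random_state=42, stratify=y2)

X2_train.shape, X2_test.shape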
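Stratifying on raw prices only works for values that repeat, which is why rare prices are filtered out above. A different option, not used in this post, is to stratify on price bins instead. The sketch below assumes four quantile bins and uses a made-up table (demo) in place of df1.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df1 with a continuous price column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sqft_living": rng.integers(500, 4000, size=200),
    "price": rng.integers(100_000, 900_000, size=200),
})

# Assumption: stratify on 4 quantile bins of price, not on the raw values.
price_bin = pd.qcut(demo["price"], q=4, labels=False, duplicates="drop")

train, test = train_test_split(
    demo, test_size=0.33, random_state=42, stratify=price_bin
)

# Each bin should hold roughly the same share of rows in both splits.
train_share = price_bin.loc[train.index].value_counts(normalize=True).sort_index()
test_share = price_bin.loc[test.index].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

The two columns of the printed table should be close to each other, which is the point of stratification: the test set keeps the same price distribution as the training set.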

K-Means (Elbow)

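The elbow method is a popular technique for choosing the number of clusters. The idea is to run k-means clustering for a range of cluster counts k (say, from 1 to 10) and, for each value, compute the sum of squared distances from each point to its assigned center (the distortion). Plotting distortion against k, the point where the curve stops dropping sharply (the "elbow") is a reasonable choice for k.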

from matplotlib import style
from sklearn.cluster import KMeans

df1 = df1.drop('date', axis = 'columns', inplace = False)

distortions = []
K = range(1, 11)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df1)
    # inertia_ is the sum of squared distances to the closest cluster center
    distortions.append(kmeanModel.inertia_)

# plot

plt.figure(figsize=(8, 2))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
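To see what an elbow looks like when the right answer is known, the same loop can be run on synthetic data. This is a sketch only: the three blob centers, n_init=10 and random_state=0 are choices made for the illustration, not settings from the original notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated clusters (assumed for the demo).
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

# Drop in distortion gained by adding one more cluster.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 1))
```

The distortion falls steeply up to k = 3 and then flattens out, so the elbow sits at the true number of clusters. On real data such as the house prices the bend is usually less sharp.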

Dimension reduction on both the original data and the two types of reduced (sampled) data using PCA

# import libraries

from sklearn.decomposition import PCA
model = PCA()

# fit the model

model.fit(df1)

# transform the data

transformed = model.transform(df1)
print('Principal components: ', model.components_)

# PCA variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# note: df1 becomes a standardized NumPy array from here on
df1 = scaler.fit_transform(df1)
pca = PCA()
pca.fit_transform(df1)
pca_variance = pca.explained_variance_
plt.bar(range(pca.n_components_), pca_variance)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.show()

Intrinsic dimension

PCA can identify the intrinsic dimension of a dataset regardless of how many features the samples have.

Intrinsic dimension = the number of PCA features with significant variance.

To choose the intrinsic dimension, try each candidate number of components and keep the one that gives the best accuracy.

#color_list=['black','gray']
pca = PCA(n_components=3)
pca.fit(df1)
transformed = pca.transform(df1)
transformed.shape
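One common way to make "significant variance" concrete is to keep the smallest number of components whose cumulative explained variance ratio reaches a threshold. The sketch below uses a 95% threshold, which is an assumption and not a value from this post, on synthetic data built from two underlying factors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two hidden factors spread over five features, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0, 1.0, -1.0]])
data = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

scaled = StandardScaler().fit_transform(data)
pca = PCA().fit(scaled)

# Cumulative share of variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Assumed rule: smallest number of components reaching 95% of the variance.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print("estimated intrinsic dimension:", n_keep)
```

Here the estimate recovers the two factors the data was built from. For the house data, the same rule can give a starting value for n_components before comparing accuracy across candidates.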

I hope this helps you understand the basic flow of these data science concepts. If you face any other issue or need help with an assignment, you can send us your quote directly and we will help as soon as we can.

You can send your quote directly to the given mail: