KNN can be used for both classification and regression predictive problems, though in industry it is more widely used for classification.
It is one of the simplest machine learning algorithms, based on the supervised learning technique.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can be easily assigned to a well-suited category using the K-NN algorithm.
Steps to do it:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance from the new data point to every point in the training data.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distances.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category with the maximum number of neighbours.
Step-6: Our model is ready (a minimal from-scratch sketch of these steps follows below).
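To make the steps concrete, here is a minimal from-scratch sketch. It is a toy illustration only: it assumes NumPy arrays for the training data, and the function name predict_knn is ours, not a library call.

# a toy from-scratch version of the steps above (assumes NumPy arrays)
import numpy as np
from collections import Counter

def predict_knn(X_train, y_train, x_new, k=3):
    # Step-2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step-3: indices of the K nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step-4 and Step-5: majority vote among those K neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]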
The “new data point” is the point whose category we want to determine. To find its category, we compute its distance to every existing data point using the Euclidean distance formula.
Euclidean distance formula: for two points p = (p1, ..., pn) and q = (q1, ..., qn),
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
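As a quick illustration, the same distance can be computed with NumPy (a small sketch with hypothetical feature vectors):

import numpy as np

# two example feature vectors (hypothetical values)
p = np.array([5.1, 3.5, 1.4, 0.2])
q = np.array([6.2, 2.9, 4.3, 1.3])

# Euclidean distance; equivalent to np.linalg.norm(p - q)
d = np.sqrt(((p - q) ** 2).sum())
print(d)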
Example: Using sklearn
#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Reading data
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

#Assign column names
col_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

#Creating pandas dataframe
dataset = pd.read_csv(path, names = col_names)
dataset.head()
Selecting the feature columns and the target column:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
#Split data: taking 60% training data and 40% testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)
#using StandardScaler to standardise the feature values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
After splitting the dataset into training and test sets, we instantiate the k-nearest neighbours classifier. Here we are using k = 8; you may vary the value of k and notice the change in the result.
#instantiate the model
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 8)
Next, we fit the model on the training data using the classifier's fit method:
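#train the classifier on the scaled training data
classifier.fit(X_train, y_train)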
Predicting the test data:
y_pred = classifier.predict(X_test)
Finding the score with a confusion matrix:
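A minimal way to do this with scikit-learn's metrics module:

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))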
Another method to determine the optimal K in KNN:
# loading libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# testing only 5 values of K to reduce running time; you can test up to 50
neighbors = list(range(1, 6))

# empty list that will hold cv scores
cv_scores = []

# perform 10-fold cross validation for each K
for k in neighbors:
    # instantiate learning model
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
# changing to misclassification error
import matplotlib.pyplot as plt
MSE = [1 - x for x in cv_scores]

# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)

# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
Finding the F1 score for different values of K:
# compute the F1-score for each K
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# empty list that will hold F1 scores
f1_scores = []

# testing K values from 1 to 4 to reduce running time
neighbors = list(range(1, 5))

for k in neighbors:
    # instantiate learning model
    knn = KNeighborsClassifier(n_neighbors=k)
    # fitting the model
    knn.fit(X_train, y_train)
    # predict the response
    y_pred = knn.predict(X_test)
    # f1_score based on k
    f1_scores.append(f1_score(y_test, y_pred, average='micro'))
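The K with the highest F1 score can then be reported, for example:

# report the K that achieved the highest F1 score
best_k = neighbors[f1_scores.index(max(f1_scores))]
print("Best K by F1 score: %d (F1 = %.3f)" % (best_k, max(f1_scores)))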