KNN can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry.

It is one of the simplest machine learning algorithms and belongs to the supervised learning family.

The K-NN algorithm stores all the available training data and classifies a new data point based on its similarity to the stored points.

This means that when new data appears, the K-NN algorithm can easily assign it to the best-suited category.

Steps to do it:

**Step-1:** Select the number K of neighbours.

**Step-2:** Calculate the Euclidean distance from the new data point to **every point in the training data**.

**Step-3:** Take the K nearest neighbours according to the calculated Euclidean distances.

**Step-4:** Among these K neighbours, count the number of data points in each category.

**Step-5:** Assign the new data point to the category that has the largest number of neighbours.

**Step-6:** Our model is ready.
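The steps above can be sketched in plain Python. This is only a minimal illustration, not how scikit-learn implements it; the function names `euclidean_distance` and `knn_predict` and the toy data are made up for this example.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: distance from the new point to every training point
    distances = [
        (euclidean_distance(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest training points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4 and 5: count the labels and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration: two well-separated groups
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
print(prediction)
```

Note that nothing is learned in advance: all the work happens at prediction time, which is why K-NN is often called a lazy learner.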

The “new data point” is the point whose category we want to find.

To find the category this point belongs to, we measure its distance to the training points using the Euclidean distance formula.

Euclidean distance formula:
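For two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ with $n$ features each, the standard definition of the Euclidean distance is given below. The symbols $p$, $q$ and $n$ are notation chosen here for illustration.

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

For the Iris data used below, $n = 4$ (sepal length, sepal width, petal length, petal width).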

__Example: Using sklearn__

```
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# URL of the Iris dataset
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Column names to assign (the file itself has no header row)
col_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read the data into a pandas DataFrame
dataset = pd.read_csv(path, names=col_names)
dataset.head()
```

Separating the feature columns (X) from the target column (y):

```
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
```

```
# Split the data: 60% for training and 40% for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)
```

```
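# Standardize the numeric features (zero mean, unit variance) so that no single
# feature dominates the distance; the scaler is fitted on the training data only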
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

After splitting the dataset into training and test sets, we instantiate the k-nearest neighbours classifier. Here we use K = 8; you can vary the value of K and observe how the result changes.

```
# Instantiate the classifier with K = 8
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 8)
```

Next, we fit the model to the training data using the `fit` method:

`classifier.fit(X_train, y_train)`

__Output:__

__Predicting the test data:__

`y_pred = classifier.predict(X_test)`

__Finding Score with confusion matrix:__
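One possible way to do this is with `confusion_matrix`, `classification_report` and `accuracy_score` from `sklearn.metrics`. The sketch below is self-contained, so it repeats the earlier steps in compact form. As assumptions of this example, it loads scikit-learn's bundled copy of Iris through `load_iris` instead of downloading from the UCI URL, and it fixes `random_state=0` so the split is reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris copy instead of the UCI download used above
X, y = load_iris(return_X_y=True)

# Same 60/40 split as above; random_state is fixed here only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=0)

# Standardize using statistics from the training data only
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit the classifier and predict the test set
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print(cm)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy)
```

The diagonal of the confusion matrix counts the correctly classified samples, so the accuracy equals the sum of the diagonal divided by the total number of test samples.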

Determining the optimal K in KNN using cross-validation:

```
# Load libraries (sklearn.cross_validation was removed; use sklearn.model_selection)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Only K = 1..5 is tested to keep the running time short; you can widen the range, e.g. range(1, 51)
neighbors = list(range(1, 6))
# Empty list that will hold the mean cross-validation scores
cv_scores = []
# Perform 10-fold cross-validation for each K
for k in neighbors:
    # Instantiate the learning model
    knn = KNeighborsClassifier(n_neighbors=k)
    # cross_val_score fits a fresh copy of the model on each fold, so no prior fit is needed
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
```

```
# Convert accuracy to misclassification error
import matplotlib.pyplot as plt
# In a Jupyter notebook, run `%matplotlib inline` first to display plots inline
misclassification_error = [1 - x for x in cv_scores]
# Determine the best K (the one with the lowest error)
optimal_k = neighbors[misclassification_error.index(min(misclassification_error))]
print("The optimal number of neighbors is %d" % optimal_k)
# Plot misclassification error vs K
plt.plot(neighbors, misclassification_error)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
```

Finding the F1 score on the test set for several values of K:

```
# Compute the F1 score on the test set for each K
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

f1_scores = []
# Testing K = 1..4; widen the range (e.g. range(1, 51)) to test more values
neighbors = list(range(1, 5))
for k in neighbors:
    # Instantiate the learning model
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the model
    knn.fit(X_train, y_train)
    # Predict the response
    y_pred = knn.predict(X_test)
    # Micro-averaged F1 score for this K
    f1_scores.append(f1_score(y_test, y_pred, average='micro'))
```

`print(f1_scores)`