
Implement and evaluate k-Nearest Neighbor (k-NN) Classifiers for a Given dataset | Realcode4you

1. The k-NN classifier algorithms


1.1. Write two functions of your own to calculate the distance between two instances: one for numerical features and one for categorical features. Choose an appropriate distance metric for each, such as Euclidean distance for numerical features and Hamming distance for categorical features.


1.2. Mention the advantages and limitations of k-Nearest Neighbour (k-NN) classifiers.


You may use the Python programming language. Please provide clear and concise code, along with comments explaining your implementation.


2. Apply the k-NN classifiers to a dataset

Choose a suitable dataset from the “KNN Datasets” folder with numerical or categorical features and a target variable. Apply k-NN classifier to this dataset.


2.1. Preprocess the dataset, including handling missing values (if required) and encoding categorical features (if necessary), and visualize the data using a scatterplot. In the scatterplot, mark each class with a different colour.


2.2. Split the data into training and testing sets and perform feature scaling.


2.3. Train your k-NN classifiers using the training set and predict the classes for the testing set. Calculate the accuracy of your classifiers on the testing set.


2.4. Experiment with different values of k for the k-NN classifier and discuss how the choice of k affects the classifier's performance. Choose an optimal value of k based on your experiments.


Note that for k-NN classifiers, the training process is minimal, as it only involves storing the training instances. The classification is done during the prediction step by finding the nearest neighbor(s) for each test instance.
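The store-then-vote behaviour described in this note can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the function name `knn_predict`, the toy data, and the use of `math.dist` (Python 3.8+) as the Euclidean distance are all choices made for this example, not part of the assignment.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    # "Training" is just keeping train_X and train_y around; all the work happens here.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    # most_common(1) returns the label with the highest count among the k neighbours
    return Counter(k_labels).most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups in 2-D
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # a
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))  # b
```

With two classes, an odd k avoids tied votes; with more classes, ties are still possible and this sketch simply returns the tied label that was encountered first.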



#Importing Required Modules

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns


#1. The k-NN classifiers algorithms

1.1. Write two functions of your own to calculate the distance between two instances: one for numerical features and one for categorical features. Choose an appropriate distance metric for each, such as Euclidean distance for numerical features and Hamming distance for categorical features.


def euclidean_distance(instance1, instance2):
    """
    Calculate Euclidean distance between two instances with numerical features.

    Parameters:
    instance1 (list or numpy array): First instance with numerical features.
    instance2 (list or numpy array): Second instance with numerical features.

    Returns:
    float: Euclidean distance between the two instances.
    """
    instance1 = np.array(instance1)
    instance2 = np.array(instance2)
    return np.sqrt(np.sum((instance1 - instance2) ** 2))


The euclidean_distance function computes the Euclidean distance between two instances with numerical features. It starts by converting the input lists into numpy arrays for efficient computation. Then, it utilizes the Euclidean distance formula, which involves taking the square root of the sum of squared differences between corresponding features of the two instances. This function is suitable for continuous numerical data and is widely used in various machine learning algorithms.
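As a quick check of the formula, take the points (1, 2, 3) and (4, 6, 3): the feature differences are 3, 4 and 0, their squares sum to 25, and the square root is 5. The sketch below reproduces this with the standard library only; the name `euclidean_distance_plain` is hypothetical, and the explicit length check is an addition that the NumPy version does not perform.

```python
import math

def euclidean_distance_plain(instance1, instance2):
    """Square root of the sum of squared feature differences (standard library only)."""
    if len(instance1) != len(instance2):
        raise ValueError("Instances must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(instance1, instance2)))

p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

d = euclidean_distance_plain(p, q)
print(d)  # 5.0
```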


def hamming_distance(instance1, instance2):
    """
    Calculate Hamming distance between two instances with categorical features.

    Parameters:
    instance1 (list): First instance with categorical features.
    instance2 (list): Second instance with categorical features.

    Returns:
    int: Hamming distance between the two instances.
    """
    if len(instance1) != len(instance2):
        raise ValueError("Instances must have the same length")
    distance = 0
    for i in range(len(instance1)):
        if instance1[i] != instance2[i]:
            distance += 1
    return distance


The hamming_distance function calculates the Hamming distance between two instances with categorical features. It verifies that the instances have the same length and proceeds to count the number of differing features between them. Each differing feature increments the distance counter, providing a measure of dissimilarity between the categorical features of the instances. This function is particularly useful for categorical data, where the features represent discrete categories or labels.
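A short usage example may make the counting concrete. The sketch below restates the Hamming distance compactly so that it runs on its own, and adds a hypothetical helper, `normalized_hamming`, that divides by the number of features. The normalization is an assumption made here for illustration (it keeps the value in [0, 1], which can be convenient when categorical and scaled numerical distances are compared); it is not required by the task.

```python
def hamming_distance(instance1, instance2):
    """Count the positions at which two equal-length instances differ."""
    if len(instance1) != len(instance2):
        raise ValueError("Instances must have the same length")
    return sum(a != b for a, b in zip(instance1, instance2))

def normalized_hamming(instance1, instance2):
    """Hypothetical helper: fraction of features that differ, in [0, 1]."""
    if len(instance1) == 0:
        return 0.0
    return hamming_distance(instance1, instance2) / len(instance1)

# Toy categorical instances (hypothetical): colour, size, shape
x = ["red", "small", "round"]
y = ["red", "large", "round"]

print(hamming_distance(x, y))    # 1 (only the size differs)
print(normalized_hamming(x, y))  # 0.3333333333333333
```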


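1.2. Mention the advantages and limitations of k-Nearest Neighbour (k-NN) classifiers.

Advantages:

  • Simple and easy to understand.

  • No explicit training phase: the classifier only stores the training instances.

  • Effective for smaller datasets with fewer dimensions.

Limitations:

  • Computationally expensive at prediction time for large datasets, since each query is compared against the stored instances.

  • Sensitive to irrelevant features and to differences in feature scale.

  • Requires careful selection of the distance metric and the value of k.

  • Can perform poorly on imbalanced datasets, because the majority class tends to dominate the neighbour vote.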
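The sensitivity to feature scale can be seen in a small experiment. The sketch below uses three made-up training points with an income feature (large numbers) and an age feature (small numbers), and shows that the nearest neighbour of a query changes once both features are min-max scaled. All names and values are hypothetical and chosen only for illustration.

```python
import math

def nearest_index(points, query):
    """Index of the point closest to `query` under Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

def min_max_scale(points, reference):
    """Scale each feature using the min and range of `reference` (the training data)."""
    columns = list(zip(*reference))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid division by zero
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]

# Hypothetical training points: (income in dollars, age in years)
train = [(50500, 60), (51000, 26), (60000, 40)]
query = (50000, 25)

raw_nearest = nearest_index(train, query)

scaled_train = min_max_scale(train, train)
scaled_query = min_max_scale([query], train)[0]
scaled_nearest = nearest_index(scaled_train, scaled_query)

print(raw_nearest)     # 0: income dominates the raw distance
print(scaled_nearest)  # 1: after scaling, the point with a similar age is closest
```

The scaling parameters are taken from the training points only, and the query is transformed with those same parameters, which mirrors how a test set is normally scaled.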


#2. Apply the k-NN classifiers to a dataset (25 marks)

# Load the Iris dataset into a DataFrame
iris_df = pd.read_csv("iris.data", header=None)
# The file has no header row (header=None), so assign the column names explicitly:
iris_df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris_df.head()

The dataset chosen for this task is the Iris dataset. It is a classic dataset in machine learning and consists of 150 samples of iris flowers, with each sample containing four features: sepal length, sepal width, petal length, and petal width. The target variable is the species of the iris flower, which can take one of three classes: setosa, versicolor, or virginica.


iris_df.info()

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


Shape

print("Total Rows in dataset >> ",iris_df.shape[0])

print("Total Columns in dataset >> ",iris_df.shape[1])


output:

Total Rows in dataset >>  150
Total Columns in dataset >>  5


2.1. Preprocess the dataset, including handling missing values, and split the data into training and testing sets


# Check for missing or null values >>
iris_df.isnull().sum()

output:

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64


# Visualize the counts of class for our iris data >>

iris_df['class'].value_counts().plot(kind='bar')


output: [bar chart of the number of samples in each class]


# Visualize the data using scatterplot
plt.figure(figsize=(10, 6))  # Set the size of the plot
colors = ['red', 'blue', 'green']  # One colour per class
# Group by the class column so the plot works whether the labels are
# species names or integer codes
for (class_name, group), color in zip(iris_df.groupby('class'), colors):
    # Plot sepal length vs sepal width for each class
    plt.scatter(group['sepal_length'], group['sepal_width'],
                color=color, label=str(class_name))
plt.xlabel('Sepal Length')  # Set label for x-axis
plt.ylabel('Sepal Width')  # Set label for y-axis
plt.title('Sepal Length vs Sepal Width')  # Set title for the plot
plt.legend()  # Show legend for different classes
plt.show()  # Display the plot


output: [scatter plot of sepal length vs sepal width, one colour per class]
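The remaining steps of the task (splitting, feature scaling, prediction, accuracy, and trying several values of k) could be approached along the following lines. This is a hedged sketch rather than the original solution: every function name is hypothetical, the split fraction and seeds are arbitrary, and the data are a synthetic two-class stand-in generated on the spot so that the block runs without the Iris file.

```python
import math
import random
from collections import Counter

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the indices and split them into training and testing parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])

def fit_standardizer(X_train):
    """Per-feature mean and standard deviation, computed from the training set only."""
    columns = list(zip(*X_train))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return means, stds

def standardize(X, means, stds):
    """Apply (value - mean) / std to every feature of every row."""
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in X]

def knn_predict(X_train, y_train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    neighbours = sorted(zip(X_train, y_train), key=lambda pair: math.dist(pair[0], query))
    return Counter(label for _, label in neighbours[:k]).most_common(1)[0][0]

def accuracy(X_train, y_train, X_test, y_test, k):
    """Fraction of test instances whose predicted class matches the true class."""
    correct = sum(knn_predict(X_train, y_train, x, k) == t for x, t in zip(X_test, y_test))
    return correct / len(y_test)

# Synthetic stand-in data (not the Iris measurements): two Gaussian clusters
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)] + \
    [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(40)]
y = [0] * 40 + [1] * 40

X_train, X_test, y_train, y_test = train_test_split(X, y)
means, stds = fit_standardizer(X_train)
X_train_s = standardize(X_train, means, stds)
X_test_s = standardize(X_test, means, stds)  # reuse the training statistics

# Try several values of k and record the test accuracy for each
results = {k: accuracy(X_train_s, y_train, X_test_s, y_test, k) for k in (1, 3, 5, 7, 9)}
for k, acc in results.items():
    print(f"k={k}: accuracy={acc:.2f}")
```

To run the same steps on the Iris data, one option would be to build `X` from the four feature columns of `iris_df` (for example with `.values.tolist()`) and `y` from the `class` column. Very small values of k tend to follow noise in the training set, while very large values blur class boundaries, so the accuracy table is usually inspected across a range of k before one is chosen.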



