Unsupervised Learning
In the blog we learned two commonly used unsupervised learning algorithms:
Principal Component Analysis
K-Means
In this exercise we will use both of them to cluster the Iris dataset.
Iris Dataset
The Iris dataset is a very popular dataset in the machine learning community. Developed by Fisher, it contains 3 classes with 50 instances each. Each of the three classes refers to a type of Iris plant. Each data point consists of four attributes and a class label:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
Iris Setosa
Iris Versicolour
Iris Virginica
#@title Import Necessary Modules
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Exercise 1: Download Data
Download the data using the tf.keras.utils.get_file() function. Pass two arguments to the function: the file name and the URL containing the data.
For the training data, the file name (fname) is "iris_training.csv" and the URL is "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv"
For the test data, the file name (fname) is "iris_test.csv" and the URL is "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv"
For more information and usage you can refer to TensorFlow documentation.
# To do: complete the code
## Replace the ... with the right code
train_path = tf.keras.utils.get_file(...)
test_path = tf.keras.utils.get_file(...)
# ----------------- Do not change anything below ------------------------------------#
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']
train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
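The two pd.read_csv calls above combine names= with header=0. A minimal sketch of what that combination does, using a hypothetical in-memory CSV (the rows below are illustrative stand-ins, and the assumption is that the real files also start with a single header row):

```python
import io
import pandas as pd

# Hypothetical stand-in for an Iris CSV file: one header row,
# then four measurements and an integer class label per row.
sample_csv = io.StringIO(
    "120,4,setosa,versicolor,virginica\n"
    "6.4,2.8,5.6,2.2,2\n"
    "5.0,2.3,3.3,1.0,1\n"
    "4.9,2.5,4.5,1.7,2\n"
)
columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']

# header=0 tells pandas the first row is a header; names= then replaces
# that header with our own column names instead of treating it as data.
sample = pd.read_csv(sample_csv, names=columns, header=0)

print(sample.shape)
print(list(sample.columns))
```

Without header=0, the first row would be read as a data row, so the column names would no longer line up with clean numeric data.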
Exercise 2: Data Analysis and Visualization
Analyze the data. Analyzing data is a very important skill. You can start with simple information such as the size of the train and test datasets, then move on to inspecting a random sample of rows and plotting different features against each other. Here is an example dataset visualization notebook for Iris Dataset. You can take some ideas from here.
As part of the exercise we expect you to perform at least three analyses/visualizations. You can use as many code cells as you need.
Remember train and test are Pandas DataFrames.
## To do: write code to analyze and visualize the data
# You can add as many code cells as you require.
# Below, for example, we show the top 5 samples from the training dataset.
train.head(5)
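A few more analyses of the kind the exercise asks for can be sketched as follows. The DataFrame `toy` is a hypothetical miniature stand-in for `train` with made-up values; the same calls should work on the real DataFrame, assuming it has the column names defined earlier.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `train` DataFrame (made-up values).
toy = pd.DataFrame({
    'SepalLength': [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
    'SepalWidth':  [3.5, 3.0, 3.2, 3.1, 3.3, 2.7],
    'PetalLength': [1.4, 1.4, 4.5, 4.9, 6.0, 5.1],
    'PetalWidth':  [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
    'Species':     [0, 0, 1, 1, 2, 2],
})

print(toy.shape)                        # number of rows and columns
class_counts = toy['Species'].value_counts()
print(class_counts)                     # how many samples each class has
print(toy.describe())                   # per-column summary statistics

# Mean of every feature within each class: a quick way to see
# which features separate the classes.
per_class_mean = toy.groupby('Species').mean()
print(per_class_mean)
```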
Exercise 3: Preprocess the data
Implement the following steps:
Drop the label, Species. Since we are doing unsupervised learning we do not need labels. You may want to save the labels separately to verify whether your model has been able to cluster properly.
PCA works best when each feature has zero mean and a variance of 1. To achieve this, subtract the mean of each feature and divide by its standard deviation.
## Drop the labels
train.drop(...)
test.drop(...)
# Subtract mean from individual value and divide by standard deviation
normalized_train=...
normalized_test=...
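The standardization step can be illustrated on synthetic data. This is a sketch with NumPy arrays rather than the exercise's DataFrames; the names `toy_train` and `toy_test` are hypothetical, and reusing the training statistics for the test set is one common choice, not something the exercise prescribes.

```python
import numpy as np

# Synthetic stand-ins for the feature matrices (4 features each).
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(20, 4))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(5, 4))

# Per-feature statistics, computed on the training data only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

# Subtract the mean and divide by the standard deviation.
toy_train_std = (toy_train - mean) / std
# Assumption: the test set is scaled with the training statistics.
toy_test_std = (toy_test - mean) / std

print(toy_train_std.mean(axis=0))  # approximately 0 for every feature
print(toy_train_std.std(axis=0))   # approximately 1 for every feature
```

Note that NumPy's std() divides by N by default, while pandas' DataFrame.std() divides by N - 1, so the two give slightly different scalings on small datasets.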
Exercise 4: Compute the S, U and V matrices using the TensorFlow tf.linalg.svd() function. tf.linalg.svd() returns the singular values S as a vector, so once you have S, U and V, convert S to a diagonal matrix using
tf.linalg.diag()
# Compute the S, U and V matrices
s, u, v = tf.linalg.svd(...)
s = tf.linalg.diag(...)
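To see what the three factors mean, here is a small sketch that uses NumPy as a stand-in for TensorFlow on a random matrix. One difference to keep in mind: np.linalg.svd returns (u, s, vt) with V already transposed, whereas tf.linalg.svd returns (s, u, v) with V not transposed.

```python
import numpy as np

# Random 6 x 4 matrix standing in for the normalized data.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))

# Reduced SVD: u is 6 x 4, s holds the 4 singular values, vt is 4 x 4.
u, s, vt = np.linalg.svd(x, full_matrices=False)

# The singular values come back as a vector; place them on a diagonal.
s_diag = np.diag(s)

# Multiplying the three factors recovers the original matrix.
reconstructed = u @ s_diag @ vt
print(np.allclose(x, reconstructed))
```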
Exercise 5: Now compute the PCA for 2 principal components. See how the shape changes between the original dataset and the PCA-reduced dataset.
k = 2
pca = tf.matmul(...)
print('original data shape',train.shape)
print('reduced data shape', pca.shape)
output:
original data shape (120, 4)
reduced data shape (120, 2)
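The reduction to k components can be sketched in NumPy on synthetic data. This is an illustration of the general technique, not necessarily the exact expression the exercise expects: the projected data is the first k columns of U scaled by the first k singular values, which equals the centered data multiplied by the first k right singular vectors.

```python
import numpy as np

# Synthetic centered data: 10 samples, 4 features.
rng = np.random.default_rng(1)
x = rng.normal(size=(10, 4))
x = x - x.mean(axis=0)

u, s, vt = np.linalg.svd(x, full_matrices=False)

k = 2
# Projection onto the first k principal components.
scores = u[:, :k] @ np.diag(s[:k])
# Equivalent form: project the data onto the first k right singular vectors.
scores_alt = x @ vt[:k].T

# Fraction of the total variance kept by the first k components.
explained = (s[:k] ** 2).sum() / (s ** 2).sum()

print(x.shape, scores.shape)
print(np.allclose(scores, scores_alt))
```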
Exercise 6: Let us plot and see if our PCA is able to cluster the dataset.
plt.scatter(...)
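If the labels were saved separately in Exercise 3, they can be used to color the scatter plot and check how well the classes separate. A minimal sketch on synthetic 2-D points follows; `points` and `labels` are hypothetical stand-ins for the PCA output and the saved Species column.

```python
import matplotlib.pyplot as plt
import numpy as np

# Three synthetic blobs standing in for the 2-D PCA output.
rng = np.random.default_rng(2)
points = np.vstack(
    [rng.normal(loc=center, scale=0.3, size=(10, 2)) for center in (-2.0, 0.0, 2.0)]
)
# One integer label per point, standing in for the saved Species column.
labels = np.repeat([0, 1, 2], 10)

fig, ax = plt.subplots()
# c= colors each point by its label; cmap= picks the color map.
sc = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="viridis")
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("Synthetic example colored by class")
```

In a notebook the figure is displayed when the cell finishes; in a script, call plt.show() or fig.savefig(...) afterwards.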