Objectives
Understand the dimensionality reduction problem
Use principal component analysis to solve the dimensionality reduction problem
Throughout this lecture we will be using the MNIST dataset. The MNIST dataset consists of thousands of images of handwritten digits from 0 to 9. The dataset is a standard benchmark in machine learning. Here is how to get the dataset from the tensorflow library:
# Import some basic libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context('paper')
# Import tensorflow
import tensorflow as tf
# Download the data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
The dataset comes with inputs (the images of the digits) and labels (the digit each image depicts). We are not going to use the labels in this lecture, as we will be doing unsupervised learning. Let's look at the dimensions of the training dataset:
x_train.shape
The training dataset is a 3D array. The first dimension is 60,000; this is the number of different images that we have. Each image consists of 28x28 pixels. Here is the first image in terms of numbers:
x_train[0]
Each number is a pixel value: think of zero as a white pixel and 255 as a black pixel. Values in between correspond to shades of gray. Here is how to visualize the first image:
plt.imshow(x_train[0], cmap=plt.cm.gray_r, interpolation='nearest')
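If you want to verify the range of the pixel values, here is a quick check (just a sketch; the dtype and the 0-255 range below are what I expect for this dataset):
# Check the data type and the range of the pixel values
print(x_train.dtype, x_train.min(), x_train.max())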
In this handout, I want to work with just the images of threes. So, let me keep all the threes and throw away the rest of the data:
threes = x_train[y_train == 3]
threes.shape
We have 6,131 threes. That's enough. Now, each image is a 28x28 matrix. We do not like that. We would like to have vectors instead of matrices. So, we need to vectorize the matrices. That's easy to do. We just have to reshape them.
vectorized_threes = threes.reshape((threes.shape[0], threes.shape[1] * threes.shape[2]))
vectorized_threes.shape
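To convince ourselves that vectorization loses no information, here is a quick sketch that folds the first vector back into a 28x28 image and visualizes it, just like before:
# Reshape the first vectorized three back to a 28x28 image and plot it
plt.imshow(vectorized_threes[0].reshape((28, 28)), cmap=plt.cm.gray_r, interpolation='nearest')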
Okay. You see that we now have 6,131 vectors each with 784 dimensions. That is our dataset. Let's apply PCA to it to reduce its dimensionality. We are going to use the PCA class of scikit-learn. Here is how to import the class:
from sklearn.decomposition import PCA
And here is how to initialize the model and fit it to the data:
pca = PCA(n_components=0.98, whiten=True).fit(vectorized_threes)
For the complete definition of the inputs to the PCA class, see its documentation. The particular parameters that I define above have the following effect:
n_components: If you set this to an integer, the PCA will have this many components. If you set it to a number between 0 and 1, say 0.98, then PCA will keep as many components as it needs in order to capture 98% of the variance of the data. I use the second type of input.
whiten: This ensures that the projections have unit variance. If you don't set it, the variance of each projection equals the corresponding eigenvalue. Setting whiten=True is consistent with the theory developed in the video. (See the quick check right after this list.)
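Here is the promised quick check (a sketch using the fitted pca object from above): because of whiten=True, the projections of the data on the principal components should have variance close to one.
# Project the data on the principal components
Z = pca.transform(vectorized_threes)
# Because of whiten=True, the standard deviation of each projection
# should be approximately one
print(Z.std(axis=0)[:5])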
Okay, so now that the model is trained let's investigate it. First, we asked PCA to keep enough components so that it can describe 98% of the variance. How many did it actually keep? Here is how to check this:
pca.n_components_
It kept 227 components. This doesn't look very impressive, but we will take it for now.
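As a sanity check (a sketch relying on the explained_variance_ratio_ attribute of scikit-learn's PCA), the retained components should indeed capture at least 98% of the variance, probably just above it:
# The fraction of the total variance captured by the retained components
print(np.sum(pca.explained_variance_ratio_))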
Now, let's focus on the eigenvalues of the covariance matrix. Here is how to get them:
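# In scikit-learn, the explained_variance_ attribute of a fitted PCA object
# holds the eigenvalues of the empirical covariance matrix that correspond
# to the retained components, sorted in decreasing order
eigenvalues = pca.explained_variance_
print(eigenvalues[:10])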