Principal Component Analysis (PCA) Using the Wine Dataset

realcode4you
Jun 16, 2022

The pandas package allows you to handle complex tables of data of different types and time series. Click on the following cell. Then click the 'Run' button to import pandas and check its version.

import pandas as pd
pd.__version__

output:

'1.3.4'

Import the wine dataset

Import the dataset, wine.csv, in the following cell using the correct file name and path.

See the pandas.read_csv documentation for more information.

import pandas as pd

data = pd.read_csv('wine.csv')

You can use the shape attribute of a pandas DataFrame to check the dimensionality of the dataset, as shown in the following cell. It returns a (rows, columns) tuple.

data.shape

output:

(178, 14)

To get a rough idea of this data file's content, you can print its first or last rows using the commands shown in the following two cells. With no argument, head() and tail() return five rows. You can also pass an integer n in the brackets to choose the number of rows; if n exceeds the number of rows in the dataset, all rows are returned, and a negative n returns every row except the last |n| (for head) or the first |n| (for tail). More details can be seen in pandas.DataFrame.head and pandas.DataFrame.tail.

data.head(10)

output:

data.tail(8)

output:
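The behaviour of shape, head() and tail() can be checked on a small table. The sketch below uses a hypothetical six-row DataFrame named toy in place of wine.csv; the column names and values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy table standing in for wine.csv (6 rows, 2 columns).
toy = pd.DataFrame({"a": range(6), "b": list("uvwxyz")})

n_rows, n_cols = toy.shape     # shape is an attribute, so no parentheses
first_default = toy.head()     # first 5 rows by default
last_two = toy.tail(2)         # last 2 rows
all_but_last = toy.head(-1)    # negative n: every row except the last one
oversized = toy.head(100)      # n larger than the table returns all rows

print(n_rows, n_cols)
print(last_two)
```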

Access the data

One way to access the data imported with pandas is to use the iloc indexer. Run the following two cells and compare their output with the outputs of data.head() and data.tail(). Note that in Python, array (and matrix) indices count from zero. Inside the square brackets of iloc, the comma separates two parts: the first part selects rows; the second part selects columns.

first_two_rows = data.iloc[0:2, :]
first_two_rows

output:

first_two_columns = data.iloc[:, 0:2]
first_two_columns

output:

Slicing of arrays: getting and setting smaller segments within a larger dataframe.

To access a slice of a dataframe, you can use [start:stop:step] for each part in the pair of square brackets. The default values are start=0, stop=size of the dimension, step=1. The stop index is excluded, so [0:2, :] selects the first two rows (indices 0 and 1), stopping before the third row, with a step of 1. The second colon means to get all columns.
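The slicing rules can be tried out on a small table before applying them to the wine data. This is a minimal sketch on a hypothetical 5x4 DataFrame named toy, whose values simply encode their row and column positions.

```python
import pandas as pd

# Hypothetical 5x4 table; each value is 10 * row + column.
toy = pd.DataFrame([[10 * r + c for c in range(4)] for r in range(5)],
                   columns=["c0", "c1", "c2", "c3"])

rows_0_1 = toy.iloc[0:2, :]         # rows 0 and 1 (stop index 2 is excluded)
every_other_row = toy.iloc[::2, :]  # step of 2: rows 0, 2 and 4
last_column = toy.iloc[:, -1]       # a single integer returns a Series
last_two_cols = toy.iloc[:, -2:]    # a slice returns a DataFrame

print(rows_0_1)
print(list(last_column))
```

Negative indices count from the end, so the last column can be selected with -1 without knowing how many columns the table has.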

Get the first 13 columns in the dataframe.

# Get all features
Inputs = data.iloc[:, 0:13]

You can also pass a single integer in either part of the pair of square brackets to get one specific row or one specific column.

Get the last column in the dataframe.

# Get labels
Labels = data.iloc[:,13]
Labels

Output:

Normalising the data - do it for features only

Before doing a PCA analysis, you need to subtract the mean value from each feature. You can do this by applying StandardScaler from sklearn.preprocessing, which not only removes the mean but also scales each feature to unit variance. Run the following cell: the first line imports StandardScaler; the second line normalises the data using fit_transform.

from sklearn.preprocessing import StandardScaler
x1 = StandardScaler().fit_transform(Inputs)

There are two steps in fit_transform(). First, fit() is used to extract the mean value and the standard deviation from each feature. Then transform() is applied to remove the mean and scale the corresponding feature. You can also use fit() and transform() separately.

Normalise the data using fit() and transform() separately

statistics = StandardScaler().fit(Inputs)
x2 = statistics.transform(Inputs)
print(x2.mean(axis=0)) # mean of each feature after standardisation (approximately 0)
print(x2.std(axis=0)) # standard deviation of each feature after standardisation (should be 1)

output:

[-8.38280756e-16 -1.19754394e-16 -8.37033314e-16 -3.99181312e-17
 -3.99181312e-17  0.00000000e+00 -3.99181312e-16  3.59263181e-16
 -1.19754394e-16  2.49488320e-17  1.99590656e-16  3.19345050e-16
 -1.59672525e-16]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
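What fit() and transform() compute can be reproduced by hand with NumPy. The sketch below is an illustration on a hypothetical random feature matrix X (not the wine data): it checks that StandardScaler gives the same result as subtracting the column means and dividing by the column standard deviations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 50 samples, 3 features on different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -2.0, 100.0], scale=[1.0, 5.0, 20.0], size=(50, 3))

scaler = StandardScaler().fit(X)   # stores the per-feature mean and scale
x_sklearn = scaler.transform(X)

# The same computation by hand. StandardScaler uses the population
# standard deviation (ddof=0), which is also the default of ndarray.std().
x_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(x_sklearn, x_manual))
print(np.allclose(scaler.mean_, X.mean(axis=0)))
print(np.allclose(scaler.scale_, X.std(axis=0)))
```

The tiny non-zero means in the output above (around 1e-16) are floating-point rounding, not a sign that the centring failed.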

Do a principal component analysis (PCA). You can do it by applying PCA from sklearn.decomposition.

from sklearn.decomposition import PCA # import PCA

pca = PCA() # initialise a PCA instance
proj_wine = pca.fit_transform(x2) # fit() performs the eigen-decomposition; transform() projects the data into the PCA space
print(proj_wine.shape)

Similarly to StandardScaler(), methods fit() and transform() can also be used separately for PCA(). Run the following cell.

eigen_decom = PCA().fit(x2)
proj_wine = eigen_decom.transform(x2)
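To see what fit() and transform() amount to, the eigen-decomposition can be sketched by hand with NumPy and compared against PCA. This is a minimal illustration on a hypothetical random matrix X, not the author's method; scikit-learn itself obtains the same quantities through a singular value decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples, 4 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)   # PCA works on mean-centred data

# Eigen-decomposition of the sample covariance matrix (divides by n - 1).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
proj_manual = Xc @ eigvecs               # project onto the eigenvectors

pca = PCA().fit(X)
proj_sklearn = pca.transform(X)

print(np.allclose(eigvals, pca.explained_variance_))
# Eigenvectors are only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(proj_manual), np.abs(proj_sklearn)))
```

Because explained_variance_ divides by n - 1 while StandardScaler divides by n, the variances of 13 standardised features over 178 samples sum to 13 × 178/177 ≈ 13.07 rather than exactly 13.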

It is important to report how much variance has been captured in a PCA analysis. Run the following cell to print the variance explained by each principal component, followed by the same values expressed as a fraction of the total variance.

print(pca.explained_variance_)
print(pca.explained_variance_ratio_)

output:

[4.73243698 2.51108093 1.45424187 0.92416587 0.85804868 0.64528221
 0.55414147 0.35046627 0.29051203 0.25232001 0.22706428 0.16972374
 0.10396199]
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]

If we want to keep one decimal only in the results, we can use the round function from numpy as shown in the following cell.

import numpy as np

print(np.round(pca.explained_variance_,1))
print(np.round(pca.explained_variance_ratio_,1))

output:

[4.7 2.5 1.5 0.9 0.9 0.6 0.6 0.4 0.3 0.3 0.2 0.2 0.1]
[0.4 0.2 0.1 0.1 0.1 0. 0. 0. 0. 0. 0. 0. 0. ]

Compute how much variance is captured by the first two principal components.

var = np.sum(pca.explained_variance_[0:2])
print(var)
var_percentage = np.sum(pca.explained_variance_ratio_[0:2])*100
print(var_percentage,'%')

output:

7.243517907228706
55.40633835693526 %
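The same idea extends to any number of components by taking a cumulative sum. The sketch below reuses the explained_variance_ratio_ values printed earlier; the helper components_needed is a hypothetical name introduced here for illustration.

```python
import numpy as np

# explained_variance_ratio_ values copied from the output shown earlier.
ratio = np.array([0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294,
                  0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
                  0.01736836, 0.01298233, 0.00795215])

cumulative = np.cumsum(ratio)   # variance captured by the first k components


def components_needed(threshold):
    """Smallest number of leading components whose ratios sum to >= threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)


print(np.round(cumulative[:5], 3))
print(components_needed(0.80))  # 5
print(components_needed(0.90))  # 8
```

With these values, two components capture about 55% of the variance, five are needed to pass 80%, and eight to pass 90%.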

Scree plot: index of principal components against variance (variance percentage)

To produce a plot, you can use matplotlib.pyplot. Run the following cell to import it and draw the scree plot.

import matplotlib.pyplot as plt
figure = plt.figure()
ax = plt.gca()
plt.plot(pca.explained_variance_, color='red', linestyle='dotted')
ax.set_title("Scree plot")
ax.set_xlabel("Index of principal components")
ax.set_ylabel("The explained variance")

output:
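When plt.plot receives only the y-values, the x-axis starts at 0, so the first principal component is drawn at index 0. A possible variant, sketched here with the explained_variance_ values printed earlier, passes a 1-based component index explicitly and adds markers. The names fig_scree and ax_scree are assumptions for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# explained_variance_ values copied from the output shown earlier.
variance = np.array([4.73243698, 2.51108093, 1.45424187, 0.92416587,
                     0.85804868, 0.64528221, 0.55414147, 0.35046627,
                     0.29051203, 0.25232001, 0.22706428, 0.16972374,
                     0.10396199])
component_index = np.arange(1, len(variance) + 1)   # 1, 2, ..., 13

fig_scree, ax_scree = plt.subplots()
ax_scree.plot(component_index, variance,
              color="red", linestyle="dotted", marker="o")
ax_scree.set_xticks(component_index)   # one tick per component
ax_scree.set_title("Scree plot")
ax_scree.set_xlabel("Index of principal components")
ax_scree.set_ylabel("The explained variance")
```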

Data Visualisation using PCA

Produce a scatter plot of the first principal component against the second principal component. See the matplotlib.pyplot.scatter documentation for more details on how to use the scatter function.

figure = plt.figure()
ax = plt.gca()
plt.scatter(proj_wine[:,0],proj_wine[:,1], c=Labels, edgecolor='none', alpha=0.5)
ax.set_title("The PCA plot")
ax.set_xlabel("The first principal component")
ax.set_ylabel("The second principal component")

Output:

Produce a figure including two subplots in one row.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=2)
fig.tight_layout(pad=2.50) # set subplot spacing

plt.subplot(121) #rows, columns, index
dots_trn=plt.scatter(proj_wine[:,0],proj_wine[:,1], c=Labels, edgecolor='none', alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2')
classes=['C1', 'C2', 'C3']

# Optional legend: plt.legend(handles=dots_trn.legend_elements()[0], labels=classes)

plt.subplot(122) #rows, columns, index
plt.plot(pca.explained_variance_ratio_, color='red', linestyle='dotted')
plt.xlabel("Index of principal components")
plt.ylabel("The explained variance ratio")

output:
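The same two-panel figure can also be written against the Axes objects that plt.subplots returns, with the class legend switched on. This is a sketch under stated assumptions: proj, labels and ratio are hypothetical stand-ins for proj_wine, Labels and pca.explained_variance_ratio_, filled with made-up values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
proj = rng.normal(size=(30, 2))      # hypothetical 2-D projections
labels = np.repeat([1, 2, 3], 10)    # hypothetical class labels
ratio = np.array([0.6, 0.3, 0.1])    # hypothetical variance ratios

fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))

# Left panel: PCA scatter plot, coloured by class.
dots = ax_left.scatter(proj[:, 0], proj[:, 1], c=labels,
                       edgecolor="none", alpha=0.5)
ax_left.set_xlabel("PC1")
ax_left.set_ylabel("PC2")
handles, _ = dots.legend_elements()  # one handle per distinct label value
ax_left.legend(handles=handles, labels=["C1", "C2", "C3"])

# Right panel: explained variance ratio per component.
ax_right.plot(np.arange(1, len(ratio) + 1), ratio,
              color="red", linestyle="dotted")
ax_right.set_xlabel("Index of principal components")
ax_right.set_ylabel("The explained variance ratio")

fig.tight_layout(pad=2.5)  # called after drawing so labels are accounted for
```

Working with ax_left and ax_right directly avoids the extra plt.subplot(121) and plt.subplot(122) calls, since plt.subplots has already created both panels.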

Save the PCA plot to a file

You can save a figure using the savefig() command. For example, to save the PCA scatter figure you produced earlier, run the code in the following cell.

figure.savefig('pca_wine.png')

A file called pca_wine.png is saved in the current working directory. To check that it contains what you expect, run the code in the following cell.

from IPython.display import Image
Image('pca_wine.png')

output:

You can find the list of supported file types for your system by using the code shown in the following cell.

figure.canvas.get_supported_filetypes()

output:

{'eps': 'Encapsulated Postscript',
 'jpg': 'Joint Photographic Experts Group',
 'jpeg': 'Joint Photographic Experts Group',
 'pdf': 'Portable Document Format',
 'pgf': 'PGF code for LaTeX',
 'png': 'Portable Network Graphics',
 'ps': 'Postscript',
 'raw': 'Raw RGBA bitmap',
 'rgba': 'Raw RGBA bitmap',
 'svg': 'Scalable Vector Graphics',
 'svgz': 'Scalable Vector Graphics',
 'tif': 'Tagged Image File Format',
 'tiff': 'Tagged Image File Format'}
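savefig() also accepts options that control the resolution and the margins of the saved file. The sketch below is an illustration on a hypothetical figure: it writes a PNG into a temporary directory (the file name demo_plot.png is made up) and confirms that the file exists and that the format is supported.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical figure to save.
fig_demo, ax_demo = plt.subplots()
ax_demo.plot([0, 1, 2], [0, 1, 4])

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "demo_plot.png")   # hypothetical file name

# dpi sets the resolution; bbox_inches="tight" trims surrounding whitespace.
fig_demo.savefig(out_path, dpi=150, bbox_inches="tight")

supported = fig_demo.canvas.get_supported_filetypes()
print(os.path.exists(out_path), "png" in supported)
```

The output format is inferred from the file extension, so changing the name to end in .pdf or .svg saves a vector version instead.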