Exploratory Data Analysis (EDA) In Python Machine Learning

Introduction

In this lab we get some initial experience with some of the main Python tools for this course, including NumPy, Matplotlib and pandas. We also load some datasets, compute some basic statistics on them and plot them.


Before you start, note that this lab assumes you are familiar with Python.


The lab can be executed either on your own machine (with an Anaconda installation) or on AWS.



Objective

  • Continue to familiarise yourself with Python and AWS

  • Load a dataset and examine it

  • Learn to compute basic statistics to understand the dataset better

  • Plot the dataset to investigate it visually


Dataset

We examine two regression datasets in this lab. The first concerns house prices and some factors associated with them, with the goal of predicting house prices. The second concerns predicting the number of shared bikes hired each day in Washington, D.C., USA, based on the time of year, day of the week and weather factors. These datasets are available in housing.data.csv and bikeShareDay.csv in the code repository.

First, ensure the two data files are located within the Jupyter workspace.

  • If you are on a local machine, copy the two data directories (BostonHousingPrice, Bike-Sharing-Dataset) to your current folder.

  • If you are on AWS, you can upload the data to the notebook instance by clicking the upload files icon on the left sidebar.

Task: Open the csv files in your favourite spreadsheet software (e.g. Excel) and observe the data.


Load dataset to Python Notebook

Next we examine how to load these into Python and Jupyter notebooks. We will first analyse the House prices dataset, then you’ll repeat the process to analyse the bike hire dataset.


First we need to import a few packages that will be used for data loading and analysis. In a Python notebook you can import a package just before it is first used (there is no need to import everything at the start of the program).


Pandas is a great Python package for loading data. We will use Matplotlib to visualise some of the distributions. NumPy is a numerical library that provides arrays (including matrices) and many useful mathematical functions.


Importing Libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Next, we use pandas to load the house price dataset:

bostonHouseFrame = pd.read_csv('./BostonHousingPrice/housing.data.csv', delimiter=r'\s+')

This assumes the dataset is in ./BostonHousingPrice/housing.data.csv; replace this with the relative or absolute path to your file. We strongly encourage you to look up the documentation of the functions we use in the lab; in this case, examine the pandas read_csv documentation.
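
As a small self-contained illustration of how a whitespace delimiter behaves, the sketch below parses a few lines of in-memory text instead of the real file. The three sample rows and the name sample_frame are made up for this example and are not taken from housing.data.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration.
sample_text = """CRIM   RM     MEDV
0.10   6.5    24.0
0.20   6.1    21.6
0.30   7.2    34.7
"""

# sep=r'\s+' treats any run of spaces or tabs as a single separator.
# (delimiter is an alias of sep in pandas.read_csv.)
sample_frame = pd.read_csv(io.StringIO(sample_text), sep=r'\s+')

print(sample_frame.shape)
print(sample_frame.columns.tolist())
```

The raw-string prefix (r'...') keeps Python from interpreting the backslash in \s as an escape sequence.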


The read_csv() command loads the input file, a text file whose columns are separated by whitespace (the delimiter pattern \s+ matches one or more whitespace characters), into a pandas dataframe (which can be thought of as a table). A dataframe can store the column names as well as the data. Examine what has been loaded into the dataframe bostonHouseFrame.


print(bostonHouseFrame)

Output:












If you only want to check the first few rows of the dataframe to see whether you have read the data in correctly, you can use the dataframe's head method.

bostonHouseFrame.head(3)

Output:





☞ Task: Familiarize yourself with dataframes.

Dataframes are a very useful tool that will be used throughout the course, and we strongly suggest that you familiarise yourself with them. Here is some introductory material: Pandas Tutorial: DataFrames in Python.


Now that we have loaded the data into a dataframe and printed it out, we will next compute some very basic statistics. First, here are the abbreviated column names:

  • CRIM: per capita crime rate by town

  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS: proportion of non-retail business acres per town

  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  • NOX: nitric oxides concentration (parts per 10 million)

  • RM: average number of rooms per dwelling

  • AGE: proportion of owner-occupied units built prior to 1940

  • DIS: weighted distances to five Boston employment centres

  • RAD: index of accessibility to radial highways

  • TAX: full-value property-tax rate per USD10,000

  • PTRATIO: pupil-teacher ratio by town

  • B: 1000 (Bk - 0.63)^2 where Bk is the proportion of blacks by town

  • LSTAT: % lower status of the population

  • MEDV: Median value of owner-occupied homes in USD1000's

The target column is MEDV and all the other columns are attributes.
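
The split between target and attributes can be sketched in pandas as follows. The miniature frame toy_frame and its values are hypothetical stand-ins for bostonHouseFrame, used only so the example runs on its own.

```python
import pandas as pd

# Hypothetical miniature frame with made-up values, standing in for bostonHouseFrame.
toy_frame = pd.DataFrame({
    'RM': [6.5, 6.1, 7.2],
    'LSTAT': [5.0, 9.1, 4.0],
    'MEDV': [24.0, 21.6, 34.7],
})

target = toy_frame['MEDV']                    # the column we want to predict
attributes = toy_frame.drop(columns='MEDV')   # every other column

print(attributes.columns.tolist())
print(target.name)
```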


Study the variables carefully and understand what they represent before moving to the next section.




Exploratory Data Analysis (EDA)

Often the first step in developing a machine learning solution for a given dataset is the EDA. EDA refers to the critical process of performing initial investigations on data so as to:

  • Maximize insight into a data set;

  • Uncover underlying structure;

  • Extract important variables;

  • Detect outliers and anomalies;

  • Test underlying assumptions;

  • Develop parsimonious models; and

  • Determine optimal factor settings.

with the help of summary statistics and graphical representations. The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

  • Plotting the raw data (such as data traces, histograms, bi-histograms, probability plots, lag plots, block plots, and Youden plots).

  • Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.

  • Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page.


⚠ Warning: EDA is a subjective process and will depend on the task & the data you have. There is no globally correct way of doing this. Usually you need to have a good understanding of the task before deciding what EDA techniques to use and continuously refine them based on the observations you make in the initial steps. Since we are still at the beginning of the course, let's explore some commonly used techniques. You will understand the significance of these methods and observations in terms of ML in the next couple of weeks.


Let's first see the shape of the dataframe.

bostonHouseFrame.shape

Output:

(506, 14)


☞ What does the above output tell you?

It is also a good practice to know the columns and their corresponding data types, along with finding whether they contain null values or not.


bostonHouseFrame.info()

Output:














☞ Are there any missing values in the dataset?

In pandas, any missing values in the data (your input CSV file) are represented as NaN.
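
One possible way to count missing values per column is to combine isna() with sum(). The sketch below uses a made-up frame (toy_frame) containing a single NaN, so that the result is easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one deliberately missing value.
toy_frame = pd.DataFrame({
    'RM': [6.5, np.nan, 7.2],
    'MEDV': [24.0, 21.6, 34.7],
})

# isna() marks each missing cell as True; summing counts the True values per column.
missing_per_column = toy_frame.isna().sum()
print(missing_per_column)
```

The same two calls can be applied to bostonHouseFrame to cross-check what info() reports.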

Next let's compute some summary statistics of the data we have read. The describe() function in pandas is very handy in getting various summary statistics. This function returns the count, mean, standard deviation, minimum and maximum values and the quantiles of the data.

bostonHouseFrame.describe()

Output:


☞ What insights did you get from the above output? Look closely at the attributes ZN and CHAS: do you see a difference in those two compared to the others?

Data comes in two principal types in statistics, and it is crucial that we recognize the differences between these two types of data.

  1. Categorical Variables: These are data points that take on a finite number of values, AND whose values do not have a numerical interpretation.

  • Ordinal categorical variables take on values which can be logically ordered. For example, the reviews for a product which are given as 0-5 stars.

  • Nominal categorical variables cannot be put in any logical order. Examples of this would be the gender, race, etc.


  2. Numerical Variables: These are variables which are numerical in nature.

  • Continuous numerical variables take on continuous values (no breaks). For example, height, weight.

  • Discrete numerical variables take on a set of values which can be counted. For example, the number of rooms in a house.


☞ Try to identify what type of data is in the bostonHouseFrame dataframe.

☞ What is the type of data for CHAS and RAD?
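
One rough heuristic that may help here, offered as a sketch rather than a rule, is to count the distinct values in each column: a column with only a handful of distinct values is a candidate for a categorical or discrete variable. The frame below is made up for illustration; only the column names are borrowed from the housing data.

```python
import pandas as pd

# Hypothetical values; only the column names come from the housing dataset.
toy_frame = pd.DataFrame({
    'CHAS': [0, 1, 0, 0, 1, 0],
    'RAD': [1, 2, 4, 4, 24, 24],
    'RM': [6.5, 6.1, 7.2, 6.9, 5.8, 6.0],
})

# nunique() returns the number of distinct values in each column.
distinct_counts = toy_frame.nunique()
print(distinct_counts)
```

A low count is only a hint: whether the values have a numerical interpretation still has to be judged from the variable descriptions.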

Data Distribution

One of the most important steps in EDA is estimating the distribution of a variable. Let's begin with a histogram plot.

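# Draw one histogram per column on a 4x5 grid (14 of the 20 slots are used)
plt.figure(figsize=(20,20))
for i, col in enumerate(bostonHouseFrame.columns):
    plt.subplot(4,5,i+1)
    # density=True normalises each histogram so that its area is 1
    plt.hist(bostonHouseFrame[col], alpha=0.3, color='b', density=True)
    plt.title(col)
    plt.xticks(rotation='vertical')
plt.show()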

Output:


Warning: Always question the bin sizes of a histogram to see whether they are appropriate for the plot being presented. If you see a histogram with illogically large or small bin sizes and/or uneven bin sizes beware of the results being presented!
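
To get a feel for how much the bin count matters, the following sketch bins the same synthetic sample three different ways with np.histogram. The data are randomly generated for illustration (they are not the housing data), and the bin counts 3, 10 and 50 are arbitrary choices.

```python
import numpy as np

# Synthetic values for illustration only; they are not taken from the housing data.
rng = np.random.default_rng(0)
values = rng.normal(loc=22.0, scale=9.0, size=506)

bin_widths = {}
tallest_bins = {}
for n_bins in (3, 10, 50):
    # np.histogram returns the count in each bin and the bin edges
    counts, edges = np.histogram(values, bins=n_bins)
    bin_widths[n_bins] = float(edges[1] - edges[0])
    tallest_bins[n_bins] = int(counts.max())
    print(n_bins, 'bins: width', round(bin_widths[n_bins], 2),
          '| tallest bin holds', tallest_bins[n_bins], 'values')
```

With very few bins the shape of the distribution is hidden, while with very many bins each bar holds only a few points and the plot becomes noisy. plt.hist accepts the same bins argument.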


☞ What observations did you make?

Observations:

  • Attribute CHAS is a categorical variable. Most data instances are from class 0 and only a few instances are from class 1.

  • Many attributes are heavily skewed, e.g. CRIM, ZN, DIS, AGE, B, ...

  • Attributes RAD and TAX have values that are far from the majority of values. Further investigation is needed.

  • Target variable MEDV is distributed around 22 with some extreme values around 50.
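
Skewness can also be checked numerically instead of only by eye. As a minimal sketch, pandas provides a skew() method; the two series below are made up so that one is symmetric and the other has a long right tail.

```python
import pandas as pd

# Hypothetical data: one symmetric series and one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 10, 40])

# skew() is close to 0 for symmetric data and positive for a long right tail.
print('symmetric:', symmetric.skew())
print('right-skewed:', right_skewed.skew())
```

Calling skew() on a dataframe returns one value per column, which can be compared against the histograms.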


A box plot is another useful tool for examining the data. Let's use a box plot to observe our target variable MEDV.


plt.boxplot(bostonHouseFrame['MEDV'])
plt.title('Median House Price')
plt.show()

Output:










How to read Box Plots:

  • The line in the middle of the box gives the median value.

  • The top of the box shows the 0.75 quantile (Q0.75).

  • The bottom of the box shows the 0.25 quantile (Q0.25).

  • So the height of the box is the interquartile range (IQR).

  • The top whisker —| extends to the largest data point that is no greater than Q0.75+1.5∗IQR, the upper cutoff for outliers using Tukey’s rule.

  • The bottom whisker —| extends to the smallest data point that is no less than Q0.25−1.5∗IQR, the lower cutoff for outliers using Tukey’s rule.

  • Any data points drawn individually (circles) lie beyond these cutoffs and are treated as outliers.

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.
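
The cutoffs from Tukey's rule can be computed directly. The sketch below does so with NumPy on a small made-up array; the variable names are illustrative, and np.percentile is used with its default linear interpolation, so other quantile conventions may give slightly different cutoffs.

```python
import numpy as np

# Made-up values with one obviously extreme observation.
values = np.array([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 50.0])

# Quartiles and interquartile range
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25

# Tukey's rule: anything beyond 1.5 * IQR from the box is flagged as an outlier
lower_cutoff = q25 - 1.5 * iqr
upper_cutoff = q75 + 1.5 * iqr
outliers = values[(values < lower_cutoff) | (values > upper_cutoff)]

print('cutoffs:', lower_cutoff, upper_cutoff)
print('outliers:', outliers)
```

Whether a flagged point is really abnormal remains a judgement call for the analyst, as noted above.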


Relationship between variables

In the previous section we observed each attribute (data column) independently. Sometimes it is also useful to observe the relationship between two variables. There are several techniques that we can use for this purpose. One of the key techniques is a scatter plot.

Since our task is to predict MEDV (target variable) using all other attributes, let's plot the relationship between MEDV and other columns.


For this we can use matplotlib. However, there is another Python package called seaborn that produces nice-looking figures. Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. You can learn more about seaborn at seaborn: statistical data visualization.



import seaborn as sns

# Scatter plot of every column against the target MEDV on a 4x5 grid
plt.figure(figsize=(20,20))
for i, col in enumerate(bostonHouseFrame.columns):
    plt.subplot(4,5,i+1)
    sns.scatterplot(data=bostonHouseFrame, x=col, y='MEDV')
    # sns.regplot(x=col,y='MEDV', data=bostonHouseFrame)
    plt.title(col)
    # rotate the tick labels of every subplot, not only the last one
    plt.xticks(rotation='vertical')
plt.show()

Output:


We have used the seaborn scatterplot function above. Explore the function documentation to identify its features.

Another tool that can be used is the seaborn regplot, which also plots data and a linear regression model fit. Try this yourself.

☞ What observations did you make?

Observations:

  • There seems to be a good linear relationship between MEDV and RM.

  • The relationship between MEDV and some variables appears to be nonlinear (e.g. LSTAT).

  • ...


Scatter plots give good information when the attribute you are examining is numerical. What if there are categorical attributes?

If we have a categorical attribute and a continuous variable, we can examine their relationship using a box plot. Let's see the relationship between MEDV and CHAS:


# One box of MEDV values per CHAS category
ax = sns.boxplot(y='MEDV', x='CHAS', data=bostonHouseFrame)
plt.title('CHAS')
plt.xticks(rotation='vertical')
plt.show()

Output:











☞ What observations did you make?

Observations:

  • On average, the house price for data instances with CHAS=1 is higher than for data instances with CHAS=0
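
A comparison like this can be backed up with numbers by grouping on the categorical column. The following is a minimal sketch on a made-up frame (toy_frame); the values are hypothetical and only the column names match the housing data.

```python
import pandas as pd

# Hypothetical values; only the column names come from the housing dataset.
toy_frame = pd.DataFrame({
    'CHAS': [0, 0, 0, 1, 1],
    'MEDV': [20.0, 22.0, 24.0, 30.0, 34.0],
})

# Group the target by category and summarise each group.
summary = toy_frame.groupby('CHAS')['MEDV'].agg(['count', 'median', 'mean'])
print(summary)
```

Including the count is useful because a group with very few instances gives a less reliable median or mean.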

Correlation is another important statistic when developing ML models. Let's plot the correlation matrix for the numerical data we have:


import seaborn as sns

f, ax = plt.subplots(figsize=(11, 9))
corr = bostonHouseFrame.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=90,
    horizontalalignment='right'
);

Output: