Exploratory Data Analysis (EDA) Assignment Help in Python Machine Learning

Updated: Oct 6, 2021

This post covers exploratory data analysis (EDA) in Python for machine learning. It may be useful if you are looking for EDA assignment help, project help, or homework help.


Topic: To visualise how honey production has changed over the years (1998-2016) in the United States.


Background:

In 2006, global concern was raised over the rapid decline in the honeybee population, an integral component of American honey agriculture. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon in which worker bees disappear, causing the remaining hive colony to collapse. Speculation as to the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. The U.S. used to locally produce over half the honey it consumes per year. Now, honey mostly comes from overseas, with 350 of the 400 million pounds of honey consumed every year originating from imports. This dataset provides insight into honey production supply and demand in America from 1998 to 2016.


Objective:

To visualise how honey production has changed over the years (1998-2016) in the United States.


Key questions to be answered:

  • How has honey production yield changed from 1998 to 2016?

  • Over time, what have the major production trends been across the states?

  • Are there any patterns that can be observed between total honey production and value of production every year? How has value of production, which in some sense could be tied to demand, changed every year?

Dataset:

  • state: Various states of the U.S.

  • numcol: Number of honey-producing colonies. Honey-producing colonies are the maximum number of colonies from which honey was taken during the year. It is possible to take honey from colonies that did not survive the entire year.

  • yieldpercol: Honey yield per colony. The unit is pounds.

  • totalprod: Total production (numcol x yieldpercol). The unit is pounds.

  • stocks: Refers to stocks held by producers. The unit is pounds.

  • priceperlb: Refers to average price per pound based on expanded sales. The unit is dollars.

  • prodvalue: Value of production (totalprod x priceperlb). The unit is dollars.

  • year: Year of production
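The data dictionary defines two derived columns, so a quick consistency check is possible. The sketch below uses a small made-up table (the values and the helper name check_derived are illustrative assumptions, not part of the original notebook) and allows a 1% tolerance, on the assumption that the published figures may be rounded.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up numbers (NOT taken from the real dataset).
toy = pd.DataFrame({
    "state": ["A", "B"],
    "numcol": [10000.0, 4000.0],
    "yieldpercol": [60, 50],
    "priceperlb": [1.50, 2.00],
})
toy["totalprod"] = toy["numcol"] * toy["yieldpercol"]
toy["prodvalue"] = toy["totalprod"] * toy["priceperlb"]


def check_derived(df, rtol=0.01):
    """Flag rows whose derived columns match the data-dictionary formulas.

    Hypothetical helper: rtol is the relative tolerance for rounding.
    """
    prod_ok = np.isclose(df["totalprod"], df["numcol"] * df["yieldpercol"], rtol=rtol)
    value_ok = np.isclose(df["prodvalue"], df["totalprod"] * df["priceperlb"], rtol=rtol)
    return pd.Series(prod_ok & value_ok, index=df.index)


consistent = check_derived(toy)
print(consistent.all())
```

On the real data, the same helper could be called as check_derived(honeyprod) once the CSV has been read in; rows flagged False would be worth inspecting by hand.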


Import the necessary packages - pandas, numpy, seaborn, matplotlib.pyplot

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
pd.set_option('display.float_format', lambda x: '%.5f' % x) # To suppress scientific notation in numerical display


Read in the dataset

honeyprod = pd.read_csv("honeyproduction1998-2016.csv")

View the first few rows of the dataset

honeyprod.head(10)

Output:

Observations: The dataset looks clean and consistent with the description provided in the Data Dictionary.

Check the shape of the dataset

honeyprod.shape

Output:

(785, 8)


Observations: The dataset has 785 observations (rows) and 8 columns.



Check the datatype of the variables to make sure that the data is read in properly


honeyprod.dtypes

Output:

state           object
numcol         float64
yieldpercol      int64
totalprod      float64
stocks         float64
priceperlb     float64
prodvalue      float64
year             int64
dtype: object


Observations:

  1. state has the object data type.

  2. year is currently an integer type. Since year is a categorical variable here, let us convert it to the category data type in Python.

  3. All the other variables are numerical, and therefore their Python data types (float64 and int64) are fine.


honeyprod.year = honeyprod.year.astype('category') # To convert year into a categorical variable
# Uncomment the following code to learn more about the astype function and its parameters
# help(honeyprod.astype)
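To see what the conversion changes, here is a minimal sketch on a made-up four-row table (the values are illustrative assumptions). It shows that a category column is no longer treated as numeric, and that it can be cast back to an integer when arithmetic on the year is needed.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "year": [1998, 1999, 1998, 2000],
    "totalprod": [10.0, 12.0, 8.0, 9.0],
})
toy["year"] = toy["year"].astype("category")

print(toy["year"].dtype)                 # category
print(list(toy["year"].cat.categories))  # the distinct years, in sorted order

# A category column is excluded when selecting numeric columns.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Cast back to int when the year is needed as a number.
years_as_int = toy["year"].astype(int)
span = years_as_int.max() - years_as_int.min()
```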

Let us analyse the quantitative variables in the dataset


honeyprod.describe()

Output:

Observations:

  1. The number of colonies per state spans a huge range, from 2,000 to 510,000.

  2. The mean numcol is close to the 75th percentile of the data, indicating a right skew.

  3. As expected, the standard deviation of numcol is very high.

  4. yieldpercol: the yield per colony also has a huge spread, ranging from 19 pounds to 136 pounds.

  5. In fact, all the variables seem to have a huge range. We will have to investigate further whether this spread is mainly across different states or varies within the same state over the years.
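One way to approach the last point is to compare the spread between state averages with the typical spread within a state over the years. The following is a rough sketch on made-up numbers (two hypothetical states, three years each), not a result from the real dataset.

```python
import pandas as pd

# Made-up colony counts for two hypothetical states over three years.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "numcol": [100.0, 110.0, 90.0, 1000.0, 1100.0, 900.0],
})

# Spread of the per-state averages (between-state variation).
state_means = toy.groupby("state")["numcol"].mean()
between_std = state_means.std()

# Average of the per-state standard deviations (within-state variation).
within_std = toy.groupby("state")["numcol"].std().mean()

# A mean above the median is a simple hint of right skew.
right_skewed = toy["numcol"].mean() > toy["numcol"].median()

print(between_std, within_std, right_skewed)
```

If between_std is much larger than within_std on the real data, most of the range seen in describe() would come from differences between states rather than from changes over time.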


Looking at the relationship between numerical variables using pair plots and correlation plots


sns.pairplot(honeyprod, diag_kind="kde")

Output:

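# Keep only the numeric columns: recent pandas versions raise an error if corr() meets a text column such as state
correlation = honeyprod.select_dtypes(include='number').corr() # creating a 2-D matrix of pairwise correlations
correlation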


Output:

# Uncomment the following code for information of the arguments
# help(sns.heatmap)
plt.figure(figsize=(15, 7))
sns.heatmap(correlation, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Output:

Observations:

  1. The number of colonies has a high positive correlation with total production, stocks and the value of production. As expected, all these values are highly correlated with each other.

  2. Yield per colony does not have a high correlation with any of the features that we have in our dataset.

  3. The same is the case with priceperlb.

  4. Determining the factors influencing per colony yield and price per pound of honey would need further investigation.
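Reading strong pairs off a heatmap by eye can be error-prone, so the pairs can also be listed programmatically. This is a minimal sketch on made-up data; the 0.8 cut-off for "strong" is an arbitrary assumption, not a threshold used in the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "numcol": [10.0, 20.0, 30.0, 40.0],
    "yieldpercol": [50, 40, 60, 45],
})
toy["totalprod"] = toy["numcol"] * toy["yieldpercol"]

# Correlate numeric columns only (the text column is dropped).
corr = toy.select_dtypes(include="number").corr()

# Keep the upper triangle so each pair appears once and the diagonal is skipped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

# Assumed threshold for a "strong" correlation.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```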


Let us now explore the categorical features - state and year

print(honeyprod.state.nunique())
print(honeyprod.year.nunique())

Output:

44
19

We have honey production data for 44 US states over a span of 19 years, from 1998 to 2016.
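Note that 44 states over 19 years would give 44 x 19 = 836 rows if every state reported every year, while the dataset has 785 rows, so some state-year combinations must be absent. A sketch of how such gaps could be located, using a made-up table with three hypothetical states:

```python
import pandas as pd

# Made-up state-year pairs; B is missing 2000 and C is missing 1999.
toy = pd.DataFrame({
    "state": ["A", "B", "C", "A", "B", "A", "C"],
    "year": [1998, 1998, 1998, 1999, 1999, 2000, 2000],
})

n_states = toy["state"].nunique()
n_years = toy["year"].nunique()
expected_rows = n_states * n_years
missing_rows = expected_rows - len(toy)

# How many states report in each year, and how many years each state covers.
states_per_year = toy.groupby("year")["state"].nunique()
years_per_state = toy.groupby("state")["year"].nunique()
incomplete_states = years_per_state[years_per_state < n_years].index.tolist()

print(missing_rows, incomplete_states)
```

This matters for the plots that follow: a yearly sum across states can move simply because a different set of states reported that year.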

Let us look at the overall trend of honey production in the US over the years

plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='totalprod', data=honeyprod, estimator=sum, ci=None)
plt.xticks(rotation=90) # To rotate the x axis labels
plt.show()

# Uncomment the following code to check the actual values
# honeyprod.groupby(['year'])['totalprod'].sum().reset_index()

Output:

Observations:

  1. The overall honey production in the US has been decreasing over the years.

  2. Total honey production = number of colonies * average yield per colony. Let us check if the honey production is decreasing due to one of these factors or both.

Variation in the number of colonies over the years

plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='numcol', data=honeyprod, ci=None, estimator=sum)
plt.xticks(rotation=90) # To rotate the x axis labels
plt.show()

Output:

Observations:

  1. The number of colonies across the country shows a declining trend from 1998 to 2008 but has seen an uptick since 2008.

  2. It is possible that there was some intervention in 2008 that helped increase the number of honey bee colonies across the country.
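The turning point can also be checked numerically rather than read off the plot. The sketch below uses made-up colony counts for four years (illustrative values only) to find the year with the fewest colonies and the year-over-year percentage change.

```python
import pandas as pd

# Made-up colony counts: two hypothetical states per year.
toy = pd.DataFrame({
    "year": [2006, 2006, 2007, 2007, 2008, 2008, 2009, 2009],
    "numcol": [50.0, 30.0, 45.0, 30.0, 40.0, 30.0, 44.0, 33.0],
})

# National total per year, as in the point plot with estimator=sum.
national = toy.groupby("year")["numcol"].sum()

# Year-over-year change in percent (first year has no predecessor, so NaN).
pct_change = national.pct_change() * 100

# Year with the lowest national total.
trough_year = national.idxmin()

print(trough_year)
print(pct_change.round(2))
```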


Variation of yield per colony over the years


plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='yieldpercol', data=honeyprod, estimator=sum, ci=None)
plt.xticks(rotation=90) # To rotate the x axis labels
plt.show()

Output:



Observations:

  1. In contrast to the number of colonies, the yield per colony has been decreasing since 1998.

  2. This indicates that it is not the number of colonies but the yield per colony that is causing the decline in total honey production.
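The plot above sums yieldpercol across states, and such a sum depends on how many states report in a given year. An alternative that may be worth cross-checking is a production-weighted national yield, total production divided by total colonies, which follows from totalprod = numcol x yieldpercol. The sketch below uses made-up numbers chosen to show that a weighted yield and a simple average across states can even move in opposite directions.

```python
import pandas as pd

# Made-up values: one large and one small hypothetical state per year.
toy = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999],
    "numcol": [1000.0, 100.0, 1000.0, 100.0],
    "yieldpercol": [80, 40, 70, 60],
})
toy["totalprod"] = toy["numcol"] * toy["yieldpercol"]

# Weighted yield: national production divided by national colony count.
by_year = toy.groupby("year")[["totalprod", "numcol"]].sum()
weighted_yield = by_year["totalprod"] / by_year["numcol"]

# Simple average: every state counts equally, regardless of size.
simple_mean_yield = toy.groupby("year")["yieldpercol"].mean()

print(weighted_yield.round(2))
print(simple_mean_yield)
```

In this toy case the simple average rises while the weighted yield falls, because the large state's yield dropped. Whether the real data behaves this way is not known from the plots alone.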


Let us look at the production trend at state level



# Add hue parameter to the pointplot to plot for each state
plt.figure(figsize=(15, 7)) # To resize the plot
sns.pointplot(x='year', y='totalprod', data=honeyprod, estimator=sum, ci=None, hue = 'state')
plt.legend(bbox_to_anchor=(1, 1))
plt.xticks(rotation=90) # To rotate the x axis labels
plt.show()

Output:

Observations: Some states have much higher production than the others, but this plot is a little hard to read. Let us try plotting each state separately for a better understanding.


Catplot:

sns.catplot(x='year', y='totalprod', data=honeyprod,
                estimator=sum, col='state', kind="point",
                height=3,col_wrap = 5)
plt.show()

Output:

Observations:

  1. The most prominent honey-producing states in the US are California, Florida, North Dakota, South Dakota and Montana.

  2. Unfortunately, the honey production in California has seen a steep decline over the years.

  3. Florida's total production also has been on a decline.

  4. South Dakota has more or less maintained its levels of production.

  5. North Dakota has actually seen an impressive increase in the honey production.
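The observations above are read from the grid of plots. As a cross-check, the leading states and their change between the first and last year can be computed directly. This is a minimal sketch on made-up data with three hypothetical states; picking the top 2 here (rather than 5) is only to keep the example small.

```python
import pandas as pd

# Made-up production figures for the first and last year only.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "C", "C"],
    "year": [1998, 2016, 1998, 2016, 1998, 2016],
    "totalprod": [100.0, 50.0, 40.0, 80.0, 10.0, 10.0],
})

# Rank states by production summed over all years.
top_states = toy.groupby("state")["totalprod"].sum().nlargest(2).index.tolist()

# One row per state, one column per year.
wide = toy.pivot(index="state", columns="year", values="totalprod")
first_year, last_year = wide.columns.min(), wide.columns.max()

# Percentage change from the first to the last year.
pct_change = (wide[last_year] / wide[first_year] - 1) * 100

print(top_states)
print(pct_change)
```

On the real data, a state missing its first or last year would show NaN in pct_change and would need separate handling.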


Let us look at the yearly trend in number of colonies and yield per colony in these 5 states


cplot1=sns.catplot(x='year', y='numcol', 
            data=honeyprod[honeyprod["state"].isin(["North Dakota","California","South Dakota","Florida","Montana"])],
                estimator=sum, col='state', kind="point",
                height=3,col_wrap = 5)
cplot1.set_xticklabels(rotation=90)
plt.show()

Output:


cplot2=sns.catplot(x='year', y='yieldpercol', 
            data=honeyprod[honeyprod["state"].isin(["North Dakota","California","South Dakota","Florida","Montana"])],
                estimator=sum, col='state', kind="point",
                height=3,col_wrap = 5)
cplot2.set_xticklabels(rotation=90)
plt.show()

Output: