Updated: Oct 6, 2021
Here we cover exploratory data analysis (EDA) in Python for machine learning.
Topic: To visualise how honey production has changed over the years (1998-2016) in the United States.
In 2006, global concern was raised over the rapid decline in the honeybee population, an integral component of American honey agriculture. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon of disappearing worker bees causing the remaining hive colony to collapse. Speculation as to the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. The U.S. used to locally produce over half the honey it consumes per year. Now, honey mostly comes from overseas, with 350 of the 400 million pounds of honey consumed every year originating from imports. This dataset provides insight into honey production supply and demand in America from 1998 to 2016.
Key questions to be answered:
How has honey production yield changed from 1998 to 2016?
Over time, what have the major production trends been across the states?
Are there any patterns that can be observed between total honey production and value of production every year? How has value of production, which in some sense could be tied to demand, changed every year?
state: Various states of U.S.
numcol: Number of honey-producing colonies. Honey-producing colonies are the maximum number of colonies from which honey was taken during the year. It is possible to take honey from colonies that did not survive the entire year.
yieldpercol: Honey yield per colony. Unit is pounds
totalprod: Total production (numcol x yieldpercol). Unit is pounds
stocks: Refers to stocks held by producers. Unit is pounds
priceperlb: Refers to average price per pound based on expanded sales. The unit is dollars.
prodvalue: Value of production (totalprod x priceperlb). The unit is dollars.
year: Year of production
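The data dictionary states two identities (totalprod = numcol x yieldpercol and prodvalue = totalprod x priceperlb), and these can be checked row by row. The sketch below is an illustration only: the helper name check_derived_columns, the 1% tolerance, and the two rows of data are assumptions made up for this example, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def check_derived_columns(df, rtol=0.01):
    """Flag, per row, whether the derived columns match their definitions.

    rtol is a hypothetical tolerance to allow for rounding in the source data.
    """
    prod_ok = np.isclose(df["totalprod"], df["numcol"] * df["yieldpercol"], rtol=rtol)
    value_ok = np.isclose(df["prodvalue"], df["totalprod"] * df["priceperlb"], rtol=rtol)
    return pd.DataFrame({"totalprod_ok": prod_ok, "prodvalue_ok": value_ok}, index=df.index)


# Made-up rows for illustration; the second prodvalue is deliberately wrong.
toy = pd.DataFrame({
    "numcol": [10000.0, 20000.0],
    "yieldpercol": [50, 60],
    "totalprod": [500000.0, 1200000.0],
    "priceperlb": [1.0, 2.0],
    "prodvalue": [500000.0, 2000000.0],
})

checks = check_derived_columns(toy)
print(checks)
```

On the real data, the same helper could be called as check_derived_columns(honeyprod) once the file has been read in. Rows that fail may simply reflect rounding in the published figures.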
Import the necessary packages - pandas, numpy, seaborn, matplotlib.pyplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings

pd.set_option('display.float_format', lambda x: '%.5f' % x)  # To suppress numerical display in scientific notation
Read in the dataset
honeyprod = pd.read_csv("honeyproduction1998-2016.csv")
View the first few rows of the dataset
Observations: The dataset looks clean and consistent with the description provided in the Data Dictionary.
Check the shape of the dataset
Observations: We have 785 observations of 8 columns
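The calls behind the two inspection steps above are not shown in the text; they were presumably honeyprod.head() and honeyprod.shape. A minimal sketch on a made-up frame with the same eight column names (the values and state labels are invented for illustration):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "state": ["A", "B", "C"],
    "numcol": [10000.0, 20000.0, 30000.0],
    "yieldpercol": [50, 60, 70],
    "totalprod": [500000.0, 1200000.0, 2100000.0],
    "stocks": [100000.0, 200000.0, 300000.0],
    "priceperlb": [1.0, 2.0, 0.5],
    "prodvalue": [500000.0, 2400000.0, 1050000.0],
    "year": [1998, 1998, 1998],
})

first_rows = toy.head(2)      # first two rows; head() defaults to five
n_rows, n_cols = toy.shape    # (number of observations, number of columns)

print(first_rows)
print(f"{n_rows} observations of {n_cols} columns")
```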
Check the datatype of the variables to make sure that the data is read in properly
state           object
numcol         float64
yieldpercol      int64
totalprod      float64
stocks         float64
priceperlb     float64
prodvalue      float64
year             int64
dtype: object
state is of object data type.
year is currently of integer type. Since year is a categorical variable here, let us convert it to the category data type.
All the other variables are numerical, and therefore their Python data types (float64 and int64) are fine.
honeyprod['year'] = honeyprod['year'].astype('category')  # To convert year into categories
# Uncomment the following code to learn more about the astype function and its attributes
# help(honeyprod.astype)
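One consequence of the conversion is worth seeing explicitly: once year is a category, pandas no longer treats it as a number, so it drops out of numeric-only selections. The sketch below uses a small made-up frame (values invented for illustration) and is not the original notebook cell.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "state": ["A", "A", "B"],
    "numcol": [10000.0, 11000.0, 20000.0],
    "year": [1998, 1999, 1998],
})

before = toy.dtypes.astype(str).to_dict()       # dtype names before conversion
toy["year"] = toy["year"].astype("category")
after = toy.dtypes.astype(str).to_dict()        # dtype names after conversion

# Only truly numeric columns remain in a numeric selection.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(before["year"], "->", after["year"])
print(numeric_cols)
```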
Let us analyse the quantitative variables in the dataset
The number of colonies per state is spread over a huge range, from 2,000 to 510,000.
The mean numcol is close to the 75th percentile of the data, indicating a right skew.
As expected, the standard deviation of numcol is very high.
yieldpercol - Yield per colony also has a huge spread, ranging from 19 pounds to 136 pounds.
In fact, all the variables seem to have a huge range. We will have to investigate further whether this spread is mainly across different states or varies within the same state over the years.
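The summary statistics discussed above presumably came from honeyprod.describe(). One way to approach the open question (is the spread mostly between states, or within a state over time?) is to compare the standard deviation of the per-state means with the average standard deviation inside each state. This is only a rough sketch under that assumption, run on made-up numbers:

```python
import pandas as pd

# Made-up rows for illustration: one large state and one small state.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "year": [1998, 1999, 2000] * 2,
    "numcol": [400000.0, 410000.0, 390000.0, 5000.0, 6000.0, 4000.0],
})

summary = toy["numcol"].describe()   # count, mean, std, min, quartiles, max

# Spread between states: how far apart the state averages are.
between_state_std = toy.groupby("state")["numcol"].mean().std()

# Spread within states: typical year-to-year variation inside one state.
within_state_std = toy.groupby("state")["numcol"].std().mean()

print(summary)
print(f"between states: {between_state_std:.0f}, within states: {within_state_std:.0f}")
```

If the between-state figure dominates, as it does in this toy example, the wide range is mainly a matter of state size rather than change over time.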
Looking at the relationship between numerical variables using pair plots and correlation plots
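# Creating a 2-D matrix of correlations between the numeric columns.
# The non-numeric columns (state, year) are excluded explicitly, since
# recent pandas versions raise an error if corr() meets string columns.
correlation = honeyprod.select_dtypes(include='number').corr()
correlation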
# Uncomment the following code for information of the arguments
# help(sns.heatmap)
plt.figure(figsize=(15, 7))
sns.heatmap(correlation, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
The number of colonies has a high positive correlation with total production, stocks and the value of production. As expected, all these values are highly correlated with each other.
Yield per colony does not have a high correlation with any of the features that we have in our dataset.
Same is the case with priceperlb.
Determining the factors influencing per colony yield and price per pound of honey would need further investigation.
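Reading strong relationships off a heatmap can be supplemented with a short listing of the column pairs whose correlation exceeds some cutoff. The sketch below is one possible way to do that; the helper name strong_pairs, the 0.8 threshold, and the data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.8):
    """List column pairs whose absolute correlation is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "numcol": [10.0, 20.0, 30.0, 40.0],
    "totalprod": [700.0, 1300.0, 2100.0, 2600.0],
    "priceperlb": [1.0, 3.0, 2.0, 2.5],
})

result = strong_pairs(toy)
print(result)
```

In this toy frame only the numcol/totalprod pair passes the cutoff, which mirrors the pattern described above without reproducing the real figures.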
Let us now explore the categorical features - state and year
We have honey production data for 44 US states over a span of 19 years, from 1998 to 2016.
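Counts like these can be obtained with nunique(). Note that 44 states over 19 years would give 44 x 19 = 836 rows, while the shape check reported 785, so some state-year combinations are presumably absent. The sketch below shows one way to find states with incomplete coverage; the data is made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration: state B has no record for 1999.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "year": [1998, 1999, 2000, 1998, 2000, 1998, 1999, 2000],
})

n_states = toy["state"].nunique()
n_years = toy["year"].nunique()

# Number of distinct years reported by each state.
years_per_state = toy.groupby("state")["year"].nunique()
incomplete = years_per_state[years_per_state < n_years]

# Rows that a complete state-by-year grid would have, minus what is present.
missing_rows = n_states * n_years - len(toy)

print(n_states, n_years, missing_rows)
print(incomplete)
```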
Let us look at the overall trend of honey production in the US over the years
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='totalprod', data=honeyprod, estimator=sum, ci=None)
plt.xticks(rotation=90)  # To rotate the x axis labels
plt.show()
# Uncomment the following code to check the actual values
# honeyprod.groupby(['year'])['totalprod'].sum().reset_index()
The overall honey production in the US has been decreasing over the years.
Total honey production = number of colonies * average yield per colony. Let us check if the honey production is decreasing due to one of these factors or both.
Variation in the number of colonies over the years
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='numcol', data=honeyprod, ci=None, estimator=sum)
plt.xticks(rotation=90)  # To rotate the x axis labels
plt.show()
The number of colonies across the country shows a declining trend from 1998 to 2008 but has seen an uptick since 2008.
It is possible that there was some intervention in 2008 that helped increase the number of honey bee colonies across the country.
Variation of yield per colony over the years
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='yieldpercol', data=honeyprod, estimator=sum, ci=None)
plt.xticks(rotation=90)  # To rotate the x axis labels
plt.show()
In contrast to the number of colonies, the yield per colony has been decreasing since 1998.
This indicates that it is not the number of colonies that is causing the decline in total honey production but the yield per colony.
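The plot above sums yieldpercol across states, which also depends on how many states report in a given year. As a cross-check, a national yield per colony can be computed as total production divided by total colonies for each year, which is a colony-weighted average. This is an alternative view suggested here, not the method used in the original analysis, and the numbers are made up.

```python
import pandas as pd

# Made-up rows for illustration: colonies rise slightly while yield falls.
toy = pd.DataFrame({
    "state": ["A", "B", "A", "B"],
    "year": [1998, 1998, 2016, 2016],
    "numcol": [100000.0, 10000.0, 110000.0, 10000.0],
    "yieldpercol": [80, 60, 50, 55],
})
toy["totalprod"] = toy["numcol"] * toy["yieldpercol"]

# National totals per year.
yearly = toy.groupby("year")[["numcol", "totalprod"]].sum()

# Colony-weighted national yield: pounds produced per colony nationwide.
yearly["weighted_yield"] = yearly["totalprod"] / yearly["numcol"]

# The quantity shown in the plot: plain sum of state-level yields.
yearly["summed_yield"] = toy.groupby("year")["yieldpercol"].sum()

print(yearly)
```

In the toy data, colonies go up between the two years while total production goes down, so the fall has to come from the yield term, which is the same reasoning applied to the real plots above.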
Let us look at the production trend at state level
# Add hue parameter to the pointplot to plot for each state
plt.figure(figsize=(15, 7))  # To resize the plot
sns.pointplot(x='year', y='totalprod', data=honeyprod, estimator=sum, ci=None, hue='state')
plt.legend(bbox_to_anchor=(1, 1))
plt.xticks(rotation=90)  # To rotate the x axis labels
plt.show()
Observations: Some states have much higher production than others, but this plot is a little hard to read. Let us try plotting each state separately for a better understanding.
sns.catplot(x='year', y='totalprod', data=honeyprod,
            estimator=sum, col='state', kind="point",
            height=3, col_wrap=5)
plt.show()
The most prominent honey-producing states of the US are California, Florida, North Dakota, South Dakota and Montana.
Unfortunately, the honey production in California has seen a steep decline over the years.
Florida's total production also has been on a decline.
South Dakota has more or less maintained its levels of production.
North Dakota has actually seen an impressive increase in honey production.
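Observations like these can also be backed with numbers rather than read off small panels. A minimal sketch, assuming one row per state and year: rank states by total production over the period, then compute each state's percentage change between the first and last year. The state labels and values below are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "C", "C"],
    "year": [1998, 2016, 1998, 2016, 1998, 2016],
    "totalprod": [30e6, 12e6, 25e6, 38e6, 1e6, 1.2e6],
})

# Largest producers over the whole period.
top_states = toy.groupby("state")["totalprod"].sum().nlargest(2)

# One row per state, one column per year.
by_year = toy.pivot(index="state", columns="year", values="totalprod")

# Percentage change from the first to the last year.
pct_change = (by_year[2016] - by_year[1998]) / by_year[1998] * 100

print(top_states)
print(pct_change.round(1))
```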
Let us look at the yearly trend in number of colonies and yield per colony in these 5 states
cplot1 = sns.catplot(
    x='year', y='numcol',
    data=honeyprod[honeyprod["state"].isin(
        ["North Dakota", "California", "South Dakota", "Florida", "Montana"])],
    estimator=sum, col='state', kind="point", height=3, col_wrap=5)
cplot1.set_xticklabels(rotation=90)
plt.show()
cplot2 = sns.catplot(
    x='year', y='yieldpercol',
    data=honeyprod[honeyprod["state"].isin(
        ["North Dakota", "California", "South Dakota", "Florida", "Montana"])],
    estimator=sum, col='state', kind="point", height=3, col_wrap=5)
cplot2.set_xticklabels(rotation=90)
plt.show()