Project Objective:
Select any uncleaned Data set of your own on the Retail sales topic.
Use different cleaning techniques to clean the data set.
Frame 7 objectives in questions format then clean your data based on your objectives and visualize it in different graphs with solutions for the framed questions.
Reference Project:
Please find below the attached Zomato analysis project file for your reference.
Upload the given file (Notebook.ipynb) in your jupyter notebook.
This Zomato Analysis is a model reference for your understanding.
Use the same techniques as used in Zomato analysis for data cleaning, removing nan values and visualizing the graphs.
Software and Tools to be used :
Use the Python Jupyter tool to do the analysis.
Share the analysis file in notebook.ipynb file format to run in my machine.
Data Set Details:
You can choose your own data set that meets the following below guidelines for the cleaning process.
The data set that you choose must be in the Retails sales topic to do your analysis.
Task Instruction and Scope:
Create functions and methods for the data cleaning process, removing duplicate records, removing outliers, filling nan values, handling nan values which are all required for analysis.
First specify the objective for your analysis to be done.
Please provide recommendations and solutions for each and every analysis that you do.
Please use 5 to 6 cleaning methods to clean the data.
Please provide 7 to 8 visualization graphs and for each visual provide the solutions and recommendations which were made for our analysis.
Here the scope of the project is to have various data cleaning techniques and analyzing techniques.
There should be conversion of columns and its values based on that you should have analysis and solutions.
There should be adding up new columns and renaming the columns that should reflect in our analysis.
Frame a pivot table from the given data set and visualize the graph for the pivot table then frame a solution based on the pivot data.
There should be methods and functions to be defined for filling up nan values and handling the nan values. Suppose if nan values are not available you can add or modify the data set with nan values.
There should be some string operations to split and convert the data type into numeric data then go for analysis.
There should be methods and functions to be defined for handling outliers. Suppose if outliers are not available you can add or modify the data set with outliers.
Graphs should be used as line graph, bar graph, pie chart, box plot, histogram, join graph and scatter plot.
For each graph there should be different solutions and different objectives to be framed in question format.
For each objective data should be cleaned based on the objective question we have then visualize it accordingly to have a meaningful analysis.
Frame 6 to 7 objectives questions and visualize and provide solutions based on analysis. your analysis should be based on complex cleaning techniques.
No need to get into machine learning models which is not required, just focus on different techniques of cleaning datasets and visualize solutions for the framed objectives.
Zomato Data Set Analysis and Visualization
## Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Reading CSV
df = pd.read_csv('C:\\Users\\SIDDHARTH\\Desktop\\New_dir1\\zomato.csv')
df.head()
output:
Shape of the Dataset
df.shape
output:
(51717, 17)
Dataset Columns
df.columns
output:
Index(['url', 'address', 'name', 'online_order', 'book_table', 'rate', 'votes', 'phone', 'location', 'rest_type', 'dish_liked', 'cuisines', 'approx_cost(for two people)', 'reviews_list', 'menu_item', 'listed_in(type)', 'listed_in(city)'], dtype='object')
df = df.drop(['url', 'address', 'phone', 'menu_item', 'dish_liked', 'reviews_list'], axis = 1)
df.head()
output:
df.info()
output:
<class 'pandas.core.frame.DataFrame'> RangeIndex: 51717 entries, 0 to 51716 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 51717 non-null object 1 online_order 51717 non-null object 2 book_table 51717 non-null object 3 rate 43942 non-null object 4 votes 51717 non-null int64 5 location 51696 non-null object 6 rest_type 51490 non-null object 7 cuisines 51672 non-null object 8 approx_cost(for two people) 51371 non-null object 9 listed_in(type) 51717 non-null object 10 listed_in(city) 51717 non-null object dtypes: int64(1), object(10) memory usage: 4.3+ MB
Dropping Duplicates
df.duplicated().sum()
output:
108
df.drop_duplicates(inplace = True)
df.shape
output:
(51609, 11)
Cleaning Rate Column
df['rate'].unique()
output:
array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5', '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5', '4.3/5', 'NEW', '2.9/5', '3.5/5', nan, '2.6/5', '3.8 /5', '3.4/5', '4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5', '3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5', '3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5', '4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5', '4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5', '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)
value='2.0/5'
x=value.split("/")
print(x)
x[0]
output:
['2.0', '5']
output:
'2.0'
Removing "NEW" , "-" and "/5" from Rate Column
def handlerate(value):
if(value=='NEW' or value=='-'):
return np.nan
else:
value = str(value).split('/')
value = value[0]
return float(value)
df['rate'] = df['rate'].apply(handlerate)
df['rate'].head()
df.info()
output:
<class 'pandas.core.frame.DataFrame'> Index: 51609 entries, 0 to 51716 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 51609 non-null object 1 online_order 51609 non-null object 2 book_table 51609 non-null object 3 rate 41590 non-null float64 4 votes 51609 non-null int64 5 location 51588 non-null object 6 rest_type 51382 non-null object 7 cuisines 51564 non-null object 8 approx_cost(for two people) 51265 non-null object 9 listed_in(type) 51609 non-null object 10 listed_in(city) 51609 non-null object dtypes: float64(1), int64(1), object(9) memory usage: 4.7+ MB
df.isna().sum()
output:
(df.isna().sum()/df.shape[0])*100
output:
Filling Null Values in Rate Column with Mean
df['rate'].fillna(df['rate'].mean(), inplace = True)
df['rate'].isnull().sum()
Dropping Null Values
df.dropna(inplace = True)
df.head()
output:
df.rename(columns = {'approx_cost(for two people)':'Cost2plates', 'listed_in(type)':'Type'}, inplace = True)
df.head()
output:
df['location'].unique()
output:
array(['Banashankari', 'Basavanagudi', 'Mysore Road', 'Jayanagar', 'Kumaraswamy Layout', 'Rajarajeshwari Nagar', 'Vijay Nagar', 'Uttarahalli', 'JP Nagar', 'South Bangalore', 'City Market', 'Nagarbhavi', 'Bannerghatta Road', 'BTM', 'Kanakapura Road', 'Bommanahalli', nan, 'CV Raman Nagar', 'Electronic City', 'HSR', 'Marathahalli', 'Sarjapur Road', 'Wilson Garden', 'Shanti Nagar', 'Koramangala 5th Block', 'Koramangala 8th Block', 'Richmond Road', 'Koramangala 7th Block', 'Jalahalli', 'Koramangala 4th Block', 'Bellandur', 'Whitefield', 'East Bangalore', 'Old Airport Road', 'Indiranagar', 'Koramangala 1st Block', 'Frazer Town', 'RT Nagar', 'MG Road', 'Brigade Road', 'Lavelle Road', 'Church Street', 'Ulsoor', 'Residency Road', 'Shivajinagar', 'Infantry Road', 'St. Marks Road', 'Cunningham Road', 'Race Course Road', 'Commercial Street', 'Vasanth Nagar', 'HBR Layout', 'Domlur', 'Ejipura', 'Jeevan Bhima Nagar', 'Old Madras Road', 'Malleshwaram', 'Seshadripuram', 'Kammanahalli', 'Koramangala 6th Block', 'Majestic', 'Langford Town', 'Central Bangalore', 'Sanjay Nagar', 'Brookefield', 'ITPL Main Road, Whitefield', 'Varthur Main Road, Whitefield', 'KR Puram', 'Koramangala 2nd Block', 'Koramangala 3rd Block', 'Koramangala', 'Hosur Road', 'Rajajinagar', 'Banaswadi', 'North Bangalore', 'Nagawara', 'Hennur', 'Kalyan Nagar', 'New BEL Road', 'Jakkur', 'Rammurthy Nagar', 'Thippasandra', 'Kaggadasapura', 'Hebbal', 'Kengeri', 'Sankey Road', 'Sadashiv Nagar', 'Basaveshwara Nagar', 'Yeshwantpur', 'West Bangalore', 'Magadi Road', 'Yelahanka', 'Sahakara Nagar', 'Peenya'], dtype=object)
df['listed_in(city)'].unique()
out:
array(['Banashankari', 'Bannerghatta Road', 'Basavanagudi', 'Bellandur', 'Brigade Road', 'Brookefield', 'BTM', 'Church Street', 'Electronic City', 'Frazer Town', 'HSR', 'Indiranagar', 'Jayanagar', 'JP Nagar', 'Kalyan Nagar', 'Kammanahalli', 'Koramangala 4th Block', 'Koramangala 5th Block', 'Koramangala 6th Block', 'Koramangala 7th Block', 'Lavelle Road', 'Malleshwaram', 'Marathahalli', 'MG Road', 'New BEL Road', 'Old Airport Road', 'Rajajinagar', 'Residency Road', 'Sarjapur Road', 'Whitefield'], dtype=object)
Listed in(city) and location, both are there, lets keep only one
df = df.drop(['listed_in(city)'], axis = 1)
df['Cost2plates'].unique()
output:
array(['800', '300', '600', '700', '550', '500', '450', '650', '400', '900', '200', '750', '150', '850', '100', '1,200', '350', '250', '950', '1,000', '1,500', '1,300', '199', '80', '1,100', '160', '1,600', '230', '130', '50', '190', '1,700', nan, '1,400', '180', '1,350', '2,200', '2,000', '1,800', '1,900', '330', '2,500', '2,100', '3,000', '2,800', '3,400', '40', '1,250', '3,500', '4,000', '2,400', '2,600', '120', '1,450', '469', '70', '3,200', '60', '560', '240', '360', '6,000', '1,050', '2,300', '4,100', '5,000', '3,700', '1,650', '2,700', '4,500', '140'], dtype=object)
Removing , from Cost2Plates Column
def handlecomma(value):
value = str(value)
if ',' in value:
value = value.replace(',', '')
return float(value)
else:
return float(value)
df['Cost2plates'] = df['Cost2plates'].apply(handlecomma)
df['Cost2plates'].unique()
output:
array([ 800., 300., 600., 700., 550., 500., 450., 650., 400., 900., 200., 750., 150., 850., 100., 1200., 350., 250., 950., 1000., 1500., 1300., 199., 80., 1100., 160., 1600., 230., 130., 50., 190., 1700., nan, 1400., 180., 1350., 2200., 2000., 1800., 1900., 330., 2500., 2100., 3000., 2800., 3400., 40., 1250., 3500., 4000., 2400., 2600., 120., 1450., 469., 70., 3200., 60., 560., 240., 360., 6000., 1050., 2300., 4100., 5000., 3700., 1650., 2700., 4500., 140.])
Contact us at realcode4you@gmail.com to unlock complete solution of above problem or any other data analysis projects where you get completed datasets or uncleaned dataset.
Comments