Data wrangling is the process of converting data from its initial format into a format that is easier to read and better suited for analysis.
This tutorial uses the following data set:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Import pandas
Open Jupyter Notebook or any online Jupyter notebook editor and import pandas:
import pandas as pd
import matplotlib.pylab as plt
Read the data and add headers
filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style",
"drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
"num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower",
"peak-rpm", "city-mpg", "highway-mpg", "price"]
Read CSV
df = pd.read_csv(filename, names = headers)
Show data in tabular form
df.head()
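The read step can be tried without a network connection. The sketch below is only an illustration: the three-row sample and the shortened header list are hypothetical stand-ins for the real file and its 26 column names. It shows that passing names= makes pandas treat the first line of a headerless file as data rather than as column labels.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real, headerless file.
sample_csv = io.StringIO(
    "3,?,alfa-romero,13495\n"
    "1,?,alfa-romero,16500\n"
    "2,164,audi,13950\n"
)

# Hypothetical shortened header list that matches the four sample columns.
sample_headers = ["symboling", "normalized-losses", "make", "price"]

# Because names= is given, the first line is read as data, not as a header.
sample_df = pd.read_csv(sample_csv, names=sample_headers)
print(sample_df.head())
```

All three lines end up as rows, and the "?" placeholders are still plain text at this point.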
The data is displayed in tabular form. Looking at it, you will face some challenges like these:
identify missing data
deal with missing data
correct data format
Identify and handle missing values
Identify missing values
Convert "?" to NaN
In this data set, missing data is marked with a question mark "?". We replace "?" with NaN (Not a Number), the marker pandas uses for missing values.
Example:
import numpy as np
# replace "?" with NaN
df.replace("?", np.nan, inplace=True)
df.head(5)
In the first five rows, every cell that held "?" now shows NaN.
How to detect missing data:
There are two methods for detecting missing data.
.isnull() - returns True where data is missing and False everywhere else.
.notnull() - returns True where data is present and False where it is missing.
Example:
mis_value = df.isnull()
mis_value.head(5)
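Count missing values in each column
Using a for loop:
Example:
Run this for loop to see the result. In each column's output, True counts the missing values and False counts the values that are present.
for column in mis_value.columns.values.tolist():
    print(column)
    print(mis_value[column].value_counts())
    print("")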
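As a shorter alternative to the loop, the True values can be summed directly, because pandas counts True as 1 and False as 0. This is a minimal sketch on a hypothetical three-row frame named demo, not the tutorial's own data.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with a few missing values.
demo = pd.DataFrame({
    "normalized-losses": [np.nan, np.nan, 164.0],
    "make": ["alfa-romero", "alfa-romero", "audi"],
    "price": [13495.0, np.nan, 13950.0],
})

# isnull() gives a True/False table; sum() adds up the True values per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```

The result is one number per column, which is easier to scan than the loop's output when there are many columns.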
How to deal with missing data
Drop data
drop the whole row: suppose a value we cannot do without, such as price, is missing in a row. Then we remove the whole row.
drop the whole column: if most of the values in a column are missing, the column carries little information, so we can remove the whole column.
Replace data
replace it by mean
replace it by frequency: fill the missing values with the most frequent value in the column. For example, if 84% of the known values are "good", the missing entries are filled with "good".
replace it based on other functions
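Two of the options above, dropping rows and replacing by frequency, can be sketched as follows. The frame cars and its values are hypothetical. The column names are borrowed from the header list, and the choice of which column to drop on and which to fill is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing price and one missing door count.
cars = pd.DataFrame({
    "num-of-doors": ["four", "two", np.nan, "four"],
    "price": [13495.0, np.nan, 13950.0, 17450.0],
})

# Drop the whole row wherever price is missing, then renumber the rows.
cars = cars.dropna(subset=["price"]).reset_index(drop=True)

# Replace by frequency: find the most common value and fill the gaps with it.
most_common = cars["num-of-doors"].value_counts().idxmax()
cars["num-of-doors"] = cars["num-of-doors"].fillna(most_common)

print(cars)
```

Resetting the index after dropping rows keeps the row labels consecutive, which avoids surprises when rows are later selected by position.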
Calculate the average of a column
Example
avg = df["column_name"].astype("float").mean(axis=0)
print("Average of column_name:", avg)
Replace NaN with the mean value of that column
Example
# assign the result back so the change is stored in df
df["column_name"] = df["column_name"].replace(np.nan, avg)
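The whole sequence, from "?" placeholders to a filled numeric column, can be tried on a tiny example. This is a sketch under assumptions: the frame demo and its four values are hypothetical, and fillna is used here in place of replace for the final step.

```python
import numpy as np
import pandas as pd

# Hypothetical column read as text, with "?" marking missing values.
demo = pd.DataFrame({"normalized-losses": ["164", "?", "158", "?"]})

# Step 1: turn the "?" placeholders into NaN.
demo = demo.replace("?", np.nan)

# Step 2: convert to float and compute the mean; NaN is skipped by default.
avg = demo["normalized-losses"].astype("float").mean()

# Step 3: store the column as float with the gaps filled by the mean.
demo["normalized-losses"] = demo["normalized-losses"].astype("float").fillna(avg)

print("Average:", avg)
print(demo)
```

Converting the column to float before filling means the stored result is numeric, so later calculations do not need another astype call.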
How to count the values in each column separately
Use the value_counts() function, which returns how many times each distinct value appears in a column.
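A minimal sketch of value_counts() on a single column follows. The Series doors and its five values are hypothetical.

```python
import pandas as pd

# Hypothetical column of door counts.
doors = pd.Series(["four", "two", "four", "four", "two"], name="num-of-doors")

# value_counts() lists each distinct value with the number of times it occurs,
# sorted from most to least frequent.
counts = doors.value_counts()
print(counts)

# idxmax() returns the label with the highest count, i.e. the most frequent value.
print("Most frequent:", counts.idxmax())
```

The most frequent value found this way is the one used when replacing missing data by frequency.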