top of page

Handling Missing Data and Tidying up Missing Information | How to Handle Missing Data | Realcode4you

Missing Information

Introduction

Very often we find ourselves having to work on data set with missing values. There are many representations of missing values. They are known as NULL values in database; in programming language it could be represented as None or NA. Sometimes, missing values are shown as an empty string '', or even numeric values such as 88 or 99.


As we have shown earlier, Pandas displays missing values as NaN.


Testing for Missing Value

Pandas has built-in method to test for a missing value.


# import missing value defined in numpy library
from numpy import NaN, NAN, nan
import pandas as pd
print pd.isnull(NaN)

Out:

True


print pd.isnull(888), pd.isnull('TEST')

Out:

False False


There is also the method to test for non-missing values:


print pd.notnull(NaN), pd.notnull(888), pd.notnull('test')

Out:

False True True


Note that missing values are different from other types of data. There is no concept of equality. It is not equivalent to 0 or empty string or even other missing values, as illustrated by:


print (NaN == NaN)

out:

False


Cleaning Missing Data

There are many techniques we can use to deal with missing data. For example, we can fill in the missing data with another value, fill in the missing data using existing data, or drop the data from our data set.


Replace

Let’s assume that we have a dataframe named myset. If we want the missing values to be recoded as 0, then we use:

myset.fillna(0)

Note that the original data has not be replaced yet. If we want the old missing values to be replaced, we use

myset.fillna(0, inplace=True) 

Fill Forward

We use the fill data forward method to recode data, by replacing the last known/recorded value.

myset.fillna(method='ffill') 

The table below shows how data in a column will be recoded using this method. Notice that the first two rows are not changed because there is no data at first.


Before After

NaN NaN

NaN NaN

2 2

3 3

4 4

NaN 4

NaN 4

5 5

6 6

7 7


Fill Backward

The fill data backward method is the reverse method. The newest recorded value is used to replace the missing data. If the column ends with missing data, then it will remain missing because there is no new value to get.


myset.fillna(method='bfill')

Before After

NaN 2

NaN 2

2 2

NaN 4

4 4

NaN NaN

NaN NaN



Interpolate

Interpolation uses existing values to fill in missing values. Pandas fills in the missing values linearly, by treating the missing values as if they are equally spaced apart. Because interpolation needs two data points (the start and the end), if there are missing values at the start or end of the column, they will not be recoded.


Before After

NaN NaN

NaN NaN

2 2

NaN 3

4 4

6 6

NaN 8

10 10

NaN NaN



Drop Value

The last way to work with missing data is to drop observations or variables with missing data. The syntax is:

set.dropna()

It is vital that you examine the data set carefully before you execute this. Sometimes, missing data is not random. Dropping missing values might leave you with a biased data set, or perhaps leaving you with insufficient data to run analysis. The dropna() method allows you to specify parameters to fine-tune results. For example, if you want to create a copy of the dataframe where each row has at least two non-missing values, you can write:

set = set.dropna(thresh=2)

Calculations with Missing Data

What happens when we manipulate and apply calculations on columns with missing data? To illustrate, let’s create a dataframe with some missing values.


from numpy import NaN
import pandas as pd
df = pd.DataFrame ([ 
 [1,2,3,4,NaN] , 
 [1,2,3,NaN,5] ,
 [NaN,2,3,4,5],
 [1,2,3,4,5]
])

print df

out:

0 1 2 3 4

0 1.0 2 3 4.0 NaN

1 1.0 2 3 NaN 5.0

2 NaN 2 3 4.0 5.0

3 1.0 2 3 4.0 5.0



Columns 0, 3 and 4 contain data points with missing values.


Say we want to add the values of column 0, 3 and 4, and present the information in a new column called result:

df['result']= df[0]+df[3]+df[4]
print df

Out:

0 1 2 3 4 result

0 1.0 2 3 4.0 NaN NaN

1 1.0 2 3 NaN 5.0 NaN

2 NaN 2 3 4.0 5.0 NaN

3 1.0 2 3 4.0 5.0 10.0



You will notice that only the fourth row yields a result. For the rest of the row, because there is at least one missing value in the columns concerned, there will not be any calculated result. Calculations with missing values will typically return a missing value, unless the function or method called has a means to ignore missing values.


Examples of built-in methods that can ignore missing values include mean and sum. These functions will typically have a skipna parameter to calculate a value by skipping over missing values.


Using the dataframe that we just created to illustrate:

print df[0].sum()
print df[0].sum(skipna=False)

Out:

3.0

nan


bottom of page