Handling Missing Data

Real-world data is rarely clean and homogeneous. In many cases, the dataset of interest will have some amount of missing data. To make matters more complicated, different data sources may indicate missing data in different ways.

In this module, we will discuss some general considerations for missing data, see how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. We refer to missing data generically as null, NaN, or NA values.

Before we start, let's make sure the Pandas and matplotlib packages are installed.

!pip install pandas matplotlib
output
Requirement already satisfied: pandas in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (0.25.0)
Requirement already satisfied: matplotlib in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (3.1.1)
Requirement already satisfied: numpy>=1.13.3 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (1.17.0)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2.8.0)
Requirement already satisfied: pytz>=2017.2 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2019.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (2.4.1.1)
Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: six>=1.5 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)
Requirement already satisfied: setuptools in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (39.1.0)
# Now import pandas into your notebook as pd
import pandas as pd

Now import the surveys.csv dataset into our notebook, as we did in the previous lesson.

surveys_df = pd.read_csv("surveys.csv")

Using masks to identify a specific condition

A mask can be useful to locate where a particular subset of values exists or doesn't exist - for example, NaN, or "Not a Number", values. To understand masks, we also need to understand Boolean objects in Python.

Boolean values are True and False. For example,

# set value of x to be 5
x = 5
x > 5
output
False
x == 5
output
True

Finding Missing Values

Let's identify all locations in the survey data that have null (missing or NaN) data values. We can use the isnull method to do this. The isnull method checks each cell for a null value: if an element is null, it is assigned a value of True in the output object.

pd.isnull(surveys_df).head()
record_id month day year site_id species_id sex hindfoot_length weight
0 False False False False False False False False True
1 False False False False False False False False True
2 False False False False False False False False True
3 False False False False False False False False True
4 False False False False False False False False True
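Because True sums as 1, summing the mask gives a per-column count of missing values. A minimal sketch on a toy DataFrame (the column names mirror the survey data, but the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with deliberately missing values
df = pd.DataFrame({
    "species_id": ["NL", "DM", None, "PF"],
    "weight": [np.nan, 40.0, 38.0, np.nan],
})

# Summing the Boolean mask counts the NaNs in each column
print(pd.isnull(df).sum())
```

This is often the quickest first look at how much data is missing, and where.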

How to select rows with missing data

To select the rows where there are null values, we can use the mask as an index to subset our data as follows:

# To select only the rows with NaN values, we can use the 'any()' method
surveys_df[pd.isnull(surveys_df).any(axis=1)]
record_id month day year site_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN
6 7 7 16 1977 2 PE F NaN NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
9 10 7 16 1977 6 PF F 20.0 NaN
10 11 7 16 1977 5 DS F 53.0 NaN
11 12 7 16 1977 7 DM M 38.0 NaN
12 13 7 16 1977 3 DM M 35.0 NaN
13 14 7 16 1977 8 DM NaN NaN NaN
14 15 7 16 1977 6 DM F 36.0 NaN
15 16 7 16 1977 4 DM F 36.0 NaN
16 17 7 16 1977 3 DS F 48.0 NaN
17 18 7 16 1977 2 PP M 22.0 NaN
18 19 7 16 1977 4 PF NaN NaN NaN
19 20 7 17 1977 11 DS F 48.0 NaN
20 21 7 17 1977 14 DM F 34.0 NaN
21 22 7 17 1977 15 NL F 31.0 NaN
22 23 7 17 1977 13 DM M 36.0 NaN
23 24 7 17 1977 13 SH M 21.0 NaN
24 25 7 17 1977 9 DM M 35.0 NaN
25 26 7 17 1977 15 DM M 31.0 NaN
26 27 7 17 1977 15 DM M 36.0 NaN
27 28 7 17 1977 11 DM M 38.0 NaN
28 29 7 17 1977 11 PP M NaN NaN
29 30 7 17 1977 10 DS F 52.0 NaN
... ... ... ... ... ... ... ... ... ...
35187 35188 11 10 2002 10 NaN NaN NaN NaN
35256 35257 12 7 2002 22 PB M 26.0 NaN
35259 35260 12 7 2002 21 PB F 24.0 NaN
35277 35278 12 7 2002 20 AH NaN NaN NaN
35279 35280 12 7 2002 16 PB M 28.0 NaN
35322 35323 12 8 2002 11 AH NaN NaN NaN
35328 35329 12 8 2002 11 PP M NaN 16.0
35370 35371 12 8 2002 14 AH NaN NaN NaN
35378 35379 12 8 2002 15 PB F 26.0 NaN
35384 35385 12 8 2002 10 NaN NaN NaN NaN
35387 35388 12 29 2002 1 DO M 35.0 NaN
35403 35404 12 29 2002 2 NL F 30.0 NaN
35448 35449 12 29 2002 20 OT F 20.0 NaN
35452 35453 12 29 2002 20 PB M 28.0 NaN
35457 35458 12 29 2002 20 AH NaN NaN NaN
35477 35478 12 29 2002 24 AH NaN NaN NaN
35485 35486 12 29 2002 16 DO M 37.0 NaN
35495 35496 12 31 2002 4 PB NaN NaN NaN
35510 35511 12 31 2002 11 DX NaN NaN NaN
35511 35512 12 31 2002 11 US NaN NaN NaN
35512 35513 12 31 2002 11 US NaN NaN NaN
35514 35515 12 31 2002 11 SF F 27.0 NaN
35519 35520 12 31 2002 9 SF NaN 24.0 36.0
35527 35528 12 31 2002 13 US NaN NaN NaN
35529 35530 12 31 2002 13 OT F 20.0 NaN
35530 35531 12 31 2002 13 PB F 27.0 NaN
35543 35544 12 31 2002 15 US NaN NaN NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35548 35549 12 31 2002 5 NaN NaN NaN NaN

4873 rows × 9 columns

Explanation

Notice that we have 4873 observations/rows that contain one or more missing values. That's roughly 14% of the data.

We have used the [] convention to select a subset of the data.

More information about slicing and indexing can be found here.

any(axis=1) follows the NumPy axis convention: it collapses across the columns, returning True for each row that contains at least one null value.
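The effect of the axis argument is easiest to see on a small example. A sketch with a hypothetical two-column DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# axis=1: collapse across columns -> one Boolean per row
rows_with_nan = pd.isnull(df).any(axis=1)
print(rows_with_nan.tolist())   # [False, True, True]

# axis=0 (the default): collapse across rows -> one Boolean per column
cols_with_nan = pd.isnull(df).any(axis=0)
print(cols_with_nan.tolist())   # [True, True]
```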

Note that the weight column of our DataFrame contains many null or NaN values. Next, we will explore ways of dealing with this.

If we look at the weight column in the surveys data, we notice that there are NaN (Not a Number) values. NaN values are undefined values that cannot be represented mathematically. Pandas, for example, will read an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we were to average the weight column without replacing our NaNs, Pandas would know to skip over those cells.
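This skip-over behavior can be checked directly. A sketch with a hypothetical three-value Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0])

# Pandas skips NaN by default, so the mean is (10 + 30) / 2
print(s.mean())              # 20.0

# With skipna=False the NaN propagates and the result is NaN
print(s.mean(skipna=False))  # nan
```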

Dealing with Missing Values

Dealing with missing data values is always a challenge. It's sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we're lucky, we have some metadata that will tell us more about how null values were handled.

For instance, in some disciplines, like remote sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available; Pandas will, by default, read those empty cells as NaN. However, it is good practice to get in the habit of intentionally marking cells that have no data with a no-data value! That way there are no questions in the future when you (or someone else) explore your data.
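A sentinel like -9999 can either be converted to NaN at read time with the na_values argument to read_csv, or replaced afterwards. A sketch with a made-up CSV (the values are hypothetical):

```python
import io
import pandas as pd
import numpy as np

# Hypothetical CSV where -9999 marks a missing weight
csv_text = "record_id,weight\n1,40\n2,-9999\n3,38\n"

# Option 1: tell read_csv which sentinel means "no data"
df = pd.read_csv(io.StringIO(csv_text), na_values=[-9999])
print(df["weight"].mean())   # 39.0 - the sentinel is excluded

# Option 2: replace the sentinel after loading
df2 = pd.read_csv(io.StringIO(csv_text))
df2["weight"] = df2["weight"].replace(-9999, np.nan)
print(df2["weight"].mean())  # also 39.0
```

Without either step, the -9999 would have been averaged in and badly skewed the mean.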

Where Are the NaN's?

Let's explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (i.e., select meaningful weight values):

## How many missing values are there in weight column?
len(surveys_df[pd.isnull(surveys_df.weight)])
output
3266
# How many rows have weight values?
len(surveys_df[surveys_df.weight > 0])
output
32283

We can replace all NaN values with zeroes using the .fillna() method (after making a copy of the data so we don't lose our work):

# Create a new DataFrame using copy
df1 = surveys_df.copy()

# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)

However, NaN and 0 yield different analysis results: the mean when NaN values are replaced with 0 differs from the mean when NaN values are simply skipped.

surveys_df['weight'].mean()
output
42.672428212991356
df1['weight'].mean()
output
38.751976145601844

Extra Information

We can fill NaN values with any value that we chose. The code below fills all NaN values with a mean for all weight values.

df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
df1['weight'].mean()
output
42.672428212991356

Writing Out Data to CSV

We've learned about manipulating data to get desired outputs. But we've also discussed keeping manipulated data separate from our raw data. Something we might be interested in doing is working with only the rows that have complete data. Since our earlier changes were made on the copy df1, surveys_df still holds the raw data, and we can drop rows with missing values directly:

df_na = surveys_df.dropna()

If you now type df_na, you should observe that the resulting DataFrame has 30676 rows and 9 columns, much smaller than the 35549-row original.
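By default, dropna discards a row if any column is NaN; the subset argument restricts the check to the columns you care about. A sketch on a toy DataFrame (hypothetical values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "species_id": ["NL", "DM", "PF"],
    "hindfoot_length": [32.0, np.nan, 14.0],
    "weight": [np.nan, 40.0, 7.0],
})

# Default: drop rows containing any NaN -> only the last row survives
print(len(df.dropna()))                   # 1

# subset=: only require 'weight' to be present
print(len(df.dropna(subset=["weight"])))  # 2
```

Using subset keeps more rows when only some columns matter for your analysis.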

We can now use the to_csv command to export a DataFrame in CSV format. Note that the code below will, by default, save the data into the current working directory. We can save it to a different folder by adding the folder name and a slash before the filename: df1.to_csv('foldername/out.csv'). We use index=False so that pandas doesn't include the index number for each line.

# Write DataFrame to CSV
df_na.to_csv('output/surveys_complete.csv', index=False)
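Note that to_csv raises an error if the target folder does not exist, so it can help to create it first. A minimal sketch, using a temporary directory and a made-up two-row DataFrame, that also reads the file back to confirm the round trip:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"record_id": [1, 2], "weight": [40.0, 7.0]})

# Create the output folder if it doesn't exist yet
out_dir = os.path.join(tempfile.gettempdir(), "output")
os.makedirs(out_dir, exist_ok=True)

# Write without the index, then read back to verify
out_path = os.path.join(out_dir, "surveys_complete.csv")
df.to_csv(out_path, index=False)
df_back = pd.read_csv(out_path)
print(df_back.equals(df))   # True
```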

Recap

What we've learned:

- Boolean values and how comparisons produce them
- Building a mask with isnull to locate missing (NaN) values
- Selecting rows with missing data using any(axis=1)
- Replacing NaN values with fillna, and why the fill value changes results
- Dropping incomplete rows with dropna
- Exporting a DataFrame with to_csv

Extra

We can run isnull on a particular column too. What does the code below do?

# What does this do?
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)
output
0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
         ..
35530   NaN
35543   NaN
35544   NaN
35545   NaN
35548   NaN
Name: weight, Length: 3266, dtype: float64

Let's take a minute to look at the statement above. We are using the Boolean object pd.isnull(surveys_df['weight']) as an index to surveys_df. We are asking Python to select rows that have a NaN value for weight.
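The complementary mask, pd.notnull, selects rows where a value is present. A sketch on a toy DataFrame (hypothetical weights):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"weight": [np.nan, 40.0, np.nan, 7.0]})

# Rows where weight IS missing, as in the statement above
missing = df[pd.isnull(df["weight"])]["weight"]
print(len(missing))      # 2

# The complement: rows where weight is present
present = df[pd.notnull(df["weight"])]["weight"]
print(present.tolist())  # [40.0, 7.0]
```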