Handling Missing Data
Most of the times real-world data is rarely clean and homogeneous. In many cases, dataset of interest will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.
In this module, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. We refer the missing data as null, NaN, or NA values in general.
Before we start, lets make sure the Pandas and matplotlib packages are installed.
!pip install pandas matplotlib
outputRequirement already satisfied: pandas in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (0.25.0) Requirement already satisfied: matplotlib in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (3.1.1) Requirement already satisfied: numpy>=1.13.3 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (1.17.0) Requirement already satisfied: python-dateutil>=2.6.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2.8.0) Requirement already satisfied: pytz>=2017.2 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2019.1) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (2.4.1.1) Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (0.10.0) Requirement already satisfied: kiwisolver>=1.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (1.1.0) Requirement already satisfied: six>=1.5 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0) Requirement already satisfied: setuptools in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (39.1.0)
# Now import pandas into your notebook as pd
import pandas as pd
Now again import surveys.csv
dataset into our notebook as we did in previous lesson.
surveys_df = pd.read_csv("surveys.csv")
Using masks to identify a specific condition
A mask can be useful to locate where a particular subset of values exist or
don't exist - for example, NaN, or "Not a Number" values. To understand masks,
we also need to understand BOOLEAN
objects in Python.
Boolean values include True
or False
. For example,
# set value of x to be 5
x = 5
x > 5
outputFalse
x == 5
outputTrue
Finding Missing Values
Let's identify all locations in the survey data that have
null (missing or NaN) data values. We can use the isnull
method to do this.
The isnull
method will compare each cell with a null value. If an element
has a null value, it will be assigned a value of True
in the output object.
pd.isnull(surveys_df).head()
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | True |
1 | False | False | False | False | False | False | False | False | True |
2 | False | False | False | False | False | False | False | False | True |
3 | False | False | False | False | False | False | False | False | True |
4 | False | False | False | False | False | False | False | False | True |
How to select rows with missing data
To select the rows where there are null values, we can use the mask as an index to subset our data as follows:
# To select only the rows with NaN values, we can use the 'any()' method
surveys_df[pd.isnull(surveys_df).any(axis=1)]
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
11 | 12 | 7 | 16 | 1977 | 7 | DM | M | 38.0 | NaN |
12 | 13 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
13 | 14 | 7 | 16 | 1977 | 8 | DM | NaN | NaN | NaN |
14 | 15 | 7 | 16 | 1977 | 6 | DM | F | 36.0 | NaN |
15 | 16 | 7 | 16 | 1977 | 4 | DM | F | 36.0 | NaN |
16 | 17 | 7 | 16 | 1977 | 3 | DS | F | 48.0 | NaN |
17 | 18 | 7 | 16 | 1977 | 2 | PP | M | 22.0 | NaN |
18 | 19 | 7 | 16 | 1977 | 4 | PF | NaN | NaN | NaN |
19 | 20 | 7 | 17 | 1977 | 11 | DS | F | 48.0 | NaN |
20 | 21 | 7 | 17 | 1977 | 14 | DM | F | 34.0 | NaN |
21 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN |
22 | 23 | 7 | 17 | 1977 | 13 | DM | M | 36.0 | NaN |
23 | 24 | 7 | 17 | 1977 | 13 | SH | M | 21.0 | NaN |
24 | 25 | 7 | 17 | 1977 | 9 | DM | M | 35.0 | NaN |
25 | 26 | 7 | 17 | 1977 | 15 | DM | M | 31.0 | NaN |
26 | 27 | 7 | 17 | 1977 | 15 | DM | M | 36.0 | NaN |
27 | 28 | 7 | 17 | 1977 | 11 | DM | M | 38.0 | NaN |
28 | 29 | 7 | 17 | 1977 | 11 | PP | M | NaN | NaN |
29 | 30 | 7 | 17 | 1977 | 10 | DS | F | 52.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35187 | 35188 | 11 | 10 | 2002 | 10 | NaN | NaN | NaN | NaN |
35256 | 35257 | 12 | 7 | 2002 | 22 | PB | M | 26.0 | NaN |
35259 | 35260 | 12 | 7 | 2002 | 21 | PB | F | 24.0 | NaN |
35277 | 35278 | 12 | 7 | 2002 | 20 | AH | NaN | NaN | NaN |
35279 | 35280 | 12 | 7 | 2002 | 16 | PB | M | 28.0 | NaN |
35322 | 35323 | 12 | 8 | 2002 | 11 | AH | NaN | NaN | NaN |
35328 | 35329 | 12 | 8 | 2002 | 11 | PP | M | NaN | 16.0 |
35370 | 35371 | 12 | 8 | 2002 | 14 | AH | NaN | NaN | NaN |
35378 | 35379 | 12 | 8 | 2002 | 15 | PB | F | 26.0 | NaN |
35384 | 35385 | 12 | 8 | 2002 | 10 | NaN | NaN | NaN | NaN |
35387 | 35388 | 12 | 29 | 2002 | 1 | DO | M | 35.0 | NaN |
35403 | 35404 | 12 | 29 | 2002 | 2 | NL | F | 30.0 | NaN |
35448 | 35449 | 12 | 29 | 2002 | 20 | OT | F | 20.0 | NaN |
35452 | 35453 | 12 | 29 | 2002 | 20 | PB | M | 28.0 | NaN |
35457 | 35458 | 12 | 29 | 2002 | 20 | AH | NaN | NaN | NaN |
35477 | 35478 | 12 | 29 | 2002 | 24 | AH | NaN | NaN | NaN |
35485 | 35486 | 12 | 29 | 2002 | 16 | DO | M | 37.0 | NaN |
35495 | 35496 | 12 | 31 | 2002 | 4 | PB | NaN | NaN | NaN |
35510 | 35511 | 12 | 31 | 2002 | 11 | DX | NaN | NaN | NaN |
35511 | 35512 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN |
35512 | 35513 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN |
35514 | 35515 | 12 | 31 | 2002 | 11 | SF | F | 27.0 | NaN |
35519 | 35520 | 12 | 31 | 2002 | 9 | SF | NaN | 24.0 | 36.0 |
35527 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN |
35529 | 35530 | 12 | 31 | 2002 | 13 | OT | F | 20.0 | NaN |
35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
4873 rows × 9 columns
Explaination
Notice that we have 4873
observations/rows that contain one or more missing values.
Thats roughly 14%
of data contains missing values.
We have used []
convension to select subset of data.
More information about slicing and indexing can be found out here.
(axis=1) is a numpy convention to specify columns.
Note that the weight
column of our DataFrame contains many null
or NaN
values. Next, we will explore ways of dealing with this.
If we look at the weight
column in the surveys
data we notice that there are NaN (Not a Number) values. NaN values are undefined
values that cannot be represented mathematically. Pandas, for example, will read
an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we
were to average the weight
column without replacing our NaNs, Python would know to skip
over those cells.
Dealing with missing values.
Dealing with missing data values is always a challenge. It's sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we're lucky, we have some metadata that will tell us more about how null values were handled.
For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However it is good practice to get in the habit of intentionally marking cells that have no data, with a no data value! That way there are no questions in the future when you (or someone else) explores your data.
Where Are the NaN's?
Let's explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (i.e., select meaningful weight values):
## How many missing values are there in weight column?
len(surveys_df[pd.isnull(surveys_df.weight)])
output3266
# How many rows have weight values?
len(surveys_df[surveys_df.weight> 0])
output32283
We can replace all NaN values with zeroes using the .fillna()
method (after
making a copy of the data so we don't lose our work):
# Creat a new DataFrame using copy
df1 = surveys_df.copy()
# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)
However NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.
surveys_df['weight'].mean()
output42.672428212991356
df1['weight'].mean()
output38.751976145601844
Extra Information
We can fill NaN values with any value that we chose. The code below fills all NaN values with a mean for all weight values.
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
df1['weight'].mean()
output42.672428212991356
Writing Out Data to CSV
We've learned about using manipulating data to get desired outputs. But we've also discussed keeping data that has been manipulated separate from our raw data. Something we might be interested in doing is working with only the columns that have full data. First, let's reload the data so we're not mixing up all of our previous manipulations.
df_na = surveys_df.dropna()
If you now type df_na
, you should observe that the resulting DataFrame has 30676
rows
and 9 columns, much smaller than the 35549
row original.
We can now use the to_csv
command to do export a DataFrame in CSV format. Note that the code
below will by default save the data into the current working directory. We can
save it to a different folder by adding the foldername and a slash before the filename:
df1.to_csv('foldername/out.csv')
.
We use 'index=False' so that
pandas doesn't include the index number for each line.
# Write DataFrame to CSV
df_na.to_csv('output/surveys_complete.csv', index=False)
Recap
What we've learned:
- What NaN values are, how they might be represented, and what this means for your work
- How to replace NaN values, if desired
- How to use
to_csv
to write manipulated data to a file.
Extra
We can run isnull
on a particular column too. What does the code below do?
# What does this do?
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)
output0 NaN 1 NaN 2 NaN 3 NaN 4 NaN .. 35530 NaN 35543 NaN 35544 NaN 35545 NaN 35548 NaN Name: weight, Length: 3266, dtype: float64
Let's take a minute to look at the statement above. We are using the Boolean
object pd.isnull(surveys_df['weight'])
as an index to surveys_df
. We are
asking Python to select rows that have a NaN
value of weight.