WORKSHOPS
03 Indexing, Slicing and Subsetting
In lesson 02, we read a CSV into a Python pandas DataFrame. We learned:
- How to save the DataFrame to a named object,
- How to perform basic math on the data,
- How to calculate summary statistics, and
- How to create basic plots of the data.
In this lesson, we will explore ways to access different parts of the data using:
- Indexing,
- Slicing, and
- Subsetting
Loading our data
We will continue to use the surveys dataset that we worked with in the last lesson. Let’s reopen and read in the data again:
# Make sure pandas is loaded
import pandas as pd
# Read in the survey CSV
surveys_df = pd.read_csv("data/surveys.csv")
Indexing and Slicing in Python
We often want to work with subsets of a DataFrame object. There are different ways to accomplish this including: using labels (column headings), numeric ranges, or specific x,y index locations.
Selecting data using Labels (Column Headings)
We use square brackets []
to select a subset of an Python object. For example,
we can select all data from a column named species_id
from the surveys_df
DataFrame by name. There are two ways to do this:
# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id'].head()
# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id.head()
We can also create a new object that contains only the data within the
species_id
column as follows:
# Creates an object, surveys_species, that only contains the `species_id` column
surveys_species = surveys_df['species_id']
We can pass a list of column names too, as an index to select columns in that order. This is useful when we need to reorganize our data.
NOTE: If a column name is not contained in the DataFrame, an exception (error) will be raised.
# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'plot_id']].head()
# What happens when you flip the order?
surveys_df[['plot_id', 'species_id']].head()
# What happens if you ask for a column that doesn't exist?
surveys_df['speciess']
Python tells us what type of error it is in the traceback, at the bottom it says KeyError: 'speciess'
which means that speciess
is not a column name (or Key in the related python data type dictionary).
Extracting Range based Subsets: Slicing
REMINDER: Python Uses 0-based Indexing
Let’s remind ourselves that Python uses 0-based indexing. This means that the first element in an object is located at position 0. This is different from other tools like R and Matlab that index elements within objects starting at 1.
# Create a list of numbers:
a = [1, 2, 3, 4, 5]
Challenge - Extracting data
What value does the code a[0] return?
How about this: a[5]
In the example above, calling
a[5]
returns an error. Why is that?What about a[len(a)] ?
Slicing Subsets of Rows in Python
Slicing using the []
operator selects a set of rows and/or columns from a
DataFrame. To slice out a set of rows, you use the following syntax:
data[start:stop]
. When slicing in pandas the start bound is included in the
output. The stop bound is one step BEYOND the row you want to select. So if you
want to select rows 0, 1 and 2 your code would look like this:
# Select rows 0, 1, 2 (row 3 is not selected)
surveys_df[0:3]
The stop bound in Python is different from what you might be used to in languages like Matlab and R.
# Select the first 5 rows (rows 0, 1, 2, 3, 4)
surveys_df[:5]
# Select the last element in the list
# (the slice starts at the last element, and ends at the end of the list)
surveys_df[-1:]
We can also reassign values within subsets of our DataFrame.
Let’s create a brand new clean dataframe from the original data CSV file.
surveys_df = pd.read_csv("data/surveys.csv")
Slicing Subsets of Rows and Columns in Python
We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.
loc
is primarily label based indexing. Integers may be used but they are interpreted as a label.iloc
is primarily integer based indexing
To select a subset of rows and columns from our DataFrame, we can use the
iloc
method. For example, we can select month, day and year (columns 2, 3
and 4 if we start counting at 1), like this:
# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]
which gives the output
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977
Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.
Let’s explore some other ways to index and select subsets of data:
# Select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]
# What does this do?
surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]
# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]
NOTE: Labels must be found in the DataFrame or you will get a KeyError
.
Indexing by labels loc
differs from indexing by integers iloc
.
With loc
, the both start bound and the stop bound are inclusive. When using
loc
, integers can be used, but the integers refer to the
index label and not the position. For example, using loc
and select 1:4
will get a different result than using iloc
to select rows 1:4.
We can also select a specific data value using a row and
column location within the DataFrame and iloc
indexing:
# Syntax for iloc indexing to finding a specific data element
dat.iloc[row, column]
In this iloc
example,
surveys_df.iloc[2, 6]
gives the output
'F'
Remember that Python indexing begins at 0. So, the index location [2, 6] selects the element that is 3 rows down and 7 columns over in the DataFrame.
Challenge - Range
What happens when you execute:
surveys_df[0:1]
surveys_df[:4]
surveys_df[:-1]
What happens when you call:
surveys_df.iloc[0:4, 1:4]
Subsetting Data using Criteria
We can also select a subset of our data using criteria. For example, we can select all rows that have a year value of 2002:
surveys_df[surveys_df.year == 2002].head()
Which produces the following output:
record_id month day year plot_id species_id sex hindfoot_length weight
33320 33321 1 12 2002 1 DM M 38 44
33321 33322 1 12 2002 1 DO M 37 58
33322 33323 1 12 2002 1 PB M 28 45
33323 33324 1 12 2002 1 AB NaN NaN NaN
33324 33325 1 12 2002 1 DO M 35 29
...
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15 14
35547 35548 12 31 2002 7 DO M 36 51
35548 35549 12 31 2002 5 NaN NaN NaN NaN
[2229 rows x 9 columns]
Or we can select all rows that do not contain the year 2002:
surveys_df[surveys_df.year != 2002]
We can define sets of criteria too:
surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]
Python Syntax Cheat Sheet
Use can use the syntax below when querying data by criteria from a DataFrame. Experiment with selecting various subsets of the “surveys” data.
- Equals:
==
- Not equals:
!=
- Greater than, less than:
>
or<
- Greater than or equal to
>=
- Less than or equal to
<=
Challenge - Queries
Select a subset of rows in the
surveys_df
DataFrame that contain data from the year 1999 and that contain weight values less than or equal to 8. How many rows did you end up with? What did your neighbor get?(Extra) You can use the
isin
command in Python to query a DataFrame based upon a list of values as follows:
surveys_df[surveys_df['species_id'].isin([listGoesHere])]
Use the isin
function to find all plots that contain particular species
in the “surveys” DataFrame. How many records contain these values?
(Extra) Experiment with other queries. Create a query that finds all rows with a weight value > or equal to 0.
(Extra) The
~
symbol in Python can be used to return the OPPOSITE of the selection that you specify in Python. It is equivalent to is not in. Write a query that selects all rows with sex NOT equal to ’M’ or ‘F’ in the “surveys” data.
Using masks to identify a specific condition
A mask can be useful to locate where a particular subset of values exist or
don’t exist - for example, NaN, or “Not a Number” values. To understand masks,
we also need to understand BOOLEAN
objects in Python.
Boolean values include True
or False
. For example,
# Set x to 5
x = 5
# What does the code below return?
x > 5
# How about this?
x == 5
Missing Values
Let’s try this out. Let’s identify all locations in the survey data that have
null (missing or NaN) data values. We can use the isnull
method to do this.
The isnull
method will compare each cell with a null value. If an element
has a null value, it will be assigned a value of True
in the output object.
pd.isnull(surveys_df).head()
A snippet of the output is below:
record_id month day year plot_id species_id sex hindfoot_length weight
0 False False False False False False False False True
1 False False False False False False False False True
2 False False False False False False False False True
3 False False False False False False False False True
4 False False False False False False False False True
[35549 rows x 9 columns]
To select the rows where there are null values, we can use the mask as an index to subset our data as follows:
# To select just the rows with NaN values, we can use the 'any()' method
surveys_df[pd.isnull(surveys_df).any(axis=1)]
(axis=1) is a numpy convention to specify columns.
Note that the weight
column of our DataFrame contains many null
or NaN
values. Next, we will explore ways of dealing with this.
If we look at the weight
column in the surveys
data we notice that there are NaN (Not a Number) values. NaN values are undefined
values that cannot be represented mathematically. Pandas, for example, will read
an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we
were to average the weight
column without replacing our NaNs, Python would know to skip
over those cells.
surveys_df['weight'].mean()
42.672428212991356
Dealing with missing data values is always a challenge. It’s sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we’re lucky, we have some metadata that will tell us more about how null values were handled.
For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However it is good practice to get in the habit of intentionally marking cells that have no data, with a no data value! That way there are no questions in the future when you (or someone else) explores your data.
Where Are the NaN’s?
Let’s explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (i.e., select meaningful weight values):
len(surveys_df[pd.isnull(surveys_df.weight)])
# How many rows have weight values?
len(surveys_df[surveys_df.weight> 0])
We can replace all NaN values with zeroes using the .fillna()
method (after
making a copy of the data so we don’t lose our work):
# Creat a new DataFrame using copy
df1 = surveys_df.copy()
# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)
However NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.
df1['weight'].mean()
38.751976145601844
Extra Information
We can fill NaN values with any value that we chose. The code below fills all NaN values with a mean for all weight values.
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
Writing Out Data to CSV
We’ve learned about using manipulating data to get desired outputs. But we’ve also discussed keeping data that has been manipulated separate from our raw data. Something we might be interested in doing is working with only the columns that have full data. First, let’s reload the data so we’re not mixing up all of our previous manipulations.
surveys_df = pd.read_csv("data/surveys.csv")
Next, let’s drop all the rows that contain missing values. We will use the command dropna
.
By default, dropna removes columns that contain missing data for even just one row.
df_na = df.dropna()
If you now type df_na
, you should observe that the resulting DataFrame has 30676 rows
and 9 columns, much smaller than the 35549 row original.
We can now use the to_csv
command to do export a DataFrame in CSV format. Note that the code
below will by default save the data into the current working directory. We can
save it to a different folder by adding the foldername and a slash before the filename:
df1.to_csv('foldername/out.csv')
. We use ‘index=False’ so that
pandas doesn’t include the index number for each line.
# Write DataFrame to CSV
df_na.to_csv('data_output/surveys_complete.csv', index=False)
Recap
What we’ve learned:
- How to subset and index the dataframe
- What NaN values are, how they might be represented, and what this means for your work
- How to replace NaN values, if desired
- How to use
to_csv
to write manipulated data to a file.
Extra
We can run isnull
on a particular column too. What does the code below do?
# What does this do?
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)
Let’s take a minute to look at the statement above. We are using the Boolean
object pd.isnull(surveys_df['weight'])
as an index to surveys_df
. We are
asking Python to select rows that have a NaN
value of weight.
Extra Challenges - Putting it all together
Create a new DataFrame that only contains observations with sex values that are not female or male. Assign each sex value in the new DataFrame to a new value of ‘x’. Determine the number of null values in the subset.
Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot with male vs female values stacked for each plot.
- Count the number of missing values per column. Hint: The method .count() gives you the number of non-NA observations per column.