Indexing, Slicing and Subsetting

In this lesson, we will explore ways to access different parts of the data in a Pandas DataFrame using:

Indexing,
Slicing, and
Subsetting

Ensure the Pandas package is installed

!pip install pandas matplotlib

output

Requirement already satisfied: pandas in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (0.25.0)
Requirement already satisfied: matplotlib in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (3.1.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2.8.0)
Requirement already satisfied: pytz>=2017.2 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2019.1)
Requirement already satisfied: numpy>=1.13.3 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (1.17.0)
Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (2.4.1.1)
Requirement already satisfied: six>=1.5 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)
Requirement already satisfied: setuptools in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (39.1.0)

Loading our data

We will continue to use the surveys dataset that we worked with in the last lesson. Let's reopen and read in the data again:

# Make sure pandas is loaded
import pandas as pd

# Read in the survey CSV
surveys_df = pd.read_csv("surveys.csv")

Indexing and Slicing in Python

We often want to work with subsets of a DataFrame object. There are different ways to accomplish this including: using labels (column headings), numeric ranges, or specific x,y index locations.

Selecting data using Labels (Column Headings)

We use square brackets [] to select a subset of an Python object. For example, we can select all data from a column named species_id from the surveys_df DataFrame by name. There are two ways to do this:

# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id'].head()

output

0    NL
1    NL
2    DM
3    DM
4    DM
Name: species_id, dtype: object

# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id.head()

output

0    NL
1    NL
2    DM
3    DM
4    DM
Name: species_id, dtype: object

We can also create a new object that contains only the data within the species_id column as follows:

# Creates an object, surveys_species, that only contains the `species_id` column
surveys_species = surveys_df['species_id']

We can pass a list of column names too, as an index to select columns in that order. This is useful when we need to reorganize our data.

NOTE: If a column name is not contained in the DataFrame, an exception (error) will be raised.

# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'site_id']].head()

	species_id	site_id
0	NL	2
1	NL	3
2	DM	2
3	DM	7
4	DM	3

What happens if you ask for a column that doesn't exist?

surveys_df['speciess']

Outputs:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Applications/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2392             try:
-> 2393                 return self._engine.get_loc(key)
   2394             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()

KeyError: 'speciess'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-7-7d65fa0158b8> in <module>()
      1 
      2 # What happens if you ask for a column that doesn't exist?
----> 3 surveys_df['speciess']
      4 

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2060             return self._getitem_multilevel(key)
   2061         else:
-> 2062             return self._getitem_column(key)
   2063 
   2064     def _getitem_column(self, key):

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2067         # get column
   2068         if self.columns.is_unique:
-> 2069             return self._get_item_cache(key)
   2070 
   2071         # duplicate columns & possible reduce dimensionality

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1532         res = cache.get(item)
   1533         if res is None:
-> 1534             values = self._data.get(item)
   1535             res = self._box_item_values(item, values)
   1536             cache[item] = res

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2393                 return self._engine.get_loc(key)
   2394             except KeyError:
-> 2395                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2396 
   2397         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()

KeyError: 'speciess'

Python tells us what type of error it is in the traceback, at the bottom it says KeyError: 'speciess' which means that speciess is not a column name (or Key in the related python data type dictionary).

# What happens when you flip the order?
surveys_df[['site_id', 'species_id']].head()

	site_id	species_id
0	2	NL
1	3	NL
2	2	DM
3	7	DM
4	3	DM

Extracting Range based Subsets: Slicing

REMINDER: Python Uses 0-based Indexing

Let's remind ourselves that Python uses 0-based indexing. This means that the first element in an object is located at position 0. This is different from other tools like R and Matlab that index elements within objects starting at 1.

# Create a list of numbers:
a = [1, 2, 3, 4, 5]

indexing diagram slicing diagram

Challenge - Extracting data

What value does the code a[0] return?
How about this: a[5]
In the example above, calling a[5] returns an error. Why is that?
What about a[len(a)] ?

Solutions - Extracting data

Slicing Subsets of Rows in Python

Slicing using the [] operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following syntax: data[start:stop]. When slicing in pandas the start bound is included in the output. The stop bound is one step BEYOND the row you want to select. So if you want to select rows 0, 1 and 2 your code would look like this with our surveys data:

# Select rows 0, 1, 2 (row 3 is not selected)
surveys_df[0:3]

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN

The stop bound in Python is different from what you might be used to in languages like Matlab and R.

Now lets select the first 5 rows (rows 0, 1, 2, 3, 4).

surveys_df[:5]

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN

# Select the last element in the list
# (the slice starts at the last element, and ends at the end of the list)
surveys_df[-1:]

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

We can also reassign values within subsets of our DataFrame.

Let's create a brand new clean dataframe from the original data CSV file.

surveys_df = pd.read_csv("surveys.csv")

Slicing Subsets of Rows and Columns in Python

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

loc is primarily label based indexing. Integers may be used but they are interpreted as a label.
iloc is primarily integer based indexing

To select a subset of rows and columns from our DataFrame, we can use the iloc method. For example, we can select month, day and year (columns 2, 3 and 4 if we start counting at 1), like this:

iloc[row slicing, column slicing]

surveys_df.iloc[0:3, 1:4]

	month	day	year
0	7	16	1977
1	7	16	1977
2	7	16	1977

Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.

Let's explore some other ways to index and select subsets of data:

# Select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
10	11	7	16	1977	5	DS	F	53.0	NaN

# What does this do?
surveys_df.loc[0, ['species_id', 'site_id', 'weight']]

output

species_id     NL
site_id         2
weight        NaN
Name: 0, dtype: object

# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]

output

/Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages/pandas/core/indexing.py:1404: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
0	1.0	7.0	16.0	1977.0	2.0	NL	M	32.0	NaN
10	11.0	7.0	16.0	1977.0	5.0	DS	F	53.0	NaN
35549	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

NOTE: Labels must be found in the DataFrame or you will get a KeyError.

Indexing by labels loc differs from indexing by integers iloc. With loc, the both start bound and the stop bound are inclusive. When using loc, integers can be used, but the integers refer to the index label and not the position. For example, using loc and select 1:4 will get a different result than using iloc to select rows 1:4.

We can also select a specific data value using a row and column location within the DataFrame and iloc indexing:

# Syntax for iloc indexing to finding a specific data element
dat.iloc[row, column]

In following iloc example:

surveys_df.iloc[2, 6]

output

'F'

Remember that Python indexing begins at 0. So, the index location [2, 6] selects the element that is 3 rows down and 7 columns over in the DataFrame.

Challenge - Range

What happens when you execute:
- surveys_df[0:1]
- surveys_df[:4]
- surveys_df[:-1]
What happens when you call:
- surveys_df.iloc[0:4, 1:4]

Subsetting Data using Criteria

We can also select a subset of our data using criteria. For example, we can select all rows that have a year value of 2002:

surveys_df[surveys_df.year == 2002].head()

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
33320	33321	1	12	2002	1	DM	M	38.0	44.0
33321	33322	1	12	2002	1	DO	M	37.0	58.0
33322	33323	1	12	2002	1	PB	M	28.0	45.0
33323	33324	1	12	2002	1	AB	NaN	NaN	NaN
33324	33325	1	12	2002	1	DO	M	35.0	29.0

Or we can select all rows that do not contain the year 2002:

surveys_df[surveys_df.year != 2002]

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
5	6	7	16	1977	1	PF	M	14.0	NaN
6	7	7	16	1977	2	PE	F	NaN	NaN
7	8	7	16	1977	1	DM	M	37.0	NaN
8	9	7	16	1977	1	DM	F	34.0	NaN
9	10	7	16	1977	6	PF	F	20.0	NaN
10	11	7	16	1977	5	DS	F	53.0	NaN
11	12	7	16	1977	7	DM	M	38.0	NaN
12	13	7	16	1977	3	DM	M	35.0	NaN
13	14	7	16	1977	8	DM	NaN	NaN	NaN
14	15	7	16	1977	6	DM	F	36.0	NaN
15	16	7	16	1977	4	DM	F	36.0	NaN
16	17	7	16	1977	3	DS	F	48.0	NaN
17	18	7	16	1977	2	PP	M	22.0	NaN
18	19	7	16	1977	4	PF	NaN	NaN	NaN
19	20	7	17	1977	11	DS	F	48.0	NaN
20	21	7	17	1977	14	DM	F	34.0	NaN
21	22	7	17	1977	15	NL	F	31.0	NaN
22	23	7	17	1977	13	DM	M	36.0	NaN
23	24	7	17	1977	13	SH	M	21.0	NaN
24	25	7	17	1977	9	DM	M	35.0	NaN
25	26	7	17	1977	15	DM	M	31.0	NaN
26	27	7	17	1977	15	DM	M	36.0	NaN
27	28	7	17	1977	11	DM	M	38.0	NaN
28	29	7	17	1977	11	PP	M	NaN	NaN
29	30	7	17	1977	10	DS	F	52.0	NaN
...	...	...	...	...	...	...	...	...	...
33290	33291	12	15	2001	23	PE	M	20.0	18.0
33291	33292	12	15	2001	23	RM	F	16.0	8.0
33292	33293	12	15	2001	20	PE	F	20.0	22.0
33293	33294	12	15	2001	20	SH	M	25.0	43.0
33294	33295	12	15	2001	20	PB	F	27.0	33.0
33295	33296	12	15	2001	20	PB	M	25.0	35.0
33296	33297	12	15	2001	20	RM	M	16.0	11.0
33297	33298	12	15	2001	20	RM	F	16.0	8.0
33298	33299	12	15	2001	20	PB	F	25.0	28.0
33299	33300	12	15	2001	20	PB	F	26.0	30.0
33300	33301	12	15	2001	20	PB	F	27.0	31.0
33301	33302	12	15	2001	24	PE	M	20.0	24.0
33302	33303	12	15	2001	24	PE	M	20.0	23.0
33303	33304	12	15	2001	24	RM	M	16.0	10.0
33304	33305	12	15	2001	7	PB	M	29.0	44.0
33305	33306	12	15	2001	7	OT	M	19.0	21.0
33306	33307	12	15	2001	7	OT	M	20.0	19.0
33307	33308	12	15	2001	7	PP	M	24.0	16.0
33308	33309	12	16	2001	3	NaN	NaN	NaN	NaN
33309	33310	12	16	2001	4	NaN	NaN	NaN	NaN
33310	33311	12	16	2001	5	NaN	NaN	NaN	NaN
33311	33312	12	16	2001	6	NaN	NaN	NaN	NaN
33312	33313	12	16	2001	8	NaN	NaN	NaN	NaN
33313	33314	12	16	2001	9	NaN	NaN	NaN	NaN
33314	33315	12	16	2001	10	NaN	NaN	NaN	NaN
33315	33316	12	16	2001	11	NaN	NaN	NaN	NaN
33316	33317	12	16	2001	13	NaN	NaN	NaN	NaN
33317	33318	12	16	2001	14	NaN	NaN	NaN	NaN
33318	33319	12	16	2001	15	NaN	NaN	NaN	NaN
33319	33320	12	16	2001	16	NaN	NaN	NaN	NaN

33320 rows × 9 columns

We can define sets of criteria too:

surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
2270	2271	1	15	1980	8	DO	M	35.0	53.0
2271	2272	1	15	1980	11	PF	F	16.0	10.0
2272	2273	1	15	1980	18	DM	F	34.0	33.0
2273	2274	1	15	1980	11	DM	M	38.0	37.0
2274	2275	1	15	1980	8	DO	F	33.0	29.0
2275	2276	1	15	1980	11	DS	M	47.0	132.0
2276	2277	1	15	1980	8	PF	M	15.0	8.0
2277	2278	1	15	1980	9	OT	M	21.0	23.0
2278	2279	1	15	1980	11	DM	F	36.0	36.0
2279	2280	1	15	1980	21	OT	F	20.0	21.0
2280	2281	1	15	1980	11	OL	M	20.0	29.0
2281	2282	1	15	1980	17	DM	F	36.0	49.0
2282	2283	1	15	1980	11	OL	M	21.0	23.0
2283	2284	1	15	1980	9	OL	M	20.0	32.0
2284	2285	1	15	1980	10	OL	F	20.0	24.0
2285	2286	1	15	1980	11	DM	M	38.0	47.0
2286	2287	1	15	1980	21	OT	M	19.0	22.0
2287	2288	1	15	1980	19	RM	F	17.0	12.0
2288	2289	1	15	1980	20	DS	F	52.0	150.0
2289	2290	1	15	1980	11	DM	M	37.0	49.0
2290	2291	1	15	1980	9	OL	F	21.0	34.0
2291	2292	1	15	1980	12	DM	F	35.0	40.0
2292	2293	1	15	1980	18	DS	F	51.0	132.0
2293	2294	1	15	1980	22	DM	F	34.0	25.0
2294	2295	1	15	1980	9	OL	M	21.0	36.0
2295	2296	1	15	1980	8	DO	F	34.0	50.0
2296	2297	1	15	1980	11	DM	M	37.0	45.0
2297	2298	1	15	1980	17	DM	M	35.0	47.0
2298	2299	1	15	1980	9	DM	M	38.0	46.0
2299	2300	1	15	1980	18	DM	F	32.0	29.0
...	...	...	...	...	...	...	...	...	...
11197	11198	12	8	1985	4	DS	M	45.0	129.0
11198	11199	12	8	1985	8	DM	F	38.0	42.0
11199	11200	12	8	1985	7	AB	NaN	NaN	NaN
11200	11201	12	8	1985	5	OL	M	21.0	29.0
11201	11202	12	8	1985	9	DM	F	35.0	39.0
11202	11203	12	8	1985	7	PE	F	17.0	19.0
11203	11204	12	8	1985	3	PP	F	22.0	16.0
11204	11205	12	8	1985	5	DO	M	37.0	56.0
11205	11206	12	8	1985	11	DM	F	38.0	38.0
11206	11207	12	8	1985	2	PE	M	18.0	19.0
11207	11208	12	8	1985	8	DS	F	50.0	120.0
11208	11209	12	8	1985	2	DO	F	37.0	52.0
11209	11210	12	8	1985	2	DM	F	35.0	40.0
11210	11211	12	8	1985	13	DM	M	37.0	45.0
11211	11212	12	8	1985	4	DS	NaN	NaN	121.0
11212	11213	12	8	1985	13	AH	NaN	NaN	NaN
11213	11214	12	8	1985	1	DM	F	37.0	44.0
11214	11215	12	8	1985	2	NL	F	32.0	160.0
11215	11216	12	8	1985	3	RM	M	17.0	9.0
11216	11217	12	8	1985	4	OL	M	24.0	34.0
11217	11218	12	8	1985	9	DM	F	36.0	39.0
11218	11219	12	8	1985	8	DM	F	38.0	41.0
11219	11220	12	8	1985	5	DO	F	37.0	56.0
11220	11221	12	8	1985	13	AH	NaN	NaN	NaN
11221	11222	12	8	1985	7	AB	NaN	NaN	NaN
11222	11223	12	8	1985	4	DM	M	36.0	40.0
11223	11224	12	8	1985	11	DM	M	37.0	49.0
11224	11225	12	8	1985	7	PE	M	20.0	18.0
11225	11226	12	8	1985	1	DM	M	38.0	47.0
11226	11227	12	8	1985	15	NaN	NaN	NaN	NaN

8957 rows × 9 columns

Python Syntax Cheat Sheet

Use can use the syntax below when querying data by criteria from a DataFrame. Experiment with selecting various subsets of the "surveys" data.

Equals: ==
Not equals: !=
Greater than, less than: > or <
Greater than or equal to >=
Less than or equal to <=

Challenge - Queries

Select a subset of rows in the surveys_df DataFrame that contain data from the year 1999 and that contain weight values less than or equal to 8. How many rows did you end up with? What did your neighbor get?
(Extra) Use the isin function to find all plots that contain PB and PL species in the "surveys" DataFrame. How many records contain these values?

You can use the isin command in Python to query a DataFrame based upon a list of values as follows:

surveys_df[surveys_df['species_id'].isin([listGoesHere])]

Extra Challenges

(Extra) Create a query that finds all rows with a weight value greater than (>) or equal to 0.
(Extra) The ~ symbol in Python can be used to return the OPPOSITE of the selection that you specify in Python. It is equivalent to is not in. Write a query that selects all rows with sex NOT equal to 'M' or 'F' in the "surveys" data.

Using masks to identify a specific condition

A mask can be useful to locate where a particular subset of values exist or don't exist - for example, NaN, or "Not a Number" values. To understand masks, we also need to understand BOOLEAN objects in Python.

Boolean values include True or False. For example,

# Set x to 5
x = 5

# What does the code below return?
x > 5

output

False

# How about this?
x == 5

output

True

Extra Challenges - Putting it all together

Create a new DataFrame that only contains observations with sex values that are not female or male. Assign each sex value in the new DataFrame to a new value of 'x'. Determine the number of null values in the subset.
Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot with male vs female values stacked for each plot.
Count the number of missing values per column. Hint: The method .count() gives you the number of non-NA observations per column.

# Solution extra challenge 2

# selection of the data with isin
stack_selection = surveys_df[(surveys_df['sex'].isin(['M', 'F'])) &
                            surveys_df["weight"] > 0.][["sex", "weight", "site_id"]]

# calculate the mean weight for each site id and sex combination:
stack_selection = stack_selection.groupby(["site_id", "sex"]).mean().unstack()

# Plot inside jupyter notebook
%matplotlib inline

# and we can make a stacked bar plot from this:
stack_selection.plot(kind='bar', stacked=True)

output

png

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search