Indexing, Slicing and Subsetting

In this lesson, we will explore ways to access different parts of the data in a Pandas DataFrame using:

Ensure the Pandas package is installed

!pip install pandas matplotlib
output
Requirement already satisfied: pandas in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (0.25.0) Requirement already satisfied: matplotlib in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (3.1.1) Requirement already satisfied: python-dateutil>=2.6.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2.8.0) Requirement already satisfied: pytz>=2017.2 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2019.1) Requirement already satisfied: numpy>=1.13.3 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (1.17.0) Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (0.10.0) Requirement already satisfied: kiwisolver>=1.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (1.1.0) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (2.4.1.1) Requirement already satisfied: six>=1.5 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0) Requirement already satisfied: setuptools in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (39.1.0)

Loading our data

We will continue to use the surveys dataset that we worked with in the last lesson. Let's reopen and read in the data again:

# Make sure pandas is loaded
import pandas as pd

# Read in the survey CSV
surveys_df = pd.read_csv("surveys.csv")

Indexing and Slicing in Python

We often want to work with subsets of a DataFrame object. There are different ways to accomplish this including: using labels (column headings), numeric ranges, or specific x,y index locations.

Selecting data using Labels (Column Headings)

We use square brackets [] to select a subset of an Python object. For example, we can select all data from a column named species_id from the surveys_df DataFrame by name. There are two ways to do this:

# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id'].head()
output
0 NL 1 NL 2 DM 3 DM 4 DM Name: species_id, dtype: object
# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id.head()
output
0 NL 1 NL 2 DM 3 DM 4 DM Name: species_id, dtype: object

We can also create a new object that contains only the data within the species_id column as follows:

# Creates an object, surveys_species, that only contains the `species_id` column
surveys_species = surveys_df['species_id']

We can pass a list of column names too, as an index to select columns in that order. This is useful when we need to reorganize our data.

NOTE: If a column name is not contained in the DataFrame, an exception (error) will be raised.

# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'site_id']].head()
species_id site_id
0 NL 2
1 NL 3
2 DM 2
3 DM 7
4 DM 3

What happens if you ask for a column that doesn't exist?

surveys_df['speciess']

Outputs:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Applications/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2392             try:
-> 2393                 return self._engine.get_loc(key)
   2394             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()

KeyError: 'speciess'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-7-7d65fa0158b8> in <module>()
      1 
      2 # What happens if you ask for a column that doesn't exist?
----> 3 surveys_df['speciess']
      4 

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2060             return self._getitem_multilevel(key)
   2061         else:
-> 2062             return self._getitem_column(key)
   2063 
   2064     def _getitem_column(self, key):

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2067         # get column
   2068         if self.columns.is_unique:
-> 2069             return self._get_item_cache(key)
   2070 
   2071         # duplicate columns & possible reduce dimensionality

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1532         res = cache.get(item)
   1533         if res is None:
-> 1534             values = self._data.get(item)
   1535             res = self._box_item_values(item, values)
   1536             cache[item] = res

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Applications/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2393                 return self._engine.get_loc(key)
   2394             except KeyError:
-> 2395                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2396 
   2397         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()

KeyError: 'speciess'

Python tells us what type of error it is in the traceback, at the bottom it says KeyError: 'speciess' which means that speciess is not a column name (or Key in the related python data type dictionary).

# What happens when you flip the order?
surveys_df[['site_id', 'species_id']].head()
site_id species_id
0 2 NL
1 3 NL
2 2 DM
3 7 DM
4 3 DM

Extracting Range based Subsets: Slicing

REMINDER: Python Uses 0-based Indexing

Let's remind ourselves that Python uses 0-based indexing. This means that the first element in an object is located at position 0. This is different from other tools like R and Matlab that index elements within objects starting at 1.

# Create a list of numbers:
a = [1, 2, 3, 4, 5]

indexing diagram slicing diagram

Challenge - Extracting data

  1. What value does the code a[0] return?

  2. How about this: a[5]

  3. In the example above, calling a[5] returns an error. Why is that?

  4. What about a[len(a)] ?

Solutions - Extracting data

Slicing Subsets of Rows in Python

Slicing using the [] operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following syntax: data[start:stop]. When slicing in pandas the start bound is included in the output. The stop bound is one step BEYOND the row you want to select. So if you want to select rows 0, 1 and 2 your code would look like this with our surveys data:

# Select rows 0, 1, 2 (row 3 is not selected)
surveys_df[0:3]
record_id month day year site_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN

The stop bound in Python is different from what you might be used to in languages like Matlab and R.

Now lets select the first 5 rows (rows 0, 1, 2, 3, 4).

surveys_df[:5]
record_id month day year site_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
# Select the last element in the list
# (the slice starts at the last element, and ends at the end of the list)
surveys_df[-1:]
record_id month day year site_id species_id sex hindfoot_length weight
35548 35549 12 31 2002 5 NaN NaN NaN NaN

We can also reassign values within subsets of our DataFrame.

Let's create a brand new clean dataframe from the original data CSV file.

surveys_df = pd.read_csv("surveys.csv")

Slicing Subsets of Rows and Columns in Python

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

To select a subset of rows and columns from our DataFrame, we can use the iloc method. For example, we can select month, day and year (columns 2, 3 and 4 if we start counting at 1), like this:

iloc[row slicing, column slicing]

surveys_df.iloc[0:3, 1:4]
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977

Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.

Let's explore some other ways to index and select subsets of data:

# Select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]
record_id month day year site_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
10 11 7 16 1977 5 DS F 53.0 NaN
# What does this do?
surveys_df.loc[0, ['species_id', 'site_id', 'weight']]
output
species_id NL site_id 2 weight NaN Name: 0, dtype: object
# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]
output
/Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages/pandas/core/indexing.py:1404: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike return self._getitem_tuple(key)
record_id month day year site_id species_id sex hindfoot_length weight
0 1.0 7.0 16.0 1977.0 2.0 NL M 32.0 NaN
10 11.0 7.0 16.0 1977.0 5.0 DS F 53.0 NaN
35549 NaN NaN NaN NaN NaN NaN NaN NaN NaN

NOTE: Labels must be found in the DataFrame or you will get a KeyError.

Indexing by labels loc differs from indexing by integers iloc. With loc, the both start bound and the stop bound are inclusive. When using loc, integers can be used, but the integers refer to the index label and not the position. For example, using loc and select 1:4 will get a different result than using iloc to select rows 1:4.

We can also select a specific data value using a row and column location within the DataFrame and iloc indexing:

# Syntax for iloc indexing to finding a specific data element
dat.iloc[row, column]

In following iloc example:

surveys_df.iloc[2, 6]
output
'F'

Remember that Python indexing begins at 0. So, the index location [2, 6] selects the element that is 3 rows down and 7 columns over in the DataFrame.

Challenge - Range

  1. What happens when you execute:

    • surveys_df[0:1]
    • surveys_df[:4]
    • surveys_df[:-1]
  2. What happens when you call:

    • surveys_df.iloc[0:4, 1:4]

Subsetting Data using Criteria

We can also select a subset of our data using criteria. For example, we can select all rows that have a year value of 2002:

surveys_df[surveys_df.year == 2002].head()
record_id month day year site_id species_id sex hindfoot_length weight
33320 33321 1 12 2002 1 DM M 38.0 44.0
33321 33322 1 12 2002 1 DO M 37.0 58.0
33322 33323 1 12 2002 1 PB M 28.0 45.0
33323 33324 1 12 2002 1 AB NaN NaN NaN
33324 33325 1 12 2002 1 DO M 35.0 29.0

Or we can select all rows that do not contain the year 2002:

surveys_df[surveys_df.year != 2002]
record_id month day year site_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN
6 7 7 16 1977 2 PE F NaN NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
9 10 7 16 1977 6 PF F 20.0 NaN
10 11 7 16 1977 5 DS F 53.0 NaN
11 12 7 16 1977 7 DM M 38.0 NaN
12 13 7 16 1977 3 DM M 35.0 NaN
13 14 7 16 1977 8 DM NaN NaN NaN
14 15 7 16 1977 6 DM F 36.0 NaN
15 16 7 16 1977 4 DM F 36.0 NaN
16 17 7 16 1977 3 DS F 48.0 NaN
17 18 7 16 1977 2 PP M 22.0 NaN
18 19 7 16 1977 4 PF NaN NaN NaN
19 20 7 17 1977 11 DS F 48.0 NaN
20 21 7 17 1977 14 DM F 34.0 NaN
21 22 7 17 1977 15 NL F 31.0 NaN
22 23 7 17 1977 13 DM M 36.0 NaN
23 24 7 17 1977 13 SH M 21.0 NaN
24 25 7 17 1977 9 DM M 35.0 NaN
25 26 7 17 1977 15 DM M 31.0 NaN
26 27 7 17 1977 15 DM M 36.0 NaN
27 28 7 17 1977 11 DM M 38.0 NaN
28 29 7 17 1977 11 PP M NaN NaN
29 30 7 17 1977 10 DS F 52.0 NaN
... ... ... ... ... ... ... ... ... ...
33290 33291 12 15 2001 23 PE M 20.0 18.0
33291 33292 12 15 2001 23 RM F 16.0 8.0
33292 33293 12 15 2001 20 PE F 20.0 22.0
33293 33294 12 15 2001 20 SH M 25.0 43.0
33294 33295 12 15 2001 20 PB F 27.0 33.0
33295 33296 12 15 2001 20 PB M 25.0 35.0
33296 33297 12 15 2001 20 RM M 16.0 11.0
33297 33298 12 15 2001 20 RM F 16.0 8.0
33298 33299 12 15 2001 20 PB F 25.0 28.0
33299 33300 12 15 2001 20 PB F 26.0 30.0
33300 33301 12 15 2001 20 PB F 27.0 31.0
33301 33302 12 15 2001 24 PE M 20.0 24.0
33302 33303 12 15 2001 24 PE M 20.0 23.0
33303 33304 12 15 2001 24 RM M 16.0 10.0
33304 33305 12 15 2001 7 PB M 29.0 44.0
33305 33306 12 15 2001 7 OT M 19.0 21.0
33306 33307 12 15 2001 7 OT M 20.0 19.0
33307 33308 12 15 2001 7 PP M 24.0 16.0
33308 33309 12 16 2001 3 NaN NaN NaN NaN
33309 33310 12 16 2001 4 NaN NaN NaN NaN
33310 33311 12 16 2001 5 NaN NaN NaN NaN
33311 33312 12 16 2001 6 NaN NaN NaN NaN
33312 33313 12 16 2001 8 NaN NaN NaN NaN
33313 33314 12 16 2001 9 NaN NaN NaN NaN
33314 33315 12 16 2001 10 NaN NaN NaN NaN
33315 33316 12 16 2001 11 NaN NaN NaN NaN
33316 33317 12 16 2001 13 NaN NaN NaN NaN
33317 33318 12 16 2001 14 NaN NaN NaN NaN
33318 33319 12 16 2001 15 NaN NaN NaN NaN
33319 33320 12 16 2001 16 NaN NaN NaN NaN

33320 rows × 9 columns

We can define sets of criteria too:

surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]
record_id month day year site_id species_id sex hindfoot_length weight
2270 2271 1 15 1980 8 DO M 35.0 53.0
2271 2272 1 15 1980 11 PF F 16.0 10.0
2272 2273 1 15 1980 18 DM F 34.0 33.0
2273 2274 1 15 1980 11 DM M 38.0 37.0
2274 2275 1 15 1980 8 DO F 33.0 29.0
2275 2276 1 15 1980 11 DS M 47.0 132.0
2276 2277 1 15 1980 8 PF M 15.0 8.0
2277 2278 1 15 1980 9 OT M 21.0 23.0
2278 2279 1 15 1980 11 DM F 36.0 36.0
2279 2280 1 15 1980 21 OT F 20.0 21.0
2280 2281 1 15 1980 11 OL M 20.0 29.0
2281 2282 1 15 1980 17 DM F 36.0 49.0
2282 2283 1 15 1980 11 OL M 21.0 23.0
2283 2284 1 15 1980 9 OL M 20.0 32.0
2284 2285 1 15 1980 10 OL F 20.0 24.0
2285 2286 1 15 1980 11 DM M 38.0 47.0
2286 2287 1 15 1980 21 OT M 19.0 22.0
2287 2288 1 15 1980 19 RM F 17.0 12.0
2288 2289 1 15 1980 20 DS F 52.0 150.0
2289 2290 1 15 1980 11 DM M 37.0 49.0
2290 2291 1 15 1980 9 OL F 21.0 34.0
2291 2292 1 15 1980 12 DM F 35.0 40.0
2292 2293 1 15 1980 18 DS F 51.0 132.0
2293 2294 1 15 1980 22 DM F 34.0 25.0
2294 2295 1 15 1980 9 OL M 21.0 36.0
2295 2296 1 15 1980 8 DO F 34.0 50.0
2296 2297 1 15 1980 11 DM M 37.0 45.0
2297 2298 1 15 1980 17 DM M 35.0 47.0
2298 2299 1 15 1980 9 DM M 38.0 46.0
2299 2300 1 15 1980 18 DM F 32.0 29.0
... ... ... ... ... ... ... ... ... ...
11197 11198 12 8 1985 4 DS M 45.0 129.0
11198 11199 12 8 1985 8 DM F 38.0 42.0
11199 11200 12 8 1985 7 AB NaN NaN NaN
11200 11201 12 8 1985 5 OL M 21.0 29.0
11201 11202 12 8 1985 9 DM F 35.0 39.0
11202 11203 12 8 1985 7 PE F 17.0 19.0
11203 11204 12 8 1985 3 PP F 22.0 16.0
11204 11205 12 8 1985 5 DO M 37.0 56.0
11205 11206 12 8 1985 11 DM F 38.0 38.0
11206 11207 12 8 1985 2 PE M 18.0 19.0
11207 11208 12 8 1985 8 DS F 50.0 120.0
11208 11209 12 8 1985 2 DO F 37.0 52.0
11209 11210 12 8 1985 2 DM F 35.0 40.0
11210 11211 12 8 1985 13 DM M 37.0 45.0
11211 11212 12 8 1985 4 DS NaN NaN 121.0
11212 11213 12 8 1985 13 AH NaN NaN NaN
11213 11214 12 8 1985 1 DM F 37.0 44.0
11214 11215 12 8 1985 2 NL F 32.0 160.0
11215 11216 12 8 1985 3 RM M 17.0 9.0
11216 11217 12 8 1985 4 OL M 24.0 34.0
11217 11218 12 8 1985 9 DM F 36.0 39.0
11218 11219 12 8 1985 8 DM F 38.0 41.0
11219 11220 12 8 1985 5 DO F 37.0 56.0
11220 11221 12 8 1985 13 AH NaN NaN NaN
11221 11222 12 8 1985 7 AB NaN NaN NaN
11222 11223 12 8 1985 4 DM M 36.0 40.0
11223 11224 12 8 1985 11 DM M 37.0 49.0
11224 11225 12 8 1985 7 PE M 20.0 18.0
11225 11226 12 8 1985 1 DM M 38.0 47.0
11226 11227 12 8 1985 15 NaN NaN NaN NaN

8957 rows × 9 columns

Python Syntax Cheat Sheet

Use can use the syntax below when querying data by criteria from a DataFrame. Experiment with selecting various subsets of the "surveys" data.

Challenge - Queries

  1. Select a subset of rows in the surveys_df DataFrame that contain data from the year 1999 and that contain weight values less than or equal to 8. How many rows did you end up with? What did your neighbor get?

  2. (Extra) Use the isin function to find all plots that contain PB and PL species in the "surveys" DataFrame. How many records contain these values?

You can use the isin command in Python to query a DataFrame based upon a list of values as follows:

surveys_df[surveys_df['species_id'].isin([listGoesHere])]

Extra Challenges

  1. (Extra) Create a query that finds all rows with a weight value greater than (>) or equal to 0.

  2. (Extra) The ~ symbol in Python can be used to return the OPPOSITE of the selection that you specify in Python. It is equivalent to is not in. Write a query that selects all rows with sex NOT equal to 'M' or 'F' in the "surveys" data.

Using masks to identify a specific condition

A mask can be useful to locate where a particular subset of values exist or don't exist - for example, NaN, or "Not a Number" values. To understand masks, we also need to understand BOOLEAN objects in Python.

Boolean values include True or False. For example,

# Set x to 5
x = 5
# What does the code below return?
x > 5
output
False
# How about this?
x == 5
output
True

Extra Challenges - Putting it all together

  1. Create a new DataFrame that only contains observations with sex values that are not female or male. Assign each sex value in the new DataFrame to a new value of 'x'. Determine the number of null values in the subset.

  2. Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot with male vs female values stacked for each plot.

  3. Count the number of missing values per column. Hint: The method .count() gives you the number of non-NA observations per column.
# Solution extra challenge 2

# selection of the data with isin
stack_selection = surveys_df[(surveys_df['sex'].isin(['M', 'F'])) &
                            surveys_df["weight"] > 0.][["sex", "weight", "site_id"]]

# calculate the mean weight for each site id and sex combination:
stack_selection = stack_selection.groupby(["site_id", "sex"]).mean().unstack()

# Plot inside jupyter notebook
%matplotlib inline

# and we can make a stacked bar plot from this:
stack_selection.plot(kind='bar', stacked=True)
output

png