Indexing, Slicing and Subsetting
In this lesson, we will explore ways to access different parts of the data in a Pandas DataFrame using:
- Indexing,
- Slicing, and
- Subsetting
Ensure the Pandas package is installed
!pip install pandas matplotlib
outputRequirement already satisfied: pandas in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (0.25.0) Requirement already satisfied: matplotlib in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (3.1.1) Requirement already satisfied: python-dateutil>=2.6.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2.8.0) Requirement already satisfied: pytz>=2017.2 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2019.1) Requirement already satisfied: numpy>=1.13.3 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (1.17.0) Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (0.10.0) Requirement already satisfied: kiwisolver>=1.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (1.1.0) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (2.4.1.1) Requirement already satisfied: six>=1.5 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0) Requirement already satisfied: setuptools in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (39.1.0)
Loading our data
We will continue to use the surveys dataset that we worked with in the last lesson. Let's reopen and read in the data again:
# Make sure pandas is loaded
import pandas as pd
# Read in the survey CSV
surveys_df = pd.read_csv("surveys.csv")
Indexing and Slicing in Python
We often want to work with subsets of a DataFrame object. There are different ways to accomplish this including: using labels (column headings), numeric ranges, or specific x,y index locations.
Selecting data using Labels (Column Headings)
We use square brackets []
to select a subset of an Python object. For example,
we can select all data from a column named species_id
from the surveys_df
DataFrame by name. There are two ways to do this:
# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id'].head()
output0 NL 1 NL 2 DM 3 DM 4 DM Name: species_id, dtype: object
# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id.head()
output0 NL 1 NL 2 DM 3 DM 4 DM Name: species_id, dtype: object
We can also create a new object that contains only the data within the
species_id
column as follows:
# Creates an object, surveys_species, that only contains the `species_id` column
surveys_species = surveys_df['species_id']
We can pass a list of column names too, as an index to select columns in that order. This is useful when we need to reorganize our data.
NOTE: If a column name is not contained in the DataFrame, an exception (error) will be raised.
# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'site_id']].head()
species_id | site_id | |
---|---|---|
0 | NL | 2 |
1 | NL | 3 |
2 | DM | 2 |
3 | DM | 7 |
4 | DM | 3 |
What happens if you ask for a column that doesn't exist?
surveys_df['speciess']
Outputs:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/Applications/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2392 try:
-> 2393 return self._engine.get_loc(key)
2394 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()
KeyError: 'speciess'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-7-7d65fa0158b8> in <module>()
1
2 # What happens if you ask for a column that doesn't exist?
----> 3 surveys_df['speciess']
4
/Applications/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
2060 return self._getitem_multilevel(key)
2061 else:
-> 2062 return self._getitem_column(key)
2063
2064 def _getitem_column(self, key):
/Applications/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2067 # get column
2068 if self.columns.is_unique:
-> 2069 return self._get_item_cache(key)
2070
2071 # duplicate columns & possible reduce dimensionality
/Applications/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
1532 res = cache.get(item)
1533 if res is None:
-> 1534 values = self._data.get(item)
1535 res = self._box_item_values(item, values)
1536 cache[item] = res
/Applications/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
/Applications/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2393 return self._engine.get_loc(key)
2394 except KeyError:
-> 2395 return self._engine.get_loc(self._maybe_cast_indexer(key))
2396
2397 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()
KeyError: 'speciess'
Python tells us what type of error it is in the traceback, at the bottom it says KeyError: 'speciess'
which means that speciess
is not a column name (or Key in the related python data type dictionary).
# What happens when you flip the order?
surveys_df[['site_id', 'species_id']].head()
site_id | species_id | |
---|---|---|
0 | 2 | NL |
1 | 3 | NL |
2 | 2 | DM |
3 | 7 | DM |
4 | 3 | DM |
Extracting Range based Subsets: Slicing
REMINDER: Python Uses 0-based Indexing
Let's remind ourselves that Python uses 0-based indexing. This means that the first element in an object is located at position 0. This is different from other tools like R and Matlab that index elements within objects starting at 1.
# Create a list of numbers:
a = [1, 2, 3, 4, 5]
Challenge - Extracting data
-
What value does the code a[0] return?
-
How about this: a[5]
-
In the example above, calling
a[5]
returns an error. Why is that? -
What about a[len(a)] ?
Solutions - Extracting data
Slicing Subsets of Rows in Python
Slicing using the []
operator selects a set of rows and/or columns from a
DataFrame. To slice out a set of rows, you use the following syntax:
data[start:stop]
. When slicing in pandas the start bound is included in the
output. The stop bound is one step BEYOND the row you want to select. So if you
want to select rows 0, 1 and 2 your code would look like this with our surveys data
:
# Select rows 0, 1, 2 (row 3 is not selected)
surveys_df[0:3]
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
The stop bound in Python is different from what you might be used to in languages like Matlab and R.
Now lets select the first 5 rows (rows 0, 1, 2, 3, 4)
.
surveys_df[:5]
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
# Select the last element in the list
# (the slice starts at the last element, and ends at the end of the list)
surveys_df[-1:]
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
We can also reassign values within subsets of our DataFrame.
Let's create a brand new clean dataframe from the original data CSV file.
surveys_df = pd.read_csv("surveys.csv")
Slicing Subsets of Rows and Columns in Python
We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.
loc
is primarily label based indexing. Integers may be used but they are interpreted as a label.iloc
is primarily integer based indexing
To select a subset of rows and columns from our DataFrame, we can use the
iloc
method. For example, we can select month, day and year (columns 2, 3
and 4 if we start counting at 1), like this:
iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]
month | day | year | |
---|---|---|---|
0 | 7 | 16 | 1977 |
1 | 7 | 16 | 1977 |
2 | 7 | 16 | 1977 |
Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.
Let's explore some other ways to index and select subsets of data:
# Select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
# What does this do?
surveys_df.loc[0, ['species_id', 'site_id', 'weight']]
outputspecies_id NL site_id 2 weight NaN Name: 0, dtype: object
# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]
output/Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages/pandas/core/indexing.py:1404: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike return self._getitem_tuple(key)
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 7.0 | 16.0 | 1977.0 | 2.0 | NL | M | 32.0 | NaN |
10 | 11.0 | 7.0 | 16.0 | 1977.0 | 5.0 | DS | F | 53.0 | NaN |
35549 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
NOTE: Labels must be found in the DataFrame or you will get a KeyError
.
Indexing by labels loc
differs from indexing by integers iloc
.
With loc
, the both start bound and the stop bound are inclusive. When using
loc
, integers can be used, but the integers refer to the
index label and not the position. For example, using loc
and select 1:4
will get a different result than using iloc
to select rows 1:4.
We can also select a specific data value using a row and
column location within the DataFrame and iloc
indexing:
# Syntax for iloc indexing to finding a specific data element
dat.iloc[row, column]
In following iloc example:
surveys_df.iloc[2, 6]
output'F'
Remember that Python indexing begins at 0. So, the index location [2, 6] selects the element that is 3 rows down and 7 columns over in the DataFrame.
Challenge - Range
-
What happens when you execute:
surveys_df[0:1]
surveys_df[:4]
surveys_df[:-1]
-
What happens when you call:
surveys_df.iloc[0:4, 1:4]
Subsetting Data using Criteria
We can also select a subset of our data using criteria. For example, we can
select all rows that have a year value of 2002
:
surveys_df[surveys_df.year == 2002].head()
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
33320 | 33321 | 1 | 12 | 2002 | 1 | DM | M | 38.0 | 44.0 |
33321 | 33322 | 1 | 12 | 2002 | 1 | DO | M | 37.0 | 58.0 |
33322 | 33323 | 1 | 12 | 2002 | 1 | PB | M | 28.0 | 45.0 |
33323 | 33324 | 1 | 12 | 2002 | 1 | AB | NaN | NaN | NaN |
33324 | 33325 | 1 | 12 | 2002 | 1 | DO | M | 35.0 | 29.0 |
Or we can select all rows that do not contain the year 2002
:
surveys_df[surveys_df.year != 2002]
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
11 | 12 | 7 | 16 | 1977 | 7 | DM | M | 38.0 | NaN |
12 | 13 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
13 | 14 | 7 | 16 | 1977 | 8 | DM | NaN | NaN | NaN |
14 | 15 | 7 | 16 | 1977 | 6 | DM | F | 36.0 | NaN |
15 | 16 | 7 | 16 | 1977 | 4 | DM | F | 36.0 | NaN |
16 | 17 | 7 | 16 | 1977 | 3 | DS | F | 48.0 | NaN |
17 | 18 | 7 | 16 | 1977 | 2 | PP | M | 22.0 | NaN |
18 | 19 | 7 | 16 | 1977 | 4 | PF | NaN | NaN | NaN |
19 | 20 | 7 | 17 | 1977 | 11 | DS | F | 48.0 | NaN |
20 | 21 | 7 | 17 | 1977 | 14 | DM | F | 34.0 | NaN |
21 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN |
22 | 23 | 7 | 17 | 1977 | 13 | DM | M | 36.0 | NaN |
23 | 24 | 7 | 17 | 1977 | 13 | SH | M | 21.0 | NaN |
24 | 25 | 7 | 17 | 1977 | 9 | DM | M | 35.0 | NaN |
25 | 26 | 7 | 17 | 1977 | 15 | DM | M | 31.0 | NaN |
26 | 27 | 7 | 17 | 1977 | 15 | DM | M | 36.0 | NaN |
27 | 28 | 7 | 17 | 1977 | 11 | DM | M | 38.0 | NaN |
28 | 29 | 7 | 17 | 1977 | 11 | PP | M | NaN | NaN |
29 | 30 | 7 | 17 | 1977 | 10 | DS | F | 52.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33290 | 33291 | 12 | 15 | 2001 | 23 | PE | M | 20.0 | 18.0 |
33291 | 33292 | 12 | 15 | 2001 | 23 | RM | F | 16.0 | 8.0 |
33292 | 33293 | 12 | 15 | 2001 | 20 | PE | F | 20.0 | 22.0 |
33293 | 33294 | 12 | 15 | 2001 | 20 | SH | M | 25.0 | 43.0 |
33294 | 33295 | 12 | 15 | 2001 | 20 | PB | F | 27.0 | 33.0 |
33295 | 33296 | 12 | 15 | 2001 | 20 | PB | M | 25.0 | 35.0 |
33296 | 33297 | 12 | 15 | 2001 | 20 | RM | M | 16.0 | 11.0 |
33297 | 33298 | 12 | 15 | 2001 | 20 | RM | F | 16.0 | 8.0 |
33298 | 33299 | 12 | 15 | 2001 | 20 | PB | F | 25.0 | 28.0 |
33299 | 33300 | 12 | 15 | 2001 | 20 | PB | F | 26.0 | 30.0 |
33300 | 33301 | 12 | 15 | 2001 | 20 | PB | F | 27.0 | 31.0 |
33301 | 33302 | 12 | 15 | 2001 | 24 | PE | M | 20.0 | 24.0 |
33302 | 33303 | 12 | 15 | 2001 | 24 | PE | M | 20.0 | 23.0 |
33303 | 33304 | 12 | 15 | 2001 | 24 | RM | M | 16.0 | 10.0 |
33304 | 33305 | 12 | 15 | 2001 | 7 | PB | M | 29.0 | 44.0 |
33305 | 33306 | 12 | 15 | 2001 | 7 | OT | M | 19.0 | 21.0 |
33306 | 33307 | 12 | 15 | 2001 | 7 | OT | M | 20.0 | 19.0 |
33307 | 33308 | 12 | 15 | 2001 | 7 | PP | M | 24.0 | 16.0 |
33308 | 33309 | 12 | 16 | 2001 | 3 | NaN | NaN | NaN | NaN |
33309 | 33310 | 12 | 16 | 2001 | 4 | NaN | NaN | NaN | NaN |
33310 | 33311 | 12 | 16 | 2001 | 5 | NaN | NaN | NaN | NaN |
33311 | 33312 | 12 | 16 | 2001 | 6 | NaN | NaN | NaN | NaN |
33312 | 33313 | 12 | 16 | 2001 | 8 | NaN | NaN | NaN | NaN |
33313 | 33314 | 12 | 16 | 2001 | 9 | NaN | NaN | NaN | NaN |
33314 | 33315 | 12 | 16 | 2001 | 10 | NaN | NaN | NaN | NaN |
33315 | 33316 | 12 | 16 | 2001 | 11 | NaN | NaN | NaN | NaN |
33316 | 33317 | 12 | 16 | 2001 | 13 | NaN | NaN | NaN | NaN |
33317 | 33318 | 12 | 16 | 2001 | 14 | NaN | NaN | NaN | NaN |
33318 | 33319 | 12 | 16 | 2001 | 15 | NaN | NaN | NaN | NaN |
33319 | 33320 | 12 | 16 | 2001 | 16 | NaN | NaN | NaN | NaN |
33320 rows × 9 columns
We can define sets of criteria too:
surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2270 | 2271 | 1 | 15 | 1980 | 8 | DO | M | 35.0 | 53.0 |
2271 | 2272 | 1 | 15 | 1980 | 11 | PF | F | 16.0 | 10.0 |
2272 | 2273 | 1 | 15 | 1980 | 18 | DM | F | 34.0 | 33.0 |
2273 | 2274 | 1 | 15 | 1980 | 11 | DM | M | 38.0 | 37.0 |
2274 | 2275 | 1 | 15 | 1980 | 8 | DO | F | 33.0 | 29.0 |
2275 | 2276 | 1 | 15 | 1980 | 11 | DS | M | 47.0 | 132.0 |
2276 | 2277 | 1 | 15 | 1980 | 8 | PF | M | 15.0 | 8.0 |
2277 | 2278 | 1 | 15 | 1980 | 9 | OT | M | 21.0 | 23.0 |
2278 | 2279 | 1 | 15 | 1980 | 11 | DM | F | 36.0 | 36.0 |
2279 | 2280 | 1 | 15 | 1980 | 21 | OT | F | 20.0 | 21.0 |
2280 | 2281 | 1 | 15 | 1980 | 11 | OL | M | 20.0 | 29.0 |
2281 | 2282 | 1 | 15 | 1980 | 17 | DM | F | 36.0 | 49.0 |
2282 | 2283 | 1 | 15 | 1980 | 11 | OL | M | 21.0 | 23.0 |
2283 | 2284 | 1 | 15 | 1980 | 9 | OL | M | 20.0 | 32.0 |
2284 | 2285 | 1 | 15 | 1980 | 10 | OL | F | 20.0 | 24.0 |
2285 | 2286 | 1 | 15 | 1980 | 11 | DM | M | 38.0 | 47.0 |
2286 | 2287 | 1 | 15 | 1980 | 21 | OT | M | 19.0 | 22.0 |
2287 | 2288 | 1 | 15 | 1980 | 19 | RM | F | 17.0 | 12.0 |
2288 | 2289 | 1 | 15 | 1980 | 20 | DS | F | 52.0 | 150.0 |
2289 | 2290 | 1 | 15 | 1980 | 11 | DM | M | 37.0 | 49.0 |
2290 | 2291 | 1 | 15 | 1980 | 9 | OL | F | 21.0 | 34.0 |
2291 | 2292 | 1 | 15 | 1980 | 12 | DM | F | 35.0 | 40.0 |
2292 | 2293 | 1 | 15 | 1980 | 18 | DS | F | 51.0 | 132.0 |
2293 | 2294 | 1 | 15 | 1980 | 22 | DM | F | 34.0 | 25.0 |
2294 | 2295 | 1 | 15 | 1980 | 9 | OL | M | 21.0 | 36.0 |
2295 | 2296 | 1 | 15 | 1980 | 8 | DO | F | 34.0 | 50.0 |
2296 | 2297 | 1 | 15 | 1980 | 11 | DM | M | 37.0 | 45.0 |
2297 | 2298 | 1 | 15 | 1980 | 17 | DM | M | 35.0 | 47.0 |
2298 | 2299 | 1 | 15 | 1980 | 9 | DM | M | 38.0 | 46.0 |
2299 | 2300 | 1 | 15 | 1980 | 18 | DM | F | 32.0 | 29.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11197 | 11198 | 12 | 8 | 1985 | 4 | DS | M | 45.0 | 129.0 |
11198 | 11199 | 12 | 8 | 1985 | 8 | DM | F | 38.0 | 42.0 |
11199 | 11200 | 12 | 8 | 1985 | 7 | AB | NaN | NaN | NaN |
11200 | 11201 | 12 | 8 | 1985 | 5 | OL | M | 21.0 | 29.0 |
11201 | 11202 | 12 | 8 | 1985 | 9 | DM | F | 35.0 | 39.0 |
11202 | 11203 | 12 | 8 | 1985 | 7 | PE | F | 17.0 | 19.0 |
11203 | 11204 | 12 | 8 | 1985 | 3 | PP | F | 22.0 | 16.0 |
11204 | 11205 | 12 | 8 | 1985 | 5 | DO | M | 37.0 | 56.0 |
11205 | 11206 | 12 | 8 | 1985 | 11 | DM | F | 38.0 | 38.0 |
11206 | 11207 | 12 | 8 | 1985 | 2 | PE | M | 18.0 | 19.0 |
11207 | 11208 | 12 | 8 | 1985 | 8 | DS | F | 50.0 | 120.0 |
11208 | 11209 | 12 | 8 | 1985 | 2 | DO | F | 37.0 | 52.0 |
11209 | 11210 | 12 | 8 | 1985 | 2 | DM | F | 35.0 | 40.0 |
11210 | 11211 | 12 | 8 | 1985 | 13 | DM | M | 37.0 | 45.0 |
11211 | 11212 | 12 | 8 | 1985 | 4 | DS | NaN | NaN | 121.0 |
11212 | 11213 | 12 | 8 | 1985 | 13 | AH | NaN | NaN | NaN |
11213 | 11214 | 12 | 8 | 1985 | 1 | DM | F | 37.0 | 44.0 |
11214 | 11215 | 12 | 8 | 1985 | 2 | NL | F | 32.0 | 160.0 |
11215 | 11216 | 12 | 8 | 1985 | 3 | RM | M | 17.0 | 9.0 |
11216 | 11217 | 12 | 8 | 1985 | 4 | OL | M | 24.0 | 34.0 |
11217 | 11218 | 12 | 8 | 1985 | 9 | DM | F | 36.0 | 39.0 |
11218 | 11219 | 12 | 8 | 1985 | 8 | DM | F | 38.0 | 41.0 |
11219 | 11220 | 12 | 8 | 1985 | 5 | DO | F | 37.0 | 56.0 |
11220 | 11221 | 12 | 8 | 1985 | 13 | AH | NaN | NaN | NaN |
11221 | 11222 | 12 | 8 | 1985 | 7 | AB | NaN | NaN | NaN |
11222 | 11223 | 12 | 8 | 1985 | 4 | DM | M | 36.0 | 40.0 |
11223 | 11224 | 12 | 8 | 1985 | 11 | DM | M | 37.0 | 49.0 |
11224 | 11225 | 12 | 8 | 1985 | 7 | PE | M | 20.0 | 18.0 |
11225 | 11226 | 12 | 8 | 1985 | 1 | DM | M | 38.0 | 47.0 |
11226 | 11227 | 12 | 8 | 1985 | 15 | NaN | NaN | NaN | NaN |
8957 rows × 9 columns
Python Syntax Cheat Sheet
Use can use the syntax below when querying data by criteria from a DataFrame. Experiment with selecting various subsets of the "surveys" data.
- Equals:
==
- Not equals:
!=
- Greater than, less than:
>
or<
- Greater than or equal to
>=
- Less than or equal to
<=
Challenge - Queries
-
Select a subset of rows in the
surveys_df
DataFrame that contain data from the year 1999 and that contain weight values less than or equal to 8. How many rows did you end up with? What did your neighbor get? -
(Extra) Use the
isin
function to find all plots that containPB
andPL
species in the "surveys" DataFrame. How many records contain these values?
You can use the isin
command in Python to query a DataFrame based upon a
list of values as follows:
surveys_df[surveys_df['species_id'].isin([listGoesHere])]
Extra Challenges
-
(Extra) Create a query that finds all rows with a weight value greater than (
>
) or equal to 0. -
(Extra) The
~
symbol in Python can be used to return the OPPOSITE of the selection that you specify in Python. It is equivalent to is not in. Write a query that selects all rows with sex NOT equal to 'M' or 'F' in the "surveys" data.
Using masks to identify a specific condition
A mask can be useful to locate where a particular subset of values exist or
don't exist - for example, NaN, or "Not a Number" values. To understand masks,
we also need to understand BOOLEAN
objects in Python.
Boolean values include True
or False
. For example,
# Set x to 5
x = 5
# What does the code below return?
x > 5
outputFalse
# How about this?
x == 5
outputTrue
Extra Challenges - Putting it all together
-
Create a new DataFrame that only contains observations with sex values that are not female or male. Assign each sex value in the new DataFrame to a new value of 'x'. Determine the number of null values in the subset.
-
Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot with male vs female values stacked for each plot.
- Count the number of missing values per column. Hint: The method .count() gives you the number of non-NA observations per column.
# Solution extra challenge 2
# selection of the data with isin
stack_selection = surveys_df[(surveys_df['sex'].isin(['M', 'F'])) &
surveys_df["weight"] > 0.][["sex", "weight", "site_id"]]
# calculate the mean weight for each site id and sex combination:
stack_selection = stack_selection.groupby(["site_id", "sex"]).mean().unstack()
# Plot inside jupyter notebook
%matplotlib inline
# and we can make a stacked bar plot from this:
stack_selection.plot(kind='bar', stacked=True)
output