Data Analysis with Python
Automating data analysis tasks in Python
We can automate the process of performing data manipulations in Python. It's efficient to spend time building the code to perform these tasks because once it's built, we can use it over and over on different datasets that use a similar format. This makes our methods easily reproducible. We can also easily share our code with colleagues and they can replicate the same analysis.
The Dataset
For this lesson, we will be using the Portal Teaching data, a subset of the data from Ernst et al Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA
We will be using this dataset, which can be downloaded here: surveys.csv ... but don't click to download it in your browser - we are going to use Python !
import urllib.request
# You can also get this URL value by right-clicking the `surveys.csv` link above and selecting "Copy Link Address"
url = 'https://monashdatafluency.github.io/python-workshop-base/modules/data/surveys.csv'
# url = 'https://goo.gl/9ZxqBg' # or a shortened version to save typing
urllib.request.urlretrieve(url, 'surveys.csv')
output('surveys.csv',
)
If Jupyter is running locally on your computer, you'll now have a file surveys.csv
in the current working directory.
You can check by clicking on File
tab on the top left of the notebook to see if the file exists. If you are running Jupyter on a remote server or cloud service (eg Colaboratory or Azure Notebooks), the file will be there instead.
We are studying the species and weight of animals caught in plots in our study
area. The dataset is stored as a .csv
file: each row holds information for a
single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
site_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal ("M", "F") |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
The first few rows of our file look like this:
record_id,month,day,year,site_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
5,7,16,1977,3,DM,M,35,
6,7,16,1977,1,PF,M,14,
7,7,16,1977,2,PE,F,,
8,7,16,1977,1,DM,M,37,
9,7,16,1977,1,DM,F,34,
About Libraries
A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up, it can be used or called to perform many tasks.
If you have noticed in the previous code import urllib.request
, we are calling
a request function from library urllib to download our dataset from web.
Pandas in Python
The dataset we have, is in table format. One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. Pandas). The Pandas library provides data structures, produces high quality plots with matplotlib and integrates nicely with other libraries that use NumPy (which is another Python library) arrays.
First, lets make sure the Pandas and matplotlib packages are installed.
!pip install pandas matplotlib
outputRequirement already satisfied: pandas in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (0.25.0) Requirement already satisfied: matplotlib in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (3.1.1) Requirement already satisfied: pytz>=2017.2 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2019.1) Requirement already satisfied: python-dateutil>=2.6.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (2.8.0) Requirement already satisfied: numpy>=1.13.3 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from pandas) (1.17.0) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (2.4.1.1) Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (0.10.0) Requirement already satisfied: kiwisolver>=1.0.1 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from matplotlib) (1.1.0) Requirement already satisfied: six>=1.5 in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0) Requirement already satisfied: setuptools in /Users/perry/.virtualenvs/python-workshop-base-ufuVBSbV/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (39.1.0)
Python doesn't load all of the libraries available to it by default. We have to
add an import
statement to our code in order to use library functions. To import
a library, we use the syntax import libraryName
. If we want to give the
library a nickname to shorten the command, we can add as nickNameHere
. An
example of importing the pandas library using the common nickname pd
is below.
import pandas as pd
Each time we call a function that's in a library, we use the syntax
LibraryName.FunctionName
. Adding the library name with a .
before the
function name tells Python where to find the function. In the example above, we
have imported Pandas as pd
. This means we don't have to type out pandas
each
time we call a Pandas function.
Reading CSV Data Using Pandas
We will begin by locating and reading our survey data which are in CSV format. CSV stands for Comma-Separated Values and is a common way store formatted data. Other symbols my also be used, so you might see tab-separated, colon-separated or space separated files. It is quite easy to replace one separator with another, to match your application. The first line in the file often has headers to explain what is in each column. CSV (and other separators) make it easy to share data, and can be imported and exported from many applications, including Microsoft Excel.
We can use Pandas' read_csv
function to pull the file directly into a
DataFrame.
So What's a DataFrame?
A DataFrame is a 2-dimensional data structure that can store data of different
types (including characters, integers, floating point values, factors and more)
in columns. It is similar to a spreadsheet or an SQL table or the data.frame
in
R. A DataFrame always has an index (0-based). An index refers to the position of
an element in the data structure.
# Note that pd.read_csv is used because we imported pandas as pd
pd.read_csv("surveys.csv")
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
11 | 12 | 7 | 16 | 1977 | 7 | DM | M | 38.0 | NaN |
12 | 13 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
13 | 14 | 7 | 16 | 1977 | 8 | DM | NaN | NaN | NaN |
14 | 15 | 7 | 16 | 1977 | 6 | DM | F | 36.0 | NaN |
15 | 16 | 7 | 16 | 1977 | 4 | DM | F | 36.0 | NaN |
16 | 17 | 7 | 16 | 1977 | 3 | DS | F | 48.0 | NaN |
17 | 18 | 7 | 16 | 1977 | 2 | PP | M | 22.0 | NaN |
18 | 19 | 7 | 16 | 1977 | 4 | PF | NaN | NaN | NaN |
19 | 20 | 7 | 17 | 1977 | 11 | DS | F | 48.0 | NaN |
20 | 21 | 7 | 17 | 1977 | 14 | DM | F | 34.0 | NaN |
21 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN |
22 | 23 | 7 | 17 | 1977 | 13 | DM | M | 36.0 | NaN |
23 | 24 | 7 | 17 | 1977 | 13 | SH | M | 21.0 | NaN |
24 | 25 | 7 | 17 | 1977 | 9 | DM | M | 35.0 | NaN |
25 | 26 | 7 | 17 | 1977 | 15 | DM | M | 31.0 | NaN |
26 | 27 | 7 | 17 | 1977 | 15 | DM | M | 36.0 | NaN |
27 | 28 | 7 | 17 | 1977 | 11 | DM | M | 38.0 | NaN |
28 | 29 | 7 | 17 | 1977 | 11 | PP | M | NaN | NaN |
29 | 30 | 7 | 17 | 1977 | 10 | DS | F | 52.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35519 | 35520 | 12 | 31 | 2002 | 9 | SF | NaN | 24.0 | 36.0 |
35520 | 35521 | 12 | 31 | 2002 | 9 | DM | M | 37.0 | 48.0 |
35521 | 35522 | 12 | 31 | 2002 | 9 | DM | F | 35.0 | 45.0 |
35522 | 35523 | 12 | 31 | 2002 | 9 | DM | F | 36.0 | 44.0 |
35523 | 35524 | 12 | 31 | 2002 | 9 | PB | F | 25.0 | 27.0 |
35524 | 35525 | 12 | 31 | 2002 | 9 | OL | M | 21.0 | 26.0 |
35525 | 35526 | 12 | 31 | 2002 | 8 | OT | F | 20.0 | 24.0 |
35526 | 35527 | 12 | 31 | 2002 | 13 | DO | F | 33.0 | 43.0 |
35527 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN |
35528 | 35529 | 12 | 31 | 2002 | 13 | PB | F | 25.0 | 25.0 |
35529 | 35530 | 12 | 31 | 2002 | 13 | OT | F | 20.0 | NaN |
35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
35531 | 35532 | 12 | 31 | 2002 | 14 | DM | F | 34.0 | 43.0 |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
35537 | 35538 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 31.0 |
35538 | 35539 | 12 | 31 | 2002 | 15 | SF | M | 26.0 | 68.0 |
35539 | 35540 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 23.0 |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
The above command outputs a DateFrame
object, which Jupyter displays as a table (snipped in the middle since there are many rows).
We can see that there were 33,549 rows parsed. Each row has 9
columns. The first column is the index of the DataFrame. The index is used to
identify the position of the data, but it is not an actual column of the DataFrame.
It looks like the read_csv
function in Pandas read our file properly. However,
we haven't saved any data to memory so we can work with it.We need to assign the
DataFrame to a variable. Remember that a variable is a name for a value, such as x
,
or data
. We can create a new object with a variable name by assigning a value to it using =
.
Let's call the imported survey data surveys_df
:
surveys_df = pd.read_csv("surveys.csv")
Notice when you assign the imported DataFrame to a variable, Python does not
produce any output on the screen. We can view the value of the surveys_df
object by typing its name into the cell.
surveys_df
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
11 | 12 | 7 | 16 | 1977 | 7 | DM | M | 38.0 | NaN |
12 | 13 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
13 | 14 | 7 | 16 | 1977 | 8 | DM | NaN | NaN | NaN |
14 | 15 | 7 | 16 | 1977 | 6 | DM | F | 36.0 | NaN |
15 | 16 | 7 | 16 | 1977 | 4 | DM | F | 36.0 | NaN |
16 | 17 | 7 | 16 | 1977 | 3 | DS | F | 48.0 | NaN |
17 | 18 | 7 | 16 | 1977 | 2 | PP | M | 22.0 | NaN |
18 | 19 | 7 | 16 | 1977 | 4 | PF | NaN | NaN | NaN |
19 | 20 | 7 | 17 | 1977 | 11 | DS | F | 48.0 | NaN |
20 | 21 | 7 | 17 | 1977 | 14 | DM | F | 34.0 | NaN |
21 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN |
22 | 23 | 7 | 17 | 1977 | 13 | DM | M | 36.0 | NaN |
23 | 24 | 7 | 17 | 1977 | 13 | SH | M | 21.0 | NaN |
24 | 25 | 7 | 17 | 1977 | 9 | DM | M | 35.0 | NaN |
25 | 26 | 7 | 17 | 1977 | 15 | DM | M | 31.0 | NaN |
26 | 27 | 7 | 17 | 1977 | 15 | DM | M | 36.0 | NaN |
27 | 28 | 7 | 17 | 1977 | 11 | DM | M | 38.0 | NaN |
28 | 29 | 7 | 17 | 1977 | 11 | PP | M | NaN | NaN |
29 | 30 | 7 | 17 | 1977 | 10 | DS | F | 52.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35519 | 35520 | 12 | 31 | 2002 | 9 | SF | NaN | 24.0 | 36.0 |
35520 | 35521 | 12 | 31 | 2002 | 9 | DM | M | 37.0 | 48.0 |
35521 | 35522 | 12 | 31 | 2002 | 9 | DM | F | 35.0 | 45.0 |
35522 | 35523 | 12 | 31 | 2002 | 9 | DM | F | 36.0 | 44.0 |
35523 | 35524 | 12 | 31 | 2002 | 9 | PB | F | 25.0 | 27.0 |
35524 | 35525 | 12 | 31 | 2002 | 9 | OL | M | 21.0 | 26.0 |
35525 | 35526 | 12 | 31 | 2002 | 8 | OT | F | 20.0 | 24.0 |
35526 | 35527 | 12 | 31 | 2002 | 13 | DO | F | 33.0 | 43.0 |
35527 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN |
35528 | 35529 | 12 | 31 | 2002 | 13 | PB | F | 25.0 | 25.0 |
35529 | 35530 | 12 | 31 | 2002 | 13 | OT | F | 20.0 | NaN |
35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
35531 | 35532 | 12 | 31 | 2002 | 14 | DM | F | 34.0 | 43.0 |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
35537 | 35538 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 31.0 |
35538 | 35539 | 12 | 31 | 2002 | 15 | SF | M | 26.0 | 68.0 |
35539 | 35540 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 23.0 |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
which prints contents like above.
You can also select just a few rows, so it is easier to fit on one window, you can see that pandas has neatly formatted the data to fit our screen.
Here, we will be using a function called head.
The head()
function displays the first several lines of a file. It is discussed below.
surveys_df.head()
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
Exploring Our Species Survey Data
Again, we can use the type
function to see what kind of thing surveys_df
is:
type(surveys_df)
outputpandas.core.frame.DataFrame
As expected, it's a DataFrame (or, to use the full name that Python uses to refer
to it internally, a pandas.core.frame.DataFrame
).
What kind of things does surveys_df
contain? DataFrames have an attribute
called dtypes
that answers this:
surveys_df.dtypes
outputrecord_id int64 month int64 day int64 year int64 site_id int64 species_id object sex object hindfoot_length float64 weight float64 dtype: object
All the values in a single column have the same type. For example, months have type
int64
, which is a kind of integer. Cells in the month column cannot have
fractional values, but the weight and hindfoot_length columns can, because they
have type float64
. The object
type doesn't have a very helpful name, but in
this case it represents strings (such as 'M' and 'F' in the case of sex).
Useful Ways to View DataFrame objects in Python
There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.
To access an attribute, use the DataFrame object name followed by the attribute
name df_object.attribute
. Using the DataFrame surveys_df
and attribute
columns
, an index of all the column names in the DataFrame can be accessed
with surveys_df.columns
.
Methods are called in a similar fashion using the syntax df_object.method()
.
As an example, surveys_df.head()
gets the first few rows in the DataFrame
surveys_df
using the head()
method. With a method, we can supply extra
information in the parens to control behaviour.
Let's look at the data using these.
Challenge - DataFrames
Using our DataFrame surveys_df
, try out the attributes & methods below to see
what they return.
surveys_df.columns
surveys_df.shape
Take note of the output ofshape
- what format does it return the shape of the DataFrame in? HINT: More on tuples, here.surveys_df.head()
Also, what doessurveys_df.head(15)
do?surveys_df.tail()
Calculating Statistics From Data
We've read our data into Python. Next, let's perform some quick summary statistics to learn more about the data that we're working with. We might want to know how many animals were collected in each plot, or how many of each species were caught. We can perform summary stats quickly using groups. But first we need to figure out what we want to group by.
Let's begin by exploring our data:
# Look at the column names
surveys_df.columns
outputIndex(['record_id', 'month', 'day', 'year', 'site_id', 'species_id', 'sex', 'hindfoot_length', 'weight'], dtype='object')
Let's get a list of all the species. The pd.unique
function tells us all of
the unique values in the species_id
column.
pd.unique(surveys_df['species_id'])
outputarray(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS', 'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ', 'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS', 'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX', 'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)
Challenge - Statistics
-
Create a list of unique site ID's found in the surveys data. Call it
site_names
. How many unique sites are there in the data? How many unique species are in the data? -
What is the difference between
len(site_names)
andsurveys_df['site_id'].nunique()
?
Groups in Pandas
We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per site.
We can calculate basic statistics for all records in a single column using the syntax below:
surveys_df['weight'].describe()
outputcount 32283.000000 mean 42.672428 std 36.631259 min 4.000000 25% 20.000000 50% 37.000000 75% 48.000000 max 280.000000 Name: weight, dtype: float64
We can also extract one specific metric if we wish:
surveys_df['weight'].min()
surveys_df['weight'].max()
surveys_df['weight'].mean()
surveys_df['weight'].std()
# only the last command shows output below - you can try the others above in new cells
surveys_df['weight'].count()
output32283
But if we want to summarize by one or more variables, for example sex, we can
use Pandas' .groupby
method. Once we've created a groupby DataFrame, we
can quickly calculate summary statistics by a group of our choice.
# Group data by sex
grouped_data = surveys_df.groupby('sex')
The pandas function describe
will return descriptive stats including: mean,
median, max, min, std and count for a particular column in the data. Note Pandas'
describe
function will only return summary values for columns containing
numeric data.
# Summary statistics for all numeric columns by sex
grouped_data.describe()
# Provide the mean for each numeric column by sex
# As above, only the last command shows output below - you can try the others above in new cells
grouped_data.mean()
record_id | month | day | year | site_id | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|
sex | |||||||
F | 18036.412046 | 6.583047 | 16.007138 | 1990.644997 | 11.440854 | 28.836780 | 42.170555 |
M | 17754.835601 | 6.392668 | 16.184286 | 1990.480401 | 11.098282 | 29.709578 | 42.995379 |
The groupby
command is powerful in that it allows us to quickly generate
summary stats.
Challenge - Summary Data
-
How many recorded individuals are female
F
and how many maleM
- A) 17348 and 15690
- B) 14894 and 16476
- C) 15303 and 16879
- D) 15690 and 17348
-
What happens when you group by two columns using the following syntax and then grab mean values:
grouped_data2 = surveys_df.groupby(['site_id','sex'])
grouped_data2.mean()
-
Summarize weight values for each site in your data. HINT: you can use the following syntax to only create summary statistics for one column in your data
by_site['weight'].describe()
Quickly Creating Summary Counts in Pandas
Let's next count the number of samples for each species. We can do this in a few
ways, but we'll use groupby
combined with a count()
method.
# Count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)
outputspecies_id AB 303 AH 437 AS 2 BA 46 CB 50 CM 13 CQ 16 CS 1 CT 1 CU 1 CV 1 DM 10596 DO 3027 DS 2504 DX 40 NL 1252 OL 1006 OT 2249 OX 12 PB 2891 PC 39 PE 1299 PF 1597 PG 8 PH 32 PI 9 PL 36 PM 899 PP 3123 PU 5 PX 6 RF 75 RM 2609 RO 8 RX 2 SA 75 SC 1 SF 43 SH 147 SO 43 SS 248 ST 1 SU 5 UL 4 UP 8 UR 10 US 4 ZL 2 Name: record_id, dtype: int64
Or, we can also count just the rows that have the species "DO":
surveys_df.groupby('species_id')['record_id'].count()['DO']
output3027
Basic Math Functions
If we wanted to, we could perform math on an entire column of our data. For example let's multiply all weight values by 2. A more practical use of this might be to normalize the data according to a mean, area, or some other value calculated from our data.
# Multiply all weight values by 2 but does not change the original weight data
surveys_df['weight']*2
output0 NaN 1 NaN 2 NaN 3 NaN 4 NaN ... 35544 NaN 35545 NaN 35546 28.0 35547 102.0 35548 NaN Name: weight, Length: 35549, dtype: float64
Quick & Easy Plotting Data Using Pandas
We can plot our summary stats using Pandas, too.
## To make sure figures appear inside Jupyter Notebook
%matplotlib inline
# Create a quick bar chart
species_counts.plot(kind='bar')
output
Animals per site plot
We can also look at how many animals were captured in each site.
total_count = surveys_df.groupby('site_id')['record_id'].nunique()
# Let's plot that too
total_count.plot(kind='bar')
output
Extra Plotting Challenge
-
Create a plot of average weight across all species per plot.
-
Create a plot of total males versus total females for the entire dataset.
-
Create a stacked bar plot, with weight on the Y axis, and the stacked variable being sex. The plot should show total weight by sex for each plot. Some tips are below to help you solve this challenge: For more on Pandas plots, visit this link.