Reusable and modular code with functions
Functions
Functions wrap up reusable pieces of code - they help you apply the Don't Repeat Yourself (DRY) principle.
Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a for loop like the one above every time we needed to do it, but that would be time consuming and error prone. A more elegant solution would be to create a reusable tool that performs this task with minimum input from the user. To do this, we are going to turn the code we’ve already written into a function.
Functions are reusable, self-contained pieces of code that are called with a single command. They can be designed to accept arguments as input and return values, but they don’t need to do either. Variables declared inside functions only exist while the function is running and if a variable within the function (a local variable) has the same name as a variable somewhere else in the code, the local variable hides but doesn’t overwrite the other.
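The point about local variables can be seen in a small sketch. The names message and show_scope are made up for this illustration and are not part of the lesson's code:

```python
# Hypothetical names, chosen only for this illustration
message = 'global value'

def show_scope():
    # This assignment creates a new local variable called message;
    # it hides the global one while the function is running.
    message = 'local value'
    return message

inside = show_scope()
print('Inside the function:', inside)
print('Outside the function:', message)
```

The variable defined outside the function keeps its original value after the call, because the assignment inside the function only affected the local variable.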
Many of the commands we use in Python (for example, print) are functions, and the libraries we import (say, pandas) are largely collections of functions. We will only use functions that are housed within the same code that uses them, but it’s also easy to write functions that can be used by different programs.
Functions are declared following this general structure:
def this_is_the_function_name(input_argument1, input_argument2):
    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')
    # And returns their product
    return input_argument1 * input_argument2
The function declaration starts with the word def, followed by the function name and any arguments in parentheses, and ends in a colon. The body of the function is indented just like loops are. If the function returns something when it is called, it includes a return statement at the end.
Let's rewrite this function with shorter (but still informative) names so we don't need to type as much:
def product(a, b):
    print('The function arguments are:', a, b, '(this is done inside the function!)')
    return a * b
This is how we call the function:
product_of_inputs = product(2, 5)
output:
The function arguments are: 2 5 (this is done inside the function!)
print('Their product is:', product_of_inputs, '(this is done outside the function!)')
output:
Their product is: 10 (this is done outside the function!)
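Functions do not have to accept arguments or return a value. As a minimal sketch (say_hello is a made-up name used only here), this is what a call looks like when the function has neither:

```python
def say_hello():
    # No arguments and no return statement
    print('Hello from inside the function!')

# The call still works, but there is nothing useful to store
result = say_hello()
print('Value returned:', result)
```

When a function has no return statement, Python hands back the special value None, which is what ends up in result.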
Challenge - Functions
- Change the values of the input arguments in the function and check its output.
- Try calling the function by giving it the wrong number of arguments (not 2) or not assigning the function call to a variable (no product_of_inputs =).
Bonus challenges
- Declare a variable inside the function and test to see where it exists (Hint: can you print it from outside the function?).
- Explore what happens when a variable inside the function and a variable outside the function have the same name. What happens to the global variable when you change the value of the local variable?
Say we had some code for taking our surveys.csv data and splitting it out into one file for each year:
# First let's make sure we've read the survey data into a pandas DataFrame.
import pandas as pd
all_data = pd.read_csv("surveys.csv")
this_year = 2002
# Select data for just that year
surveys_year = all_data[all_data.year == this_year]
# Write the new DataFrame to a csv file
filename = 'surveys' + str(this_year) + '.csv'
surveys_year.to_csv(filename)
There are many different "chunks" of this code that we can turn into functions, and we can even create functions that call other functions inside them. Let’s first write a function that separates data for just one year and saves that data to a file:
def year_to_csv(year, all_data):
    """
    Writes a csv file for data from a given year.
    year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """
    # Select data for the year
    surveys_year = all_data[all_data.year == year]
    # Write the new DataFrame to a csv file
    filename = 'function_surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
The text between the two sets of triple double quotes is called a docstring and contains the documentation for the function. It does nothing when the function is running and is therefore not necessary, but it is good practice to include docstrings as a reminder of what the code does. Docstrings in functions also become part of their ‘official’ documentation:
?year_to_csv
Signature: year_to_csv(year, all_data)
Docstring:
Writes a csv file for data from a given year.
year --- year for which data is extracted
all_data --- DataFrame with multi-year data
File: ~/devel/python-workshop-base/workshops/docs/modules/notebooks/<ipython-input-16-978149c5937c>
Type: function
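The ?name syntax shown above is a feature of IPython and Jupyter. In a plain Python interpreter the same documentation can be reached with the built-in help function or with the inspect module. The sketch below uses a small made-up function, year_label, so that it runs without any data files:

```python
import inspect

def year_label(year):
    """
    Returns a text label for a given year.
    year --- year to include in the label
    """
    return 'surveys' + str(year)

# help() prints the signature and docstring in any Python interpreter
help(year_label)

# inspect.getdoc() returns the docstring as a cleaned-up string
doc_text = inspect.getdoc(year_label)
print(doc_text)
```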
# Read the survey data into a pandas DataFrame.
# (if you are jumping in to just this lesson and don't have the surveys.csv file yet,
# see the "Data analysis in Python with Pandas" `working_with_data` module)
import pandas as pd
surveys_df = pd.read_csv("surveys.csv")
year_to_csv(2002, surveys_df)
Aside - listing files and the os module
Google Colaboratory and Jupyter Notebooks have a built-in file browser. However, you can also list the files and directories in the current directory ("folder") with Python code like:
import os
print(os.listdir())
You'll see a Python list, a bit like:
['surveys.csv','function_surveys2002.csv']
(you may have additional files listed here, generated in previous lessons)
The os module contains, among other things, a bunch of useful functions for working with the filesystem and file paths.
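One function from os.path that is handy when building file names is os.path.join, which glues directory and file names together using the right separator for your operating system. The directory and file names in this sketch are made up for illustration, and nothing is written to disk:

```python
import os

# Hypothetical directory and file names, for illustration only
output_dir = os.path.join('data', 'csvs')
file_path = os.path.join(output_dir, 'surveys2002.csv')

print(file_path)

# A path can also be split back into its directory and file name
print(os.path.dirname(file_path))
print(os.path.basename(file_path))
```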
Two other useful examples (hint - these might help in an upcoming challenge):
# This returns True if the file or directory specified exists
os.path.exists('surveys.csv')
# This creates empty (nested) directories based on a path (eg in a 'path' each directory is separated by slashes)
os.makedirs('data/csvs/')
If a directory already exists, os.makedirs fails and produces an error message (in Python terminology we might say it 'raises an exception').
We can avoid this by using os.path.exists and os.makedirs together like:
if not os.path.exists('data/csvs/'):
    os.makedirs('data/csvs/')
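As an alternative to checking first, os.makedirs accepts an exist_ok argument that tells it not to raise an error when the directory is already there. The sketch below works inside a temporary directory (an assumption made here so the example leaves nothing behind on your disk):

```python
import os
import tempfile

# Work inside a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, 'data', 'csvs')

    # exist_ok=True suppresses the error raised when the directory already exists
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)  # calling it again is harmless

    created = os.path.isdir(target)

print('Directory was created:', created)
```

Either approach works for the challenges that follow. The exist_ok form is simply shorter.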
What we really want to do, though, is create files for multiple years without having to request them one by one. Let’s write another function that uses a for loop over a sequence of years and repeatedly calls the function we just wrote, year_to_csv:
def create_csvs_by_year(start_year, end_year, all_data):
    """
    Writes separate CSV files for each year of data.
    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """
    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        year_to_csv(year, all_data)
Because people will naturally expect that the end year for the files is the last year with data, the range used by the for loop inside the function stops at end_year + 1 (range leaves out its stop value, so the last year processed is end_year). By writing the entire loop into a function, we’ve made a reusable tool for whenever we need to break a large data file into yearly files. Because we can specify the first and last year for which we want files, we can even use this function to create files for a subset of the years available. This is how we call this function:
# Create CSV files, one for each year in the given range
create_csvs_by_year(1977, 2002, surveys_df)
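To see why the loop needs end_year + 1, it can help to look at what range produces on its own. This short sketch uses a small range of years picked just for illustration:

```python
start_year = 1977
end_year = 1980

# range() stops just before its second argument
without_plus_one = list(range(start_year, end_year))
with_plus_one = list(range(start_year, end_year + 1))

print(without_plus_one)  # [1977, 1978, 1979]
print(with_plus_one)     # [1977, 1978, 1979, 1980]
```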
Challenge - More Functions
- How could you use the function create_csvs_by_year to create a CSV file for only one year? (Hint: think about the syntax for range)
- Modify year_to_csv so that it has two additional arguments, output_path (the path of the directory where the files will be written) and filename_prefix (a prefix to be added to the start of the file name). Name your new function year_to_csv_at_path (e.g., def year_to_csv_at_path(year, all_data, output_path, filename_prefix):). Call your new function to create a new file with a different name in a different directory. Hint: You could manually create the target directory before calling the function using the Colaboratory / Jupyter file browser, or for bonus points you could do it in Python inside the function using the os module.
- Create a new version of the create_csvs_by_year function called create_csvs_by_year_at_path that also takes the additional arguments output_path and filename_prefix. Internally create_csvs_by_year_at_path should pass these values to year_to_csv_at_path. Call your new function to create a new set of files with a different name in a different directory.
- Make these new functions return a list of the files they have written. There are many ways you can do this (and you should try them all!): you could make the function print the filenames to screen, or you could use a return statement to make the function produce a list of filenames, or you can use some combination of the two. You could also try using the os library to list the contents of directories.
The functions we wrote demand that we give them a value for every argument. Ideally, we would like these functions to be as flexible and independent as possible. Let’s modify the function create_csvs_by_year so that the start_year and end_year default to the full range of the data if they are not supplied by the user.
Arguments can be given default values with an equal sign in the function declaration - we call these 'keyword' arguments. Any argument in the function without a default value (here, all_data) is a required argument - we call these 'positional' arguments. Positional arguments MUST come before any keyword arguments. Keyword arguments are optional - if you don't include them when calling the function, the default value is used.
def keyword_arg_test(all_data, start_year = 1977, end_year = 2002):
    """
    A simple function to demonstrate the use of keyword arguments with defaults !
    start_year --- the first year of data we want --- default: 1977
    end_year --- the last year of data we want --- default: 2002
    all_data --- DataFrame with multi-year data - not actually used
    """
    return start_year, end_year
start,end = keyword_arg_test(surveys_df, 1988, 1993)
print('Both optional arguments:\t', start, end)
start,end = keyword_arg_test(surveys_df)
print('Default values:\t\t\t', start, end)
output:
Both optional arguments:         1988 1993
Default values:                  1977 2002
The \t characters in the print statements are tabs, used to align the text and make it easier to read.
What if our dataset doesn’t start in 1977 and end in 2002? We can modify the function so that it looks for the earliest and latest years in the dataset if those dates are not provided. Let's write a new version of the function, called csvs_by_year:
def csvs_by_year(all_data, start_year = None, end_year = None):
    """
    Writes separate CSV files for each year of data. The start year and end year can
    be optionally provided, otherwise the earliest and latest year in the dataset are
    used as the range.
    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data
    all_data --- DataFrame with multi-year data
    """
    if start_year is None:
        start_year = min(all_data.year)
    if end_year is None:
        end_year = max(all_data.year)
    return start_year, end_year
start,end = csvs_by_year(surveys_df, 1988, 1993)
print('Both optional arguments:\t', start, end)
start,end = csvs_by_year(surveys_df)
print('Default values:\t\t\t', start, end)
output:
Both optional arguments:         1988 1993
Default values:                  1977 2002
The default values of the start_year and end_year arguments in this new version of the csvs_by_year function are now None. This is a built-in constant in Python that indicates the absence of a value - essentially, that the variable exists in the namespace of the function (the directory of variable names) but that it has not been given a meaningful value.
Challenge - Experimenting with keyword arguments
- What type of object corresponds to a variable declared as None? (Hint: create a variable set to None and use the function type())
- Compare the behavior of the function csvs_by_year when the keyword arguments have None as a default vs. calling the function by supplying (non-default) values to the keyword arguments.
- What happens if you only include a value for start_year in the function call? Can you write the function call with only a value for end_year? (Hint: think about how the function must be assigning values to each of the arguments - this is related to the need to put the arguments without default values before those with default values in the function definition!)
Conditionals - if statements
The body of the test function now has two conditionals (if statements) that check the values of start_year and end_year. if statements execute a segment of code when some condition is met. They commonly look something like this:
a = 5
if a < 0:  # Meets first condition?
    # if a IS less than zero
    print('a is a negative number')
elif a > 0:  # Did not meet first condition. Meets second condition?
    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')
else:  # Met neither condition
    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')
output:
a is a positive number
Change the value of a to see how this code works. The statement elif means “else if”, and all of the conditional statements must end in a colon.
The if statements in the function csvs_by_year check whether there is a value associated with the variable names start_year and end_year. If those variables are None, the conditions evaluate to the boolean True and the if statements execute whatever is in their body. On the other hand, if the variable names are associated with some value (they got a number in the function call), the conditions evaluate to False and the bodies do not execute. The exact opposite conditions, which would be True if the variables had received a value in the function call, are if start_year is not None and if end_year is not None. The shorter forms if start_year and if end_year are often used for the same purpose, but they test whether the value counts as true, so they would also treat a value of 0 as False.
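The difference between testing for None and testing whether a value counts as true can matter when 0 is a legitimate input. The helper below, describe, is a made-up function written only to compare the tests side by side:

```python
def describe(value=None):
    # Hypothetical helper: reports how each kind of test treats value
    return {
        'is None': value is None,
        'is not None': value is not None,
        'if value': bool(value),
    }

for candidate in (None, 0, 1988):
    print(candidate, describe(candidate))
```

For None and 1988 the tests agree with what we would expect, but 0 is not None and still counts as false in a plain if test. For years this rarely matters, but it is a good reason to prefer is None when checking whether an optional argument was supplied.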
As we’ve written it so far, the function csvs_by_year associates values in the function call with arguments in the function definition based only on their order. If the function gets only two values in the function call, the first one will be associated with all_data and the second with start_year, regardless of what we intended them to be. We can get around this problem by calling the function using keyword arguments, where each of the arguments in the function definition is associated with a keyword and the function call passes values to the function using these keywords:
start,end = csvs_by_year(surveys_df)
print('Default values:\t\t\t', start, end)
start,end = csvs_by_year(surveys_df, 1988, 1993)
print('No keywords:\t\t\t', start, end)
start,end = csvs_by_year(surveys_df, start_year = 1988, end_year = 1993)
print('Both keywords, in order:\t', start, end)
start,end = csvs_by_year(surveys_df, end_year = 1993, start_year = 1988)
print('Both keywords, flipped:\t\t', start, end)
start,end = csvs_by_year(surveys_df, start_year = 1988)
print('One keyword, default end:\t', start, end)
start,end = csvs_by_year(surveys_df, end_year = 1993)
print('One keyword, default start:\t', start, end)
output:
Default values:                  1977 2002
No keywords:                     1988 1993
Both keywords, in order:         1988 1993
Both keywords, flipped:          1988 1993
One keyword, default end:        1988 2002
One keyword, default start:      1977 1993
Multiple choice challenge
What output would you expect from the if statement (try to figure out the answer without running the code):
pod_bay_doors_open = False
dave_wants_doors_open = False
hal_insanity_level = 2001
if not pod_bay_doors_open:
    print("Dave: Open the pod bay doors please HAL.")
    dave_wants_doors_open = True
elif pod_bay_doors_open and hal_insanity_level >= 95:
    print("HAL: I'm closing the pod bay doors, Dave.")
if dave_wants_doors_open and not pod_bay_doors_open and hal_insanity_level >= 95:
    print("HAL: I’m sorry, Dave. I’m afraid I can’t do that.")
elif dave_wants_doors_open and not pod_bay_doors_open:
    print("HAL: I'm opening the pod bay doors, welcome back Dave.")
else:
    print("... silence of space ...")
a) "HAL: I'm closing the pod bay doors, Dave.", "... silence of space ..."
b) "Dave: Open the pod bay doors please HAL.", "HAL: I’m sorry, Dave. I’m afraid I can’t do that."
c) "... silence of space ..."
d) "Dave: Open the pod bay doors please HAL.", "HAL: I'm opening the pod bay doors, welcome back Dave."
Bonus Challenge - Modifying functions
- Rewrite the year_to_csv and csvs_by_year functions to have keyword arguments with default values.
- Modify the functions so that they don’t create yearly files if there is no data for a given year and display an alert to the user (Hint: use conditional statements to do this. For an extra challenge, use try statements!).
- The code below checks to see whether a directory exists and creates one if it doesn’t. Add some code to your function that writes out the CSV files, to check for a directory to write to.
import os
if 'dir_name_here' in os.listdir():
    print('Processed directory exists')
else:
    os.mkdir('dir_name_here')
    print('Processed directory created')
- The code that you have written so far to loop through the years is good, however it is not necessarily reproducible with different datasets. For instance, what happens to the code if we have additional years of data in our CSV files? Using the tools that you learned in the previous activities, make a list of all years represented in the data. Then create a loop to process your data, that begins at the earliest year and ends at the latest year using that list.
HINT: you can create a loop with a list as follows: for years in year_list: