Starting out in R, we tend to use it interactively, running one command at a time. In this section we will look at writing R code that performs repetitive tasks using “for-loops”, packages up commonly needed sequences of operations in “functions”, and makes decisions using “if-statements”.

Some comments on programming to aid in setting context when learning to program. What do computers do? They perform very simple instructions from a long list, one after the other. The instructions may be to:

  1. Load or store a number from main memory.

  2. Combine numbers, such as by performing arithemtic or making comparisons.

  3. Jump somewhere else in the list of instructions if a condition is met.

This is called “machine code” and would be annoying to work with directly, so we use programming languages to specify what we want the computer to do in a more structured way. A single instruction in a programming language might translate into hundreds of simple instructions that together do something meaningful, and rather than jumping around a single long list of instructions we write code using structures like for-loops, functions, and if-statements.

Learning to use a specific programming language includes learning to read its syntax – the use of brackets, keywords, etc – and how to use the specific structures that it provides.

As you gain experience you will build a mental library of patterns or idioms - small collections of code and concepts that you can apply, with minor modifications, to many problems. Educators call this algorithmic thinking. We’ll introduce a few patterns that we’ve found useful in a variety of applications.

While R is known as a statistical environment, it is built around a general purpose programming language. This is part of what has made so many extensions possible. The big gains of learning to use this language are especially evident in the early stages of a data analysis problem - importing, cleaning and visualizing the data.

We are using a couple of packages from CRAN in this tutorial, which we can install with install.packages. If you did this while setting up for this workshop, you won’t need to do it again now.

# Install the entire Tidyverse collection of packages with:
#
#   install.packages("tidyverse") 
#
# or just install packages needed for this section with:
#
#   install.packages(c(
#       "readr",    # read tabular data
#       "dplyr"     # general data frame manipulation
#   ))

If-statements

One fundamental programming concept is the “if-statement”. An if-statement lets us do something only if some logical value is TRUE, or do one thing if it is TRUE and another if it is FALSE.

In the introductory R day, we used logical vectors to query a data frame. Everything we learned about making and manipulating logical vectors is applicable here, but with the restriction that we need a single logical value, i.e. a logical vector of length 1.

The general pattern of an if statement is:

if (LOGICAL_VALUE) {
    THING_TO_DO_IF_TRUE
} else {
    THING_TO_DO_IF_FALSE
}

Here is an example:

num <- 37                   # 1
if (num > 100) {            #   2
  cat("greater\n")          #
} else {                    #
  cat("not greater\n")      #     3
}                           #       
cat("done\n")               #       4
                            # --time-->
## not greater
## done

The second line of this code uses an if-statement to tell R that we want to make a choice. If the following test is TRUE, the body of the if (i.e., the lines in the curly braces underneath it) are executed. If the test is FALSE, the body of the else is executed instead. Only one or the other is ever executed.

In the example above, the test num > 100 returns the value FALSE, which is why the code inside the if block was skipped and the code inside the else block was run instead.

If-statements don’t have to include an else. If there isn’t one, R simply does nothing if the test is FALSE:

num <- 53                            # 1
if (num > 100) {                     #   2
    cat("num is greater than 100\n") #
}                                    #     
cat("done\n")                        #     3
                                     # --time-->
## done

We can also chain several tests together when there are more than two options. As an example, here is some code that works out the sign of a number:

if (num > 0) {         # line 1
    print(1)           # line 2
} else if (num == 0) { # line 3
    print(0)           # line 4
} else {                  
    print(-1)          # line 5
}                         
## [1] 1
num <- -3
num <- 0
num <- 2/3

Quiz

Which lines above are executed, and in what order, for each value of num?

Functions

Functions are blocks of code stored in a special structure. They are modelled on the concept of a mathematical function - they receive inputs (arguments), perform operations on the inputs, and return results.

Starting out in R mostly involves using functions that other people had written. We’re now going to see how to create our own functions.

The general pattern of a function is:

FUNCTION_NAME <- function(ARGUMENT_NAME1, ARGUMENT_NAME2, ...) {
    FUNCTION_BODY
}

Here is an example:

fahr_to_kelvin <- function(temp) {
    (temp-32) * (5/9) + 273.15
}

We define fahr_to_kelvin by assigning it a function. The list of argument names are containted within parentheses. Next, the body of the function–the statements that are executed when it runs–is contained within curly braces ({}). The statements in the body have been indented by four spaces, which makes the code easier to read but does not affect how the code operates.

Note that there is a very important difference between defining a function and calling, or running a function.

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. The expression inside the function is evaluated and the result is “returned” back to whoever asked for it.

Let’s try running our function. Calling our own function is no different from calling any other function:

# freezing point of water
fahr_to_kelvin(32)
## [1] 273.15
# boiling point of water
fahr_to_kelvin(212)
## [1] 373.15

Variations

Here are some other equivalent ways to write this function.

The braces are not actually needed if the function only contains one statement.

fahr_to_kelvin <- function(temp) (temp-32) * (5/9) + 273.15

The body of the function within the {} can contain multiple lines of code, “statements”. Only the last line is returned.

fahr_to_kelvin <- function(temp) {
    kelvin <- (temp-32) * (5/9) + 273.15
    kelvin
}

return can be used to return from a function immediately. Further code will not be run.

fahr_to_kelvin <- function(temp) {
    kelvin <- (temp-32) * (5/9) + 273.15
    return(kelvin)
    plot(1:10)
}

Statements in the function may be split over several lines, so long as it is clear that the statement is incomplete due to an unclosed bracket or an operator in need of a right hand argument.

fahr_to_kelvin <- function(temp) {
    kelvin <- 
        (temp-32) * (5/9) + 
        273.15
    return(kelvin)
}

Spacing and layout is largely for our own sanity.

   fahr_to_kelvin<- 
function(
                   temp){(temp 
  -32  )*(    5 
            /9 )+ 

273.15}

Beware! however of accidentally finishing a statement early.

fahr_to_kelvin_broken <- function(temp) {
    (temp-32) * (5/9) 
        + 273.15
}

fahr_to_kelvin_broken(212)
## [1] 273.15

The first line in the body of the function is computed then discarded. The second line, + 273.15, is valid R so we don’t even get an error.

Finally, the computer doesn’t understand or care what names we give to functions, arguments, or variables.

charmander <- function(bulbasaur) {
    mew <- (bulbasaur-32) * (5/9) + 273.15
    mew
}

Composing Functions

Now that we’ve seen how to turn Fahrenheit into Kelvin, it’s easy to turn Kelvin into Celsius:

kelvin_to_celsius <- function(temp) {
    temp - 273.15
}

#absolute zero in Celsius
kelvin_to_celsius(0)
## [1] -273.15

What about converting Fahrenheit to Celsius? We could write out the formula, but we don’t need to. Instead, we can compose the two functions we have already created:

fahr_to_celsius <- function(temp) {
    temp_k <- fahr_to_kelvin(temp)
    temp_c <- kelvin_to_celsius(temp_k)
    temp_c
}

# freezing point of water in Celsius
fahr_to_celsius(32.0)
## [1] 0

This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here–typically half a dozen to a few dozen lines–but they shouldn’t ever be much longer than that, or the next person who reads it won’t be able to understand what’s going on.

Function names and argument names are important cues to what code is doing.

Functions also help manage code complexity by cleaning up their contents after completion. You might have noticed that all of these functions have an argument called temp. Why hasn’t this caused confusion and chaos? The answer is that arguments to a function and variables defined in the body of a function are local to a call to that function. Within the body of a function, R code can see variables from its own local environment, and variables from the global environment, but not variables local to other function calls. While running code R maintains a stack of local environments. When a call to a function returns, all of its local arguments and variables disappear.

Tip: There are several ways to see this in action.

  • Scatter cats through your functions showing the values of arguments and variables at various points.

  • Insert a call to browser() in one of your functions. This will pause execution and let you interact with the local environment in a function. Once in the browser, type “help” for a list of commands. Besides examining variables, you can step through the function from the point it paused with “n” or, at a finer grain, step into functions it calls with “s”. When you are done type “f”.

  • You can debug an existing function using debugonce. This is like temporarily adding a call to browser() at the top of the function.

# debugging a function
debugonce(fahr_to_celsius)
fahr_to_celsius(212)

Challenge

Write a function to calculate the length of the hypotenuse of a right angled triangle using Pythagorus’s rule, given the lengths of the other sides.

Hint: sqrt calculates the square root of a number.

Testing your code is important. Invent a test case for your code consisting of:

  • The input arguments to your function.
  • The return value you expect.

Confirm that your function works as expected.

For-loops

There are three common approaches to repetitive tasks in R:

  1. Copy and paste the code a few times. (Don’t do this.)
  2. Write a for-loop.
  3. Write a function, and give this as an argument to an existing function that will call it repeatedly.

Method 1 is error prone, and also hard to fix if you find a mistake in your code.

We will now look at method 2, using for-loops.

The general pattern of a loop is:

for(VARIABLE_NAME in VECTOR) {
  FOR_LOOP_BODY
}

Let’s look at an example. Suppose we have a calculation we need to perform on a series of numbers.

i <- 10
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 20
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 30
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 40
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 50
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
## i is 10 
## i squared is 100 
## i is 20 
## i squared is 400 
## i is 30 
## i squared is 900 
## i is 40 
## i squared is 1600 
## i is 50 
## i squared is 2500

Apart from the i <- ... lines, the code is the same each time. This can be re-written as a for-loop:

for(i in c(10,20,30,40,50)) {       #    1
    cat("i is",i,"\n")              #      2   4   6   8   10
    cat("i squared is",i*i,"\n")    #        3   5   7   9    11
}                                   #
                                    #
cat("done\n")                       #                            12
                                    #   --order-of-execution-->
## i is 10 
## i squared is 100 
## i is 20 
## i squared is 400 
## i is 30 
## i squared is 900 
## i is 40 
## i squared is 1600 
## i is 50 
## i squared is 2500 
## done

We can name the loop variable anything we like. in is part of the for syntax. The body of the loop is enclosed in curly braces { }. For a single-line loop body the braces aren’t needed, but it is good practice to always include them.

The loop variable i is assigned each element in the vector in turn, and then the loop body runs. There’s no possibility here of making a mistake when copying and pasting the code, and if we later discover we need to change this code, the for-loop version will be much easier to change. It will also be possible to run this code a different number of times by changing the vector used.

Challenge

As a small group, compare your answers as you go.

  1. Write down what these lines of code do in English. Check they do what you expect by running them in R. Try adding cats or prints to the for-loop body to check what is going on.
myvec <- c(10,20,30,40)

total <- 0
for(item in myvec) {
    total <- total + item
}

total
  1. Write down what these lines of code do in English. How could this be changed to work with any length of myvec?
myvec <- c(10,20,30,40)

for(index in 1:4) {
    myvec[index] <- myvec[index] * 2
}
  1. Write the steps to calculate n factorial in English, i.e. 1*2*3* ... *n.

  2. Write R code to calculate 10 factorial.

numbers <- 1:10

... your code here ...

A practical programming exercise

Let’s say we want to examine the quality of some FASTQ files, which contain reads from a DNA sequencing machine. An experiment has been performed over several days, and we want to run a program called “fastqc” on each of the FASTQ files.

FASTQ is a text format containg a series of DNA sequences and associated quality information. Examine “Day0.fastq” in the “r-progtidy-files” folder.

(This data is a very small portion of the reads from an experiment inducing pluripotency in a mouse cell line, GSE70022.)

Running external software

Software outside of R is normally run from a “terminal” window, which is rather like the console in RStudio. You type in a command and the operating system runs it for you. The commands you type in are interpreted by something called a “shell”, because it acts like a shell around the operating system. For more information, see the Software Carpentry course on the Unix shell.

In RStudio, you can open a terminal using “Tools/Terminal/New Terminal”. Try this and type in:

uptime

The “uptime” command tells you how long the computer has been running for. R can give a command to the shell using the system function.

system("uptime")

Here is how we want to run FastQC for day 0. To R the command has no meaning beyond being a character string. It will be interpreted by the shell, which is external to R. The shell will then run FastQC for us.

system("FastQC/fastqc --extract --outdir . r-progtidy-files/Day0.fastq")

In the “Files” tab of the bottom right pane in RStudio, you should see that some new files and a directory have been created. Click on “Day0_fastqc.html” to examine it. Also examine the file in the “Day0_fastqc” folder called “summary.txt”.

If you don’t have FastQC, or it failed to run for some reason: The expected output files can be found in the folder “r-progtidy-files/fastqc-output”.

Using a for-loop

Typing out the fastqc command for each day will get repetitive. Let’s use a for-loop to automate this task. The paste0 function lets us “paste” together character strings, so we can use that to construct the commands to run, like this:

# construct a command to run
day <- 0
command <- paste0("FastQC/fastqc --extract --outdir . r-progtidy-files/Day", day, ".fastq")
command
## [1] "FastQC/fastqc --extract --outdir . r-progtidy-files/Day0.fastq"

We want to do this for each day, and then use system to run the resulting command:

# run fastqc on each of the files

days <- c(0,4,7,10,15,20)

for(day in days) {
    command <- paste0("FastQC/fastqc --extract --outdir . r-progtidy-files/Day", day, ".fastq")
    cat("Running the command:", command, "\n")
    system(command)
}
## Running the command: FastQC/fastqc --extract --outdir . r-progtidy-files/Day0.fastq 
## Running the command: FastQC/fastqc --extract --outdir . r-progtidy-files/Day4.fastq 
## Running the command: FastQC/fastqc --extract --outdir . r-progtidy-files/Day7.fastq 
## Running the command: FastQC/fastqc --extract --outdir . r-progtidy-files/Day10.fastq 
## Running the command: FastQC/fastqc --extract --outdir . r-progtidy-files/Day15.fastq 
## Running the command: FastQC/fastqc --extract --outdir . r-progtidy-files/Day20.fastq

Loading the summary.txt files

Note: If you weren’t able to run FastQC earlier, the expected output files can be found in the folder “r-progtidy-files/fastqc-output”. You will need to adjust the filenames appropriately in the R code below.

The summary.txt files are in “tab separated value” format. This is similar to comma separated value format we’ve seen in the introductory R day, but instead of using a comma it uses a special character called a tab to delimit columns. Tabs show up as variable amounts of space in a text editor. In R, they can be written "\t".

In base R we could use the function read.delim to read this file. It’s quite similar to read.csv. However, let’s use the Tidyverse package readr to load the file. To use readr we first need to load it with library.

library(readr)

We can now read the file like this:

read_tsv("Day0_fastqc/summary.txt")
## Rows: 10 Columns: 3
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): PASS, Basic Statistics, Day0.fastq
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 10 × 3
##    PASS  `Basic Statistics`           Day0.fastq
##    <chr> <chr>                        <chr>     
##  1 PASS  Per base sequence quality    Day0.fastq
##  2 PASS  Per tile sequence quality    Day0.fastq
##  3 PASS  Per sequence quality scores  Day0.fastq
##  4 WARN  Per base sequence content    Day0.fastq
##  5 WARN  Per sequence GC content      Day0.fastq
##  6 PASS  Per base N content           Day0.fastq
##  7 PASS  Sequence Length Distribution Day0.fastq
##  8 PASS  Sequence Duplication Levels  Day0.fastq
##  9 WARN  Overrepresented sequences    Day0.fastq
## 10 PASS  Adapter Content              Day0.fastq

Oh! There aren’t column headings in this file. We need to tell read_tsv this.

read_tsv("Day0_fastqc/summary.txt", col_names=FALSE)
## Rows: 11 Columns: 3
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 11 × 3
##    X1    X2                           X3        
##    <chr> <chr>                        <chr>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 WARN  Per base sequence content    Day0.fastq
##  6 WARN  Per sequence GC content      Day0.fastq
##  7 PASS  Per base N content           Day0.fastq
##  8 PASS  Sequence Length Distribution Day0.fastq
##  9 PASS  Sequence Duplication Levels  Day0.fastq
## 10 WARN  Overrepresented sequences    Day0.fastq
## 11 PASS  Adapter Content              Day0.fastq

Finally, readr is nudging us to specify the column types. Otherwise it has to guess, and sometimes guesses wrong. Not guessing things is especially important for programming as opposed to interactive use of R, as we won’t immediately see the mistake. Have a look at ?read_tsv to learn about using col_types.

read_tsv("Day0_fastqc/summary.txt", col_names=FALSE, col_types="ccc")
## # A tibble: 11 × 3
##    X1    X2                           X3        
##    <chr> <chr>                        <chr>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 WARN  Per base sequence content    Day0.fastq
##  6 WARN  Per sequence GC content      Day0.fastq
##  7 PASS  Per base N content           Day0.fastq
##  8 PASS  Sequence Length Distribution Day0.fastq
##  9 PASS  Sequence Duplication Levels  Day0.fastq
## 10 WARN  Overrepresented sequences    Day0.fastq
## 11 PASS  Adapter Content              Day0.fastq

We should tidy this up a bit before using it. Columns should have meaningful names, and the PASS/WARN/FAIL grading looks like it should be a factor:

filename <- "Day0_fastqc/summary.txt"
sumtab <- read_tsv(filename, col_names=FALSE, col_types="ccc")
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
## # A tibble: 11 × 3
##    grade test                         file      
##    <fct> <chr>                        <chr>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 WARN  Per base sequence content    Day0.fastq
##  6 WARN  Per sequence GC content      Day0.fastq
##  7 PASS  Per base N content           Day0.fastq
##  8 PASS  Sequence Length Distribution Day0.fastq
##  9 PASS  Sequence Duplication Levels  Day0.fastq
## 10 WARN  Overrepresented sequences    Day0.fastq
## 11 PASS  Adapter Content              Day0.fastq

We expect to have to examine many FastQC reports in future. It will be convenient to have this as a function!

filename is going to be different each time we use this code, so it wants to be an argument to the function. The rest of the code goes in the body of the function. The returned value will be sumtab.

load_fastqc <- function(filename) {
    sumtab <- read_tsv(filename, col_names=FALSE, col_types="ccc")
    colnames(sumtab) <- c("grade", "test", "file")
    sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
    sumtab
}

load_fastqc("Day0_fastqc/summary.txt")
## # A tibble: 11 × 3
##    grade test                         file      
##    <fct> <chr>                        <chr>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 WARN  Per base sequence content    Day0.fastq
##  6 WARN  Per sequence GC content      Day0.fastq
##  7 PASS  Per base N content           Day0.fastq
##  8 PASS  Sequence Length Distribution Day0.fastq
##  9 PASS  Sequence Duplication Levels  Day0.fastq
## 10 WARN  Overrepresented sequences    Day0.fastq
## 11 PASS  Adapter Content              Day0.fastq

Applying the function

We now want to load each of the summary.txt files we created earlier. To do this we’re going to use an “apply”-style function. “apply”-style functions are functions that call some other function repeatedly on subsets of a data set. Some examples are apply, lapply, vapply, and tapply. apply itself is for use with matrices and usually produces a vector as output. Other functions in this family have different input and output types. Writing code that makes use of this type of function is called “functional programming”.

The flavour of apply we need here is lapply, which takes a vector and calls a function on each element in turn of that vector (i.e. the “subsets” here are simply individual elements). The result is a list containing the returned values, hence the name lapply.

First let’s work out the names of the files we want to load.

days <- c(0,4,7,10,15,20)
filenames <- paste0("Day", days, "_fastqc/summary.txt")
filenames
## [1] "Day0_fastqc/summary.txt"  "Day4_fastqc/summary.txt" 
## [3] "Day7_fastqc/summary.txt"  "Day10_fastqc/summary.txt"
## [5] "Day15_fastqc/summary.txt" "Day20_fastqc/summary.txt"

Now we can use lapply to apply our function to each of the filenames.

sumtabs <- lapply(filenames, load_fastqc)

Using lapply is “functional programming”, where we define a series of objects by using functions on earlier objects. In this approach objects are never modified once created. Sometimes the objects used are themselves functions.

See also: purrr package.

An alternative would be to use a for loop. Using a for loop is “procedural programming”, where the result is modified step by step until we obtain the value we want:

sumtabs <- list()
for(idx in 1:length(filenames)) {
    sumtabs[[idx]] <- load_fastqc(filenames[idx])
}

sumtabs is a list. Lists can contain things of different types and different sizes. Here we have a list of data frames. Lists are a little different from the vectors we usually work with. See section 20.5 in the “R for data science” book for a comprehensive explanation.

Here are some ways to examine this object:

sumtabs
class(sumtabs)
length(sumtabs)
str(sumtabs)

When we print sumtabs on the console, it gives us a hint how to access individual elements: using the [[ ]] syntax:

sumtabs[[1]]
sumtabs[[2]]

It would be nicer if this was a single big data frame. The dplyr package provides a function to do this:

library(dplyr)
bigtab <- bind_rows(sumtabs)

bigtab
## # A tibble: 66 × 3
##    grade test                         file      
##    <fct> <chr>                        <chr>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 WARN  Per base sequence content    Day0.fastq
##  6 WARN  Per sequence GC content      Day0.fastq
##  7 PASS  Per base N content           Day0.fastq
##  8 PASS  Sequence Length Distribution Day0.fastq
##  9 PASS  Sequence Duplication Levels  Day0.fastq
## 10 WARN  Overrepresented sequences    Day0.fastq
## # … with 56 more rows
table(bigtab$test, bigtab$grade)
##                               
##                                FAIL WARN PASS
##   Adapter Content                 0    0    6
##   Basic Statistics                0    0    6
##   Overrepresented sequences       0    4    2
##   Per base N content              0    0    6
##   Per base sequence content       5    1    0
##   Per base sequence quality       0    0    6
##   Per sequence GC content         2    4    0
##   Per sequence quality scores     0    0    6
##   Per tile sequence quality       0    0    6
##   Sequence Duplication Levels     0    0    6
##   Sequence Length Distribution    0    0    6

Summary

What we’ve just done is:

  1. Create a custom reader tool that sets up each file inside R

  2. Use a looping tool to apply that function to a collection of file names

  3. Transform the results produced by the looping tool into a data frame that we can use for other purposes.

Improving load_fastqc

Our load_fastqc function will currently fail in a confusing way if it is given the wrong file:

load_fastqc("Day0_fastqc/fastqc_data.txt")
## Warning: One or more parsing issues, see `problems()` for details
## # A tibble: 451 × 2
##    grade test                   
##    <fct> <chr>                  
##  1 <NA>  0.11.8                 
##  2 <NA>  pass                   
##  3 <NA>  Value                  
##  4 <NA>  Day0.fastq             
##  5 <NA>  Conventional base calls
##  6 <NA>  Sanger / Illumina 1.9  
##  7 <NA>  1000                   
##  8 <NA>  0                      
##  9 <NA>  101                    
## 10 <NA>  49                     
## # … with 441 more rows

It’s best to stop as soon as something goes wrong, and with an informative error message.

load_fastqc <- function(filename) {
    # Load file
    sumtab <- read_tsv(filename, col_names=FALSE, col_types="ccc")

    # Check number of columns
    if (ncol(sumtab) != 3) {
        stop("Wrong number of columns in file!")
    }

    # Make it nicer to work with
    colnames(sumtab) <- c("grade", "test", "file")
    sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
    sumtab
}

load_fastqc("Day0_fastqc/fastqc_data.txt")
## Warning: One or more parsing issues, see `problems()` for details
## Error in load_fastqc("Day0_fastqc/fastqc_data.txt"): Wrong number of columns in file!

Checking input arguments before proceeding is a common pattern when writing R functions.

Sourcing .R files

Having developed some useful functions, we might want to re-use them in a future project or share them with a friend. We can store R code in a file, usually with extension “.R”, and load it with the source function.

Put your load_fastqc function in a file called “fastqc.R”. It uses the readr library, so be sure to load the library in the file as well.

# fastqc.R file should contain:

library(readr)

load_fastqc <- function(filename) {
    # Load file
    sumtab <- read_tsv(filename, col_names=FALSE, col_types="ccc")

    # Check number of columns
    if (ncol(sumtab) != 3) {
        stop("Wrong number of columns in file!")
    }

    # Make it nicer to work with
    colnames(sumtab) <- c("grade", "test", "file")
    sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
    sumtab
}
# From the console:

source("fastqc.R", local=TRUE)

Everything in the file runs as though it were typed on the console.

Similarly, if you are going to be doing several different things with a data set, you could write a .R file to load the data into a tidy form, and several other .R scripts to do various different things with it.

Organizing a project

I like to organize large projects into multiple folders:

/raw        Raw data (do not edit!)
/output     Output files
/scripts    R scripts
/reports    Reports in R Markdown
README.txt

Use version control, such as Git + GitHub.

Some tips I’ve gathered:

  • Working on a project is iterative. Sometimes we put code for multiple approaches side by side. Sometimes we use version control features such as branches.

  • Coming back to a project, we’ve regretted not writing down the order things need to be run to fully reproduce outputs and reports, and what the preferred approach was if we tried many things.

  • Some people like workflowr, which automates working on a large project.

  • The here package is useful when you have multiple folders.

See also Software Carpentry’s list of best practices in R: http://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/

See also Data Fluency workshop “Introduction to Data Organization.”

Packages

Packages are the next step up from sourcing .R files. They let you write code that other people can install and then load with library.

Hadley Wickham has a package called devtools that takes a lot of the pain out of package writing. Also useful is the package usethis, which automates creation of a package directory and basic files, and can set up various other parts of a package.

Packages generally contain documentation and example code for all functions, and one or more vignettes describing how to use the package. Function documentation is usually written as specially formatted comments before each function using roxygen2.

library(devtools)
library(usethis)

# Create an empty package template
create_package("mypack", open=FALSE)

# ... Edit mypack/DESCRIPTION file
# ... Write .R files in mypack/R folder
# For example create a file mypack/R/hello.R containing:
# # # # #

#' Greeting function
#'
#' This function displays a greeting message.
#'
#' @examples
#' hello()
#'
#' @export
hello <- function() {
    cat("Hello, world.\n")
}

# # # # #

# Build package documentation, converting inline documentation to .Rd files using roxygen2.
# Update NAMESPACE file (lists functions the package exports for public consumption).
document("mypack")

# Load package. Use this during development.
load_all("mypack")
hello()

# Check for common problems and missing documentation.
# This also automatically runs document( ).
# A CRAN package must pass all checks.
check("mypack")

create_package creates a .Rproj project file for the package. If you open this project, RStudio changes to the package directory, and there is a new “Build” tab. This provides tools that parallel what Hadley’s devtools package provides with load_all and check, you can decide which you prefer! Old-school R developers may also use R CMD commands from the terminal (try typing R -h in the “Terminal” tab).

Packages are most easily distributed by placing them on GitHub or GitLab. They can then be installed by others using the remotes package functions install_github or install_gitlab.

# To install from GitHub:
remotes::install_github("myusername/mypack")

Once a package is mature and well documented, it can be submitted to CRAN or Bioconductor.

CRAN submissions: CRAN is quite strict about packages passing automated checks, and they will also manually review your package. Submission is by uploading the package tarball using a web form. Each new version also gets a brief manual review.

Bioconductor submissions: Bioconductor is for packages that deal with large biology-related datasets. Packages should make use of existing Bioconductor data types and infrastructure where possible. Bioconductor has some additional automated checks you will need to pass. The initial review process is more intensive than CRAN’s and happens in GitHub, so some familiarity with git will be necessary. Once the package is accepted, changes can be pushed to the Bioconductor git repositories without further manual review. You are also expected to be available to answer questions on their Stack Overflow-style support site and subscribe to the developer mailing list.

# (Recording the R version and versions of packages used improves reproducibility.)
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.0.10 readr_2.1.2 
## 
## loaded via a namespace (and not attached):
##  [1] pillar_1.8.1     bslib_0.4.0      compiler_4.2.1   jquerylib_0.1.4 
##  [5] tools_4.2.1      digest_0.6.30    bit_4.0.4        jsonlite_1.8.3  
##  [9] evaluate_0.16    lifecycle_1.0.2  tibble_3.1.8     pkgconfig_2.0.3 
## [13] rlang_1.0.6      DBI_1.1.3        cli_3.4.1        yaml_2.3.6      
## [17] parallel_4.2.1   xfun_0.33        fastmap_1.1.0    stringr_1.4.1   
## [21] knitr_1.40       generics_0.1.3   vctrs_0.4.2      sass_0.4.2      
## [25] hms_1.1.2        bit64_4.0.5      tidyselect_1.1.2 glue_1.6.2      
## [29] R6_2.5.1         fansi_1.0.3      vroom_1.5.7      rmarkdown_2.16  
## [33] tzdb_0.3.0       purrr_0.3.4      magrittr_2.0.3   ellipsis_0.3.2  
## [37] htmltools_0.5.3  assertthat_0.2.1 utf8_1.2.2       stringi_1.7.8   
## [41] cachem_1.0.6     crayon_1.5.1