::: {.omit} Starting out in R, we tend to use it interactively, running one command at a time. In this section we will look at writing R code that performs repetitive tasks using "for-loops", packages up commonly needed sequences of operations in "functions", and makes decisions using "if-statements". Some comments on programming to aid in setting context when learning to program. What do computers do? They perform very simple instructions from a long list, one after the other. The instructions may be to: 1. Load or store a number from main memory. 2. Combine numbers, such as by performing arithemtic or making comparisons. 3. Jump somewhere else in the list of instructions if a condition is met. This is called "machine code" and would be annoying to work with directly, so we use programming languages to specify what we want the computer to do in a more structured way. A single instruction in a programming language might translate into hundreds of simple instructions that together do something meaningful, and rather than jumping around a single long list of instructions we write code using structures like for-loops, functions, and if-statements. Learning to use a specific programming language includes learning to read its _syntax_ -- the use of brackets, keywords, etc -- and how to use the specific structures that it provides. As you gain experience you will build a mental library of _patterns_ or _idioms_ - small collections of code and concepts that you can apply, with minor modifications, to many problems. Educators call this _algorithmic thinking_. We'll introduce a few patterns that we've found useful in a variety of applications. While R is known as a statistical environment, it is built around a general purpose programming language. This is part of what has made so many extensions possible. The big gains of learning to use this language are especially evident in the early stages of a data analysis problem - importing, cleaning and visualizing the data. We are using a couple of packages from CRAN in this tutorial, which we can install with `install.packages`. If you did this while setting up for this workshop, you won't need to do it again now. ::: ```{r eval=FALSE} # Install the entire Tidyverse collection of packages with: # # install.packages("tidyverse") # # or just install packages needed for this section with: # # install.packages(c( # "readr", # read tabular data # "dplyr" # general data frame manipulation # )) ``` # If-statements ::: {.omit} One fundamental programming concept is the "if-statement". An if-statement lets us do something only if some logical value is `TRUE`, or do one thing if it is `TRUE` and another if it is `FALSE`. In the introductory R day, we used logical vectors to query a data frame. Everything we learned about making and manipulating logical vectors is applicable here, but with the restriction that we need a single logical value, i.e. a logical vector of length 1. ::: The general pattern of an if statement is: ``` if (LOGICAL_VALUE) { THING_TO_DO_IF_TRUE } else { THING_TO_DO_IF_FALSE } ``` Here is an example: ```{r, results="hold"} num <- 37 # 1 if (num > 100) { # 2 cat("greater\n") # } else { # cat("not greater\n") # 3 } # cat("done\n") # 4 # --time--> ``` ::: {.omit} The second line of this code uses an if-statement to tell R that we want to make a choice. If the following test is `TRUE`, the body of the `if` (i.e., the lines in the curly braces underneath it) are executed. If the test is `FALSE`, the body of the `else` is executed instead. Only one or the other is ever executed. In the example above, the test `num > 100` returns the value `FALSE`, which is why the code inside the `if` block was skipped and the code inside the `else` block was run instead. ::: If-statements don't have to include an `else`. If there isn't one, R simply does nothing if the test is FALSE: ```{r results="hold"} num <- 53 # 1 if (num > 100) { # 2 cat("num is greater than 100\n") # } # cat("done\n") # 3 # --time--> ``` We can also chain several tests together when there are more than two options. As an example, here is some code that works out the sign of a number: ```{r} if (num > 0) { # line 1 print(1) # line 2 } else if (num == 0) { # line 3 print(0) # line 4 } else { print(-1) # line 5 } num <- -3 num <- 0 num <- 2/3 ``` ## Quiz {.challenge} Which lines above are executed, and in what order, for each value of `num`? # Functions ::: {.omit} Functions are blocks of code stored in a special structure. They are modelled on the concept of a mathematical function - they receive inputs (arguments), perform operations on the inputs, and return results. Starting out in R mostly involves using functions that other people had written. We're now going to see how to create our own functions. ::: The general pattern of a function is: ``` FUNCTION_NAME <- function(ARGUMENT_NAME1, ARGUMENT_NAME2, ...) { FUNCTION_BODY } ``` Here is an example: ```{r} fahr_to_kelvin <- function(temp) { (temp-32) * (5/9) + 273.15 } ``` ::: {.omit} We define `fahr_to_kelvin` by assigning it a `function`. The list of argument names are containted within parentheses. Next, the body of the function--the statements that are executed when it runs--is contained within curly braces (`{}`). The statements in the body have been indented by four spaces, which makes the code easier to read but does not affect how the code operates. Note that there is a very important difference between _defining_ a function and _calling_, or running a function. When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. The expression inside the function is evaluated and the result is "returned" back to whoever asked for it. ::: Let's try running our function. Calling our own function is no different from calling any other function: ```{r} # freezing point of water fahr_to_kelvin(32) # boiling point of water fahr_to_kelvin(212) ``` ## Variations ::: {.omit} Here are some other equivalent ways to write this function. The braces are not actually needed if the function only contains one statement. ::: ```{r} fahr_to_kelvin <- function(temp) (temp-32) * (5/9) + 273.15 ``` ::: {.omit} The body of the function within the `{}` can contain multiple lines of code, "statements". Only the last line is returned. ::: ```{r} fahr_to_kelvin <- function(temp) { kelvin <- (temp-32) * (5/9) + 273.15 kelvin } ``` ::: {.omit} `return` can be used to return from a function immediately. Further code will not be run. ::: ```{r} fahr_to_kelvin <- function(temp) { kelvin <- (temp-32) * (5/9) + 273.15 return(kelvin) plot(1:10) } ``` ::: {.omit} Statements in the function may be split over several lines, so long as it is clear that the statement is incomplete due to an unclosed bracket or an operator in need of a right hand argument. ::: ```{r} fahr_to_kelvin <- function(temp) { kelvin <- (temp-32) * (5/9) + 273.15 return(kelvin) } ``` ::: {.omit} Spacing and layout is largely for our own sanity. ::: ```{r} fahr_to_kelvin<- function( temp){(temp -32 )*( 5 /9 )+ 273.15} ``` ::: {.omit} Beware! however of accidentally finishing a statement early. ::: ```{r} fahr_to_kelvin_broken <- function(temp) { (temp-32) * (5/9) + 273.15 } fahr_to_kelvin_broken(212) ``` ::: {.omit} The first line in the body of the function is computed then discarded. The second line, `+ 273.15`, is valid R so we don't even get an error. Finally, the computer doesn't understand or care what names we give to functions, arguments, or variables. ::: ```{r} charmander <- function(bulbasaur) { mew <- (bulbasaur-32) * (5/9) + 273.15 mew } ``` ## Composing Functions ::: {.omit} Now that we've seen how to turn Fahrenheit into Kelvin, it's easy to turn Kelvin into Celsius: ::: ```{r} kelvin_to_celsius <- function(temp) { temp - 273.15 } #absolute zero in Celsius kelvin_to_celsius(0) ``` ::: {.omit} What about converting Fahrenheit to Celsius? We could write out the formula, but we don't need to. Instead, we can compose the two functions we have already created: ::: ```{r} fahr_to_celsius <- function(temp) { temp_k <- fahr_to_kelvin(temp) temp_c <- kelvin_to_celsius(temp_k) temp_c } # freezing point of water in Celsius fahr_to_celsius(32.0) ``` ::: {.omit} This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here--typically half a dozen to a few dozen lines--but they shouldn't ever be much longer than that, or the next person who reads it won't be able to understand what's going on. Function names and argument names are important cues to what code is doing. Functions also help manage code complexity by cleaning up their contents after completion. You might have noticed that all of these functions have an argument called `temp`. Why hasn't this caused confusion and chaos? The answer is that arguments to a function and variables defined in the body of a function are *local* to a call to that function. Within the body of a function, R code can see variables from its own local environment, and variables from the global environment, but *not* variables local to other function calls. While running code R maintains a *stack* of local environments. When a call to a function returns, all of its local arguments and variables disappear. Tip: There are several ways to see this in action. * Scatter `cat`s through your functions showing the values of arguments and variables at various points. * Insert a call to `browser()` in one of your functions. This will pause execution and let you interact with the local environment in a function. Once in the browser, type "help" for a list of commands. Besides examining variables, you can step through the function from the point it paused with "n" or, at a finer grain, step into functions it calls with "s". When you are done type "f". * You can debug an existing function using `debugonce`. This is like temporarily adding a call to `browser()` at the top of the function. ::: ```{r, eval=FALSE} # debugging a function debugonce(fahr_to_celsius) fahr_to_celsius(212) ``` ## Challenge {.challenge} Write a function to calculate the length of the hypotenuse of a right angled triangle using Pythagorus's rule, given the lengths of the other sides. Hint: `sqrt` calculates the square root of a number. Testing your code is important. Invent a test case for your code consisting of: * The input arguments to your function. * The return value you expect. Confirm that your function works as expected. # For-loops ::: {.omit} There are three common approaches to repetitive tasks in R: 1. Copy and paste the code a few times. (Don't do this.) 2. Write a for-loop. 3. Write a function, and give this as an argument to an existing function that will call it repeatedly. Method 1 is error prone, and also hard to fix if you find a mistake in your code. We will now look at method 2, using for-loops. ::: The general pattern of a loop is: ``` for(VARIABLE_NAME in VECTOR) { FOR_LOOP_BODY } ``` Let's look at an example. Suppose we have a calculation we need to perform on a series of numbers. ```{r results="hold"} i <- 10 cat("i is",i,"\n") cat("i squared is",i*i,"\n") i <- 20 cat("i is",i,"\n") cat("i squared is",i*i,"\n") i <- 30 cat("i is",i,"\n") cat("i squared is",i*i,"\n") i <- 40 cat("i is",i,"\n") cat("i squared is",i*i,"\n") i <- 50 cat("i is",i,"\n") cat("i squared is",i*i,"\n") ``` Apart from the `i <- ...` lines, the code is the same each time. This can be re-written as a for-loop: ```{r results="hold"} for(i in c(10,20,30,40,50)) { # 1 cat("i is",i,"\n") # 2 4 6 8 10 cat("i squared is",i*i,"\n") # 3 5 7 9 11 } # # cat("done\n") # 12 # --order-of-execution--> ``` ::: {.omit} We can name the loop variable anything we like. `in` is part of the `for` syntax. The body of the loop is enclosed in curly braces `{ }`. For a single-line loop body the braces aren't needed, but it is good practice to always include them. The loop variable `i` is assigned each element in the vector in turn, and then the loop body runs. There's no possibility here of making a mistake when copying and pasting the code, and if we later discover we need to change this code, the for-loop version will be much easier to change. It will also be possible to run this code a different number of times by changing the vector used. ::: ## Challenge {.challenge} As a small group, compare your answers as you go. 1. Write down what these lines of code do in English. Check they do what you expect by running them in R. Try adding `cat`s or `print`s to the for-loop body to check what is going on. ```{r eval=FALSE} myvec <- c(10,20,30,40) total <- 0 for(item in myvec) { total <- total + item } total ``` 2. Write down what these lines of code do in English. How could this be changed to work with any length of `myvec`? ```{r eval=FALSE} myvec <- c(10,20,30,40) for(index in 1:4) { myvec[index] <- myvec[index] * 2 } ``` 3. Write the steps to calculate *n* factorial in English, i.e. `1*2*3* ... *n`. 4. Write R code to calculate 10 factorial. ```{r eval=FALSE} numbers <- 1:10 ... your code here ... ``` # A practical programming exercise ::: {.omit} Let's say we want to examine the quality of some FASTQ files, which contain reads from a DNA sequencing machine. An experiment has been performed over several days, and we want to run a program called "fastqc" on each of the FASTQ files. FASTQ is a text format containg a series of DNA sequences and associated quality information. Examine "Day0.fastq" in the "r-progtidy-files" folder. (This data is a very small portion of the reads from an experiment inducing pluripotency in a mouse cell line, [GSE70022](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70022).) ::: ## Running external software ::: {.omit} Software outside of R is normally run from a "terminal" window, which is rather like the console in RStudio. You type in a command and the operating system runs it for you. The commands you type in are interpreted by something called a "shell", because it acts like a shell around the operating system. For more information, see the [Software Carpentry](http://software-carpentry.org/lessons/) course on the Unix shell. ::: In RStudio, you can open a terminal using "Tools/Terminal/New Terminal". Try this and type in: ``` uptime ``` The "uptime" command tells you how long the computer has been running for. R can give a command to the shell using the `system` function. ```{r} system("uptime") ``` Here is how we want to run FastQC for day 0. To R the command has no meaning beyond being a character string. It will be interpreted by the shell, which is external to R. The shell will then run FastQC for us. ```{r cache=TRUE} system("FastQC/fastqc --extract --outdir . r-progtidy-files/Day0.fastq") ``` ::: {.omit} In the "Files" tab of the bottom right pane in RStudio, you should see that some new files and a directory have been created. Click on "Day0_fastqc.html" to examine it. Also examine the file in the "Day0_fastqc" folder called "summary.txt". ::: **If you don't have FastQC, or it failed to run for some reason:** The expected output files can be found in the folder "r-progtidy-files/fastqc-output". ## Using a for-loop ::: {.omit} Typing out the fastqc command for each day will get repetitive. Let's use a for-loop to automate this task. The `paste0` function lets us "paste" together character strings, so we can use that to construct the commands to run, like this: ::: ```{r} # construct a command to run day <- 0 command <- paste0("FastQC/fastqc --extract --outdir . r-progtidy-files/Day", day, ".fastq") command ``` ::: {.omit} We want to do this for each day, and then use `system` to run the resulting command: ::: ```{r cache=TRUE} # run fastqc on each of the files days <- c(0,4,7,10,15,20) for(day in days) { command <- paste0("FastQC/fastqc --extract --outdir . r-progtidy-files/Day", day, ".fastq") cat("Running the command:", command, "\n") system(command) } ``` ## Loading the summary.txt files **Note:** If you weren't able to run FastQC earlier, the expected output files can be found in the folder "r-progtidy-files/fastqc-output". You will need to adjust the filenames appropriately in the R code below. ::: {.omit} The summary.txt files are in "tab separated value" format. This is similar to comma separated value format we've seen in the introductory R day, but instead of using a comma it uses a special character called a tab to delimit columns. Tabs show up as variable amounts of space in a text editor. In R, they can be written `"\t"`. In base R we could use the function `read.delim` to read this file. It's quite similar to `read.csv`. However, let's use the Tidyverse package `readr` to load the file. To use `readr` we first need to load it with `library`. ::: ```{r} library(readr) ``` ::: {.omit} We can now read the file like this: ::: ```{r} read_tsv("Day0_fastqc/summary.txt") ``` ::: {.omit} Oh! There aren't column headings in this file. We need to tell `read_tsv` this. ::: ```{r} read_tsv("Day0_fastqc/summary.txt", col_names=FALSE) ``` ::: {.omit} Finally, `readr` is nudging us to specify the column types. Otherwise it has to guess, and sometimes guesses wrong. Not guessing things is especially important for programming as opposed to interactive use of R, as we won't immediately see the mistake. Have a look at `?read_tsv` to learn about using `col_types`. ::: ```{r} read_tsv("Day0_fastqc/summary.txt", col_names=FALSE, col_types="ccc") ``` We should tidy this up a bit before using it. Columns should have meaningful names, and the PASS/WARN/FAIL grading looks like it should be a factor: ```{r} filename <- "Day0_fastqc/summary.txt" sumtab <- read_tsv(filename, col_names=FALSE, col_types="ccc") colnames(sumtab) <- c("grade", "test", "file") sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS")) sumtab ``` We expect to have to examine many FastQC reports in future. It will be convenient to have this as a function! ::: {.omit} `filename` is going to be different each time we use this code, so it wants to be an argument to the function. The rest of the code goes in the body of the function. The returned value will be `sumtab`. ::: ```{r} load_fastqc <- function(filename) { sumtab <- read_tsv(filename, col_names=FALSE, col_types="ccc") colnames(sumtab) <- c("grade", "test", "file") sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS")) sumtab } load_fastqc("Day0_fastqc/summary.txt") ``` ## Applying the function ::: {.omit} We now want to load each of the `summary.txt` files we created earlier. To do this we're going to use an "apply"-style function. "apply"-style functions are functions that call some other function repeatedly on subsets of a data set. Some examples are `apply`, `lapply`, `vapply`, and `tapply`. `apply` itself is for use with matrices and usually produces a vector as output. Other functions in this family have different input and output types. Writing code that makes use of this type of function is called "functional programming". The flavour of apply we need here is `lapply`, which takes a vector and calls a function on each element in turn of that vector (i.e. the "subsets" here are simply individual elements). The result is a list containing the returned values, hence the name `lapply`. ::: First let's work out the names of the files we want to load. ```{r} days <- c(0,4,7,10,15,20) filenames <- paste0("Day", days, "_fastqc/summary.txt") filenames ``` Now we can use `lapply` to apply our function to each of the filenames. ```{r message=F} sumtabs <- lapply(filenames, load_fastqc) ``` Using `lapply` is "functional programming", where we define a series of objects by using functions on earlier objects. In this approach objects are never modified once created. Sometimes the objects used are themselves functions. See also: `purrr` package. An alternative would be to use a for loop. Using a for loop is "procedural programming", where the result is modified step by step until we obtain the value we want: ```{r} sumtabs <- list() for(idx in 1:length(filenames)) { sumtabs[[idx]] <- load_fastqc(filenames[idx]) } ``` `sumtabs` is a list. Lists can contain things of different types and different sizes. Here we have a list of data frames. Lists are a little different from the vectors we usually work with. See section 20.5 in the ["R for data science"](http://r4ds.had.co.nz/vectors.html) book for a comprehensive explanation. Here are some ways to examine this object: ```{r eval=FALSE} sumtabs class(sumtabs) length(sumtabs) str(sumtabs) ``` ::: {.omit} When we print `sumtabs` on the console, it gives us a hint how to access individual elements: using the `[[ ]]` syntax: ::: ```{r eval=FALSE} sumtabs[[1]] sumtabs[[2]] ``` It would be nicer if this was a single big data frame. The `dplyr` package provides a function to do this: ```{r message=F, warning=F} library(dplyr) bigtab <- bind_rows(sumtabs) bigtab table(bigtab$test, bigtab$grade) ``` ## Summary What we've just done is: 1. Create a custom reader tool that sets up each file inside R 2. Use a looping tool to apply that function to a collection of file names 3. Transform the results produced by the looping tool into a data frame that we can use for other purposes. ## Improving load_fastqc Our `load_fastqc` function will currently fail in a confusing way if it is given the wrong file: ```{r error=TRUE} load_fastqc("Day0_fastqc/fastqc_data.txt") ``` It's best to stop as soon as something goes wrong, and with an informative error message. ```{r error=TRUE} load_fastqc <- function(filename) { # Load file sumtab <- read_tsv(filename, col_names=FALSE, col_types="ccc") # Check number of columns if (ncol(sumtab) != 3) { stop("Wrong number of columns in file!") } # Make it nicer to work with colnames(sumtab) <- c("grade", "test", "file") sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS")) sumtab } load_fastqc("Day0_fastqc/fastqc_data.txt") ``` Checking input arguments before proceeding is a common pattern when writing R functions. # Sourcing .R files Having developed some useful functions, we might want to re-use them in a future project or share them with a friend. We can store R code in a file, usually with extension ".R", and load it with the `source` function. ::: {.omit} Put your `load_fastqc` function in a file called "fastqc.R". It uses the `readr` library, so be sure to load the library in the file as well. ::: ``` # fastqc.R file should contain: library(readr) load_fastqc <- function(filename) { # Load file sumtab <- read_tsv(filename, col_names=FALSE, col_types="ccc") # Check number of columns if (ncol(sumtab) != 3) { stop("Wrong number of columns in file!") } # Make it nicer to work with colnames(sumtab) <- c("grade", "test", "file") sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS")) sumtab } ``` ```{r eval=FALSE} # From the console: source("fastqc.R", local=TRUE) ``` ::: {.omit} Everything in the file runs as though it were typed on the console. Similarly, if you are going to be doing several different things with a data set, you could write a .R file to load the data into a tidy form, and several other .R scripts to do various different things with it. ::: # Organizing a project I like to organize large projects into multiple folders: ``` /raw Raw data (do not edit!) /output Output files /scripts R scripts /reports Reports in R Markdown README.txt ``` Use version control, such as Git + GitHub. Some tips I've gathered: * Working on a project is iterative. Sometimes we put code for multiple approaches side by side. Sometimes we use version control features such as branches. * Coming back to a project, we've regretted not writing down the order things need to be run to fully reproduce outputs and reports, and what the preferred approach was if we tried many things. * Some people like `workflowr`, which automates working on a large project. * The `here` package is useful when you have multiple folders. See also Software Carpentry's list of best practices in R: http://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/ See also Data Fluency workshop "Introduction to Data Organization." # Packages Packages are the next step up from sourcing .R files. They let you write code that other people can install and then load with `library`. Hadley Wickham has a package called [devtools](https://cran.r-project.org/web/packages/devtools/index.html) that takes a lot of the pain out of package writing. Also useful is the package [usethis](https://cran.r-project.org/web/packages/usethis/index.html), which automates creation of a package directory and basic files, and can set up various other parts of a package. Packages generally contain documentation and example code for all functions, and one or more vignettes describing how to use the package. Function documentation is usually written as specially formatted comments before each function using [roxygen2](https://cran.r-project.org/web/packages/roxygen2/index.html). ```{r, eval=FALSE} library(devtools) library(usethis) # Create an empty package template create_package("mypack", open=FALSE) # ... Edit mypack/DESCRIPTION file # ... Write .R files in mypack/R folder # For example create a file mypack/R/hello.R containing: # # # # # #' Greeting function #' #' This function displays a greeting message. #' #' @examples #' hello() #' #' @export hello <- function() { cat("Hello, world.\n") } # # # # # # Build package documentation, converting inline documentation to .Rd files using roxygen2. # Update NAMESPACE file (lists functions the package exports for public consumption). document("mypack") # Load package. Use this during development. load_all("mypack") hello() # Check for common problems and missing documentation. # This also automatically runs document( ). # A CRAN package must pass all checks. check("mypack") ``` ::: {.omit} `create_package` creates a `.Rproj` project file for the package. If you open this project, RStudio changes to the package directory, and there is a new "Build" tab. This provides tools that parallel what Hadley's `devtools` package provides with `load_all` and `check`, you can decide which you prefer! Old-school R developers may also use `R CMD` commands from the terminal (try typing `R -h` in the "Terminal" tab). * [Hadley Wickham's "R packages" book](http://r-pkgs.had.co.nz/) * [Official package writing manual](https://cran.r-project.org/doc/manuals/r-release/R-exts.html) ::: Packages are most easily distributed by placing them on GitHub or GitLab. They can then be installed by others using the `remotes` package functions `install_github` or `install_gitlab`. ```{r eval=FALSE} # To install from GitHub: remotes::install_github("myusername/mypack") ``` Once a package is mature and well documented, it can be submitted to CRAN or Bioconductor. **CRAN submissions:** CRAN is quite strict about packages passing automated checks, and they will also manually review your package. Submission is by uploading the package tarball using a web form. Each new version also gets a brief manual review. **Bioconductor submissions:** Bioconductor is for packages that deal with large biology-related datasets. Packages should make use of existing Bioconductor data types and infrastructure where possible. Bioconductor has some additional automated checks you will need to pass. The initial review process is more intensive than CRAN's and happens in GitHub, so some familiarity with git will be necessary. Once the package is accepted, changes can be pushed to the Bioconductor git repositories without further manual review. You are also expected to be available to answer questions on their Stack Overflow-style support site and subscribe to the developer mailing list. ```{r} # (Recording the R version and versions of packages used improves reproducibility.) sessionInfo() ```