R history





S - 1976

“The New S Language”, published 1988, introduced functions, …

“Statistical Models in S”, published 1991, introduced data frames, …

R - 1993

We will be focussing on the modern Tidyverse approach.

  • “tidy data”: the data frame (“tibble”) is the one true data structure
  • dplyr, ggplot2, tidyr, purrr, etc
  • mostly written by Hadley Wickham
  • simpler, safer, and less “helpful”

Analysis cycle

From the “R for Data Science” book

Data analysis isn’t just statistical modelling.

The greatest value of a picture is when it forces us
to notice what we never expected to see.

– John Tukey

Topics covered today


For modelling, get started with Data Fluency’s “Linear Models in R” workshop.

Programming

Use structures such as loops to automate repetitive tasks

for(i in 1:10) {
    do_the_thing(i)
}
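As a minimal sketch, here is the same loop made runnable. `do_the_thing` is only a placeholder on the slide, so the definition below (squaring a number) is an assumption made purely for illustration.

```r
# Hypothetical stand-in for do_the_thing(): square a number.
do_the_thing <- function(i) {
    i^2
}

# Pre-allocate a vector to hold one result per iteration.
results <- numeric(10)

for (i in 1:10) {
    results[i] <- do_the_thing(i)
}

results
```

Storing each result in `results` keeps the output of every iteration available after the loop finishes.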

Follow a script

source("myscript.R")
  • Runs script, command by command, to completion (or first error).

Or: use R Markdown to produce a document.

Define new functions for interactive or scripted use

source("myfunctions.R")

Or: write an R package.
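A small sketch of how this might look in practice. The temporary file stands in for "myfunctions.R", and the function `fahrenheit_to_celsius` is a hypothetical example, not something from the workshop material.

```r
# Write a small function definition to a temporary file,
# standing in for "myfunctions.R".
script_file <- tempfile(fileext = ".R")
writeLines(c(
    "fahrenheit_to_celsius <- function(temp_f) {",
    "    (temp_f - 32) * 5 / 9",
    "}"
), script_file)

# source() runs the file, which defines the function in this session.
source(script_file)

# The function can now be used interactively or from another script.
fahrenheit_to_celsius(212)
```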

Programming, from a scientist’s perspective

If every step of your analysis is recorded in an R script, with no manual steps:

  • you have a complete record of what you have done
  • easy to run entire script with test data
  • changes easily tested, poor early decisions easily fixed
  • today’s big project becomes a function in a package, serves as tomorrow’s building block


Programming is an essential part of reproducible research.

  • other researchers can precisely understand and verify your work

Algorithmic thinking

Programming involves two very different activities.


1. Thinking through the problem

  • Define the problem precisely.

  • Anticipate things that might go wrong or need special handling.

  • Come up with a step-by-step solution, using phrases like “for each” and “if”.

This is algorithmic thinking. It doesn’t need a computer!


2. Turning the step-by-step solution into R code
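To illustrate both activities together, here is a sketch in which the step-by-step solution is written as comments and the R code follows each step. The problem, the data, and the threshold are all made up for illustration.

```r
# Problem: count the measurements above a threshold.
# Special case anticipated: some measurements may be missing (NA).
measurements <- c(4.2, NA, 7.5, 6.1, NA, 9.0)
threshold <- 6

# Start with a count of 0.
count <- 0

# For each measurement x:
for (x in measurements) {
    # If x is not missing and is above the threshold, add 1 to the count.
    if (!is.na(x) && x > threshold) {
        count <- count + 1
    }
}

# The answer is the final value of count.
count
```

The check for missing values comes from thinking through what might go wrong, before any code was written.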

Algorithmic thinking

Suppose you want to add the numbers 5, 3, 9, 7.

We can write down steps to solve this, then convert them to R code.



Start with a total of 0.

Add 5 to the total.

Add 3 to the total.

Add 9 to the total.

Add 7 to the total.

The total is now the answer.

total <- 0

total <- total + 5

total <- total + 3

total <- total + 9

total <- total + 7

total

Algorithmic thinking

Suppose you want to add up a collection of numbers in general.

We can write down steps to solve this, then convert them to R code.







Start with a total of 0.

For each number x in the collection: 
    Add x to the total.


The answer is the final value of total.

collection <- c(5, 3, 9, 7)



total <- 0

for(x in collection) {
    total <- total + x
}

total

Algorithmic thinking

Like a cooking recipe or a lab protocol, we can write down the steps once, then use them whenever we need to solve this problem.



This is how to add up the numbers in a collection:

    Start with a total of 0.

    For each number x in the collection: 
        Add x to the total.


    The answer is the final value of total.




Now add up 5, 3, 9, and 7.

addup <- function(collection) {

    total <- 0

    for(x in collection) {
        total <- total + x
    }

    total

}


addup(c(5, 3, 9, 7))
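Once the steps are wrapped in a function, they can be reused on any collection. The sketch below repeats the definition so it runs on its own, then tries a few inputs, including an empty collection as a special case worth anticipating.

```r
# Same definition as above, repeated so this block runs on its own.
addup <- function(collection) {
    total <- 0
    for (x in collection) {
        total <- total + x
    }
    total
}

addup(c(5, 3, 9, 7))   # 24
addup(c(2.5, 2.5))
addup(c())             # empty collection: the loop body never runs, so 0

# Base R's sum() gives the same answer.
sum(c(5, 3, 9, 7))
```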

(do programming section)

Data





As your programs get more complicated, you will also need ways to represent complex data.


Often the largest task is to get the data into the right form to apply a tool
such as ggplot, summarize, or lm.

Data

Vectors

  • A collection of a single kind of data:
    numeric, character, factor, logical
  • A single number is a vector of length 1.
  • Can have names().

c(x=1, y=2, z=3)
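A short sketch of these vector properties. The variable names `v` and `mixed` are chosen here for illustration.

```r
# A named numeric vector.
v <- c(x = 1, y = 2, z = 3)

names(v)        # "x" "y" "z"
v["y"]          # select an element by name
length(5)       # a single number is a vector of length 1

# A vector holds one kind of data, so mixing kinds converts everything
# to the most general one (here, character).
mixed <- c(1, "two", TRUE)
class(mixed)    # "character"
```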

Lists

  • A special kind of vector that can hold any kind of data, including other vectors and lists.
  • If you need to bundle together a miscellaneous collection of data, lists are your solution. For example, a function that needs to return multiple results can return a list.
  • Play the same role as both the list and dict types in Python, or the Object and Array types in JavaScript.
  • Access individual elements with [[ ]] or $.

list(x=TRUE, y="two", z=c(1,2,3))
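As a sketch of a function returning multiple results in a list, and of the two ways to access elements. `summarize_numbers` is a hypothetical helper written for this example.

```r
# Hypothetical helper that returns several results bundled in a list.
summarize_numbers <- function(numbers) {
    list(
        n = length(numbers),
        mean = mean(numbers),
        range = range(numbers)
    )
}

result <- summarize_numbers(c(5, 3, 9, 7))

result$n            # access by name with $
result[["mean"]]    # access by name with [[ ]]
result[[3]]         # access by position with [[ ]]
```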

Data

Data frames

  • Data frames hold tabular data where the columns may be different types.
  • Under the hood, a list of column vectors.
  • Tidyverse has an improved data frame called a “tibble”.
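A brief sketch of a data frame as a list of column vectors. The table contents are made up, and the final step assumes the tibble package is installed (it is skipped otherwise).

```r
# Columns of different types in one table.
samples <- data.frame(
    name = c("a", "b", "c"),
    weight = c(1.5, 2.0, 2.5),
    treated = c(TRUE, FALSE, TRUE)
)

# A data frame is a list of columns, so list access works.
is.list(samples)        # TRUE
samples$weight          # one column, as a vector
samples[["treated"]]

nrow(samples)
ncol(samples)

# Optional: convert to a tibble, assuming the tibble package is installed.
if (requireNamespace("tibble", quietly = TRUE)) {
    samples_tbl <- tibble::as_tibble(samples)
    print(samples_tbl)
}
```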

Others (not covered today)

  • Matrices hold tabular data all of the same type, usually numeric.
    Distinct from data frames in R!
    Can have rownames() and colnames().

  • “S3” objects are (usually) a list, with a special class attribute.
    Example: an lm object holding a linear model.

  • “S4” objects are a more formal approach to object-orientation.
    Most heavily used by the Bioconductor project.
    Example: a GRanges object holding genomic ranges.
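For the curious, a minimal sketch of the first two of these using only base R. The row and column names are invented, and `cars` is a small dataset that ships with R. S4 is left out because GRanges needs a Bioconductor package.

```r
# A matrix: every cell is the same type, with optional row and column names.
m <- matrix(1:6, nrow = 2)
rownames(m) <- c("gene1", "gene2")
colnames(m) <- c("s1", "s2", "s3")
m["gene2", "s3"]

# An S3 object: lm() returns a list with a class attribute of "lm".
fit <- lm(dist ~ speed, data = cars)
class(fit)      # "lm"
is.list(fit)    # TRUE
names(fit)[1:3]
```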

Tidy data

Tidy data doesn’t mean tidy for a person to read; it means the form that is easiest for the computer to work with.

  • only use data frames
  • put all the data in a single data frame
  • each row is a single unit of observation
  • each column is a single piece of information

Similar to database design.

The experimental design is in the body of the table alongside the data, not in row names or column names.

If you have multiple columns containing the same kind of information, this is a hint the data is not tidy.

Not tidy …

… tidier … tidy

(“melt” = “gather” = “pivot_longer”)
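A small sketch of going from not tidy to tidy with tidyr's `pivot_longer`, assuming the tidyr package is installed. The table and its column names are invented for illustration.

```r
library(tidyr)

# Not tidy: the treatment (part of the experimental design) is stored
# in the column names, and two columns hold the same kind of information.
wide <- data.frame(
    gene = c("g1", "g2"),
    control = c(10, 20),
    treated = c(15, 18)
)

# Tidy: one row per observation, with the treatment in the table body.
long <- pivot_longer(
    wide,
    cols = c(control, treated),
    names_to = "treatment",
    values_to = "expression"
)

long
```

In the long form, each row is one gene under one treatment, which is the shape that tools such as ggplot and lm generally expect.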

Learning more

(do tidyverse section)