This lesson covers packages primarily by Hadley Wickham for tidying data and then working with it in tidy form, collectively known as the “tidyverse”.

The packages we are using in this lesson are all from CRAN, so we can install them with install.packages. If you did this while setting up for this workshop, you won’t need to do it again now.

# install.packages("tidyverse")
library(tidyverse) # Load all "tidyverse" libraries.
# OR
# library(readr)   # Read tabular data.
# library(tidyr)   # Data frame tidying functions.
# library(dplyr)   # General data frame manipulation.
# library(ggplot2) # Flexible plotting.

These packages usually have useful documentation in the form of “vignettes”. These are readable on the CRAN website, or within R:

vignette()
vignette(package="dplyr")
vignette("dplyr", package="dplyr")

We will continue using the FastQC output data frame. If you’re starting fresh for this lesson, load it with:

bigtab <- read_csv("r-progtidy-files/fastqc.csv")
## Rows: 72 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): grade, test, file
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dplyr

dplyr gives us a collection of convenient “verbs” for manipulating data frames. The name is a contraction of “Data frame apPLYeR”, as some of these verbs have a similar flavour to apply/lapply/tapply/etc.

Each verb takes a data frame as input and returns a modified version of it. The idea is that complex operations can be performed by stringing together a series of simpler operations in a pipeline.

# input         +--------+        +--------+        +--------+      result
#  data    |>   |  verb  |   |>   |  verb  |   |>   |  verb  |  ->   data
#   frame       +--------+        +--------+        +--------+        frame

Some of these verbs are covered in our introductory R workshop, so some of this may be familiar already.

Tibbles

Because we used readr to read the data, bigtab is a “tibble”, which is the Tidyverse’s improved data frame.

bigtab
## # A tibble: 72 × 3
##    grade test                         file      
##    <chr> <chr>                        <chr>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 FAIL  Per base sequence content    Day0.fastq
##  6 WARN  Per sequence GC content      Day0.fastq
##  7 PASS  Per base N content           Day0.fastq
##  8 PASS  Sequence Length Distribution Day0.fastq
##  9 PASS  Sequence Duplication Levels  Day0.fastq
## 10 WARN  Overrepresented sequences    Day0.fastq
## # ℹ 62 more rows

You can also create tibbles explicitly with the tibble function. One convenient feature is that it only shows a few rows of the data frame when you print it. If you do want to see the whole table, you can use as.data.frame, or use the View function to view it in a tab.

as.data.frame(bigtab)
View(bigtab)

The n and width arguments to print can also be used to print more rows or columns respectively.

print(bigtab, n=100, width=1000)

filter( )

Say we want to know all the tests that failed. filter provides a convenient way to get rows matching a query.

filter(bigtab, grade == "FAIL")
## # A tibble: 8 × 3
##   grade test                      file       
##   <chr> <chr>                     <chr>      
## 1 FAIL  Per base sequence content Day0.fastq 
## 2 FAIL  Per base sequence content Day4.fastq 
## 3 FAIL  Per base sequence content Day7.fastq 
## 4 FAIL  Per sequence GC content   Day7.fastq 
## 5 FAIL  Per base sequence content Day10.fastq
## 6 FAIL  Per base sequence content Day15.fastq
## 7 FAIL  Per base sequence content Day20.fastq
## 8 FAIL  Per sequence GC content   Day20.fastq

Something is magic here: we do not have grade in our environment, only within the data frame. Arguments to filter can use any column of the data frame as a variable. dplyr uses this idea heavily, arguments to dplyr verbs often behave in a special way.

arrange( )

Rather than filtering, we might instead want to sort the data so the most important rows are at the top. arrange sorts a data frame by one or more columns.

arrange(bigtab, grade)
## # A tibble: 72 × 3
##    grade test                      file       
##    <chr> <chr>                     <chr>      
##  1 FAIL  Per base sequence content Day0.fastq 
##  2 FAIL  Per base sequence content Day4.fastq 
##  3 FAIL  Per base sequence content Day7.fastq 
##  4 FAIL  Per sequence GC content   Day7.fastq 
##  5 FAIL  Per base sequence content Day10.fastq
##  6 FAIL  Per base sequence content Day15.fastq
##  7 FAIL  Per base sequence content Day20.fastq
##  8 FAIL  Per sequence GC content   Day20.fastq
##  9 PASS  Basic Statistics          Day0.fastq 
## 10 PASS  Per base sequence quality Day0.fastq 
## # ℹ 62 more rows
# desc( ) can be used to reverse the sort order
arrange(bigtab, desc(grade))
## # A tibble: 72 × 3
##    grade test                      file       
##    <chr> <chr>                     <chr>      
##  1 WARN  Per sequence GC content   Day0.fastq 
##  2 WARN  Overrepresented sequences Day0.fastq 
##  3 WARN  Per sequence GC content   Day4.fastq 
##  4 WARN  Per sequence GC content   Day10.fastq
##  5 WARN  Overrepresented sequences Day10.fastq
##  6 WARN  Per sequence GC content   Day15.fastq
##  7 WARN  Overrepresented sequences Day15.fastq
##  8 WARN  Overrepresented sequences Day20.fastq
##  9 PASS  Basic Statistics          Day0.fastq 
## 10 PASS  Per base sequence quality Day0.fastq 
## # ℹ 62 more rows

Joins

Say we want to convert PASS/WARN/FAIL into a numeric score, so we can produce some numerical summaries. The scoring scheme will be:

fwp <- c("FAIL","WARN","PASS")
scoring <- tibble(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))

scoring
## # A tibble: 3 × 2
##   grade score
##   <fct> <dbl>
## 1 FAIL    0  
## 2 WARN    0.5
## 3 PASS    1

dplyr has several join functions, including left_join, right_join, inner_join, full_join, and anti_join. The difference between these functions is what happens when there is a row in one data frame without a corresponding row in the other data frame. inner_join discards such rows. full_join always keeps them, filling in missing data with NA. left_join always keeps rows from the first data frame. right_join always keeps rows from the second data frame. anti_join is a bit different, it gives you rows from the first data frame that aren’t in the second data frame.

We can use a join to augment a data frame with some extra information. left_join is a good default choice for this as it will never delete rows from the data frame that is being augmented.

scoretab <- left_join(bigtab, scoring, by="grade")
scoretab
## # A tibble: 72 × 4
##    grade test                         file       score
##    <chr> <chr>                        <chr>      <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1  
##  2 PASS  Per base sequence quality    Day0.fastq   1  
##  3 PASS  Per tile sequence quality    Day0.fastq   1  
##  4 PASS  Per sequence quality scores  Day0.fastq   1  
##  5 FAIL  Per base sequence content    Day0.fastq   0  
##  6 WARN  Per sequence GC content      Day0.fastq   0.5
##  7 PASS  Per base N content           Day0.fastq   1  
##  8 PASS  Sequence Length Distribution Day0.fastq   1  
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1  
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5
## # ℹ 62 more rows

The grade columns act as the keys of the two data frames, allowing corresponding rows in the data frames to be joined together.

One important thing that all the join functions do: if multiple rows have the same key in either data frame, all ways of combining the two sets of rows will be included in the result. So, here, rows from the scoring data frame have been copied many times in the output.

Joins are the key tool for integrating different types of data, based on some common key such as gene ID.

See also: match and merge in base R.

summarize( )

summarize lets us compute summaries of data.

summarize(scoretab, average_score=mean(score))
## # A tibble: 1 × 1
##   average_score
##           <dbl>
## 1         0.833

In this, average_score is used to name the new column. Any summary function could be used in place of mean, such as sum, median, sd.

We really want to summarize the data grouped by file, so we can see if there are any particularly bad files. This is achieved using the group_by “adjective”. group_by gives the data frame a grouping using one or more columns, which modifies the subsequent call to summarize.

summarize when used with group_by turns each group of rows into a single row:

group_by(scoretab, file)
## # A tibble: 72 × 4
## # Groups:   file [6]
##    grade test                         file       score
##    <chr> <chr>                        <chr>      <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1  
##  2 PASS  Per base sequence quality    Day0.fastq   1  
##  3 PASS  Per tile sequence quality    Day0.fastq   1  
##  4 PASS  Per sequence quality scores  Day0.fastq   1  
##  5 FAIL  Per base sequence content    Day0.fastq   0  
##  6 WARN  Per sequence GC content      Day0.fastq   0.5
##  7 PASS  Per base N content           Day0.fastq   1  
##  8 PASS  Sequence Length Distribution Day0.fastq   1  
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1  
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5
## # ℹ 62 more rows
summarize(group_by(scoretab, file), average_score=mean(score))
## # A tibble: 6 × 2
##   file        average_score
##   <chr>               <dbl>
## 1 Day0.fastq          0.833
## 2 Day10.fastq         0.833
## 3 Day15.fastq         0.833
## 4 Day20.fastq         0.792
## 5 Day4.fastq          0.875
## 6 Day7.fastq          0.833

The special function n() can be used within summarize to get the number of rows. (This also works in mutate, but is most useful in summarize.)

summarize(group_by(scoretab, grade), count=n())
## # A tibble: 3 × 2
##   grade count
##   <chr> <int>
## 1 FAIL      8
## 2 PASS     56
## 3 WARN      8

See also: shortcut functions for simple uses of summarize count and distinct.

See also: table and tapply in base R.

Tip: group_by also affects other verbs, such as mutate. This is more advanced than we will be going into today. Grouping can be removed with ungroup.

The pipe |>

We often want to string together a series of dplyr functions. This is achieved using R’s pipe operator, |>. This takes the value on the left, and passes it as the first argument to the function call on the right. So the previous summarization could be written:

scoretab |> group_by(grade) |> summarize(count=n())

In older books and web pages you may see %>% used instead of |>. This is a predecessor to the now standardized R pipe, defined in the magrittr package.

|> isn’t limited to dplyr functions. It’s an alternative way of writing any R code.

rep(paste("hello", "world"), 5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"
"hello" |> paste("world") |> rep(5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"

Often for readability we will write a pipeline over several lines:

scoretab |>
    group_by(grade) |> 
    summarize(count=n())

The idea of adding geoms in ggplot2 is rather like the dplyr pipe. The only reason pipes aren’t used in ggplot2 is that it predates dplyr and the invention of the pipe.

Challenge

Write a pipeline using |>s that starts with bigtab, joins the scoring table, and then calculates the average score for each test.

mutate( )

A couple more verbs will complete our core vocabulary. mutate lets us add or overwrite columns by computing a new value for them.

mutate(scoretab, doublescore = score*2)

Equivalently:

scoretab |>
    mutate(doublescore = score*2)
## # A tibble: 72 × 5
##    grade test                         file       score doublescore
##    <chr> <chr>                        <chr>      <dbl>       <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1             2
##  2 PASS  Per base sequence quality    Day0.fastq   1             2
##  3 PASS  Per tile sequence quality    Day0.fastq   1             2
##  4 PASS  Per sequence quality scores  Day0.fastq   1             2
##  5 FAIL  Per base sequence content    Day0.fastq   0             0
##  6 WARN  Per sequence GC content      Day0.fastq   0.5           1
##  7 PASS  Per base N content           Day0.fastq   1             2
##  8 PASS  Sequence Length Distribution Day0.fastq   1             2
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1             2
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5           1
## # ℹ 62 more rows

Even though this is called “mutate”, it is not literally modifying the input. Rather it is producing a copy of the data frame that has the modification.

The above is equivalent to:

scoretab2 <- scoretab
scoretab2$doublescore <- scoretab2$score * 2

select( )

dplyr’s select function is for subsetting, renaming, and reordering columns. Old columns can be referred to by name or by number.

select(bigtab, test,grade) 
## # A tibble: 72 × 2
##    test                         grade
##    <chr>                        <chr>
##  1 Basic Statistics             PASS 
##  2 Per base sequence quality    PASS 
##  3 Per tile sequence quality    PASS 
##  4 Per sequence quality scores  PASS 
##  5 Per base sequence content    FAIL 
##  6 Per sequence GC content      WARN 
##  7 Per base N content           PASS 
##  8 Sequence Length Distribution PASS 
##  9 Sequence Duplication Levels  PASS 
## 10 Overrepresented sequences    WARN 
## # ℹ 62 more rows
select(bigtab, 2,1)
## # A tibble: 72 × 2
##    test                         grade
##    <chr>                        <chr>
##  1 Basic Statistics             PASS 
##  2 Per base sequence quality    PASS 
##  3 Per tile sequence quality    PASS 
##  4 Per sequence quality scores  PASS 
##  5 Per base sequence content    FAIL 
##  6 Per sequence GC content      WARN 
##  7 Per base N content           PASS 
##  8 Sequence Length Distribution PASS 
##  9 Sequence Duplication Levels  PASS 
## 10 Overrepresented sequences    WARN 
## # ℹ 62 more rows
select(bigtab, foo=file, bar=test, baz=grade)
## # A tibble: 72 × 3
##    foo        bar                          baz  
##    <chr>      <chr>                        <chr>
##  1 Day0.fastq Basic Statistics             PASS 
##  2 Day0.fastq Per base sequence quality    PASS 
##  3 Day0.fastq Per tile sequence quality    PASS 
##  4 Day0.fastq Per sequence quality scores  PASS 
##  5 Day0.fastq Per base sequence content    FAIL 
##  6 Day0.fastq Per sequence GC content      WARN 
##  7 Day0.fastq Per base N content           PASS 
##  8 Day0.fastq Sequence Length Distribution PASS 
##  9 Day0.fastq Sequence Duplication Levels  PASS 
## 10 Day0.fastq Overrepresented sequences    WARN 
## # ℹ 62 more rows

select has a special syntax for more complicated column selections. Read about it here. For example, you can remove a specific column like this:

select(bigtab, !file)
## # A tibble: 72 × 2
##    grade test                        
##    <chr> <chr>                       
##  1 PASS  Basic Statistics            
##  2 PASS  Per base sequence quality   
##  3 PASS  Per tile sequence quality   
##  4 PASS  Per sequence quality scores 
##  5 FAIL  Per base sequence content   
##  6 WARN  Per sequence GC content     
##  7 PASS  Per base N content          
##  8 PASS  Sequence Length Distribution
##  9 PASS  Sequence Duplication Levels 
## 10 WARN  Overrepresented sequences   
## # ℹ 62 more rows

With filter, arrange, mutate, select, group_by, summarize, and joins, we have a core vocabulary for manipulating data frames in pipelines.

tidyr

tidyr is the Tidyverse package for getting data frames to tidy. In a tidy data frame:

  • each row is a single unit of observation
  • each column is a single piece of information

I’m not convinced tidyr is a great package, but the ideas behind it are very important. dplyr uses magic arguments to produce beautiful R code, but in tidyr it gets a bit confusing. Anyway, let’s work through a toy example:

untidy <- read_csv(
    "country,     male-young, male-old, female-young, female-old
     Australia,            1,        2,            3,          4
     New Zealand,          5,        6,            7,          8")
## Rows: 2 Columns: 5
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (4): male-young, male-old, female-young, female-old
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
untidy
## # A tibble: 2 × 5
##   country     `male-young` `male-old` `female-young` `female-old`
##   <chr>              <dbl>      <dbl>          <dbl>        <dbl>
## 1 Australia              1          2              3            4
## 2 New Zealand            5          6              7            8

In this example, the first problem is that rows are not distinct units of observation, there are actually four observations per row. This is fixed using pivot_longer. (This operation has in the past also been called gather or melt.)

longer <- pivot_longer(untidy, cols=!country, names_to="group", values_to="cases")
longer
## # A tibble: 8 × 3
##   country     group        cases
##   <chr>       <chr>        <dbl>
## 1 Australia   male-young       1
## 2 Australia   male-old         2
## 3 Australia   female-young     3
## 4 Australia   female-old       4
## 5 New Zealand male-young       5
## 6 New Zealand male-old         6
## 7 New Zealand female-young     7
## 8 New Zealand female-old       8

We give a data frame, a specification of columns to use (cols=), a column to put the column names in (names_to=) and a column to put the values in (values_to=). The column specification behaves like select. Here we’ve asked to used all columns except country. We could equivalently have put cols=2:5.

pivot_wider is the opposite operation, spreading two columns into many columns, which generally makes data less tidy but is sometimes necessary. A common application is producing a scatter plot, where the x and y axes need to be two different columns even though they measure the same type of thing. Data may also be easier to look at in a table in spread form.

pivot_wider(longer, names_from=group, values_from=cases)
## # A tibble: 2 × 5
##   country     `male-young` `male-old` `female-young` `female-old`
##   <chr>              <dbl>      <dbl>          <dbl>        <dbl>
## 1 Australia              1          2              3            4
## 2 New Zealand            5          6              7            8
pivot_wider(bigtab, names_from=file, values_from=grade)
## # A tibble: 12 × 7
##    test     Day0.fastq Day4.fastq Day7.fastq Day10.fastq Day15.fastq Day20.fastq
##    <chr>    <chr>      <chr>      <chr>      <chr>       <chr>       <chr>      
##  1 Basic S… PASS       PASS       PASS       PASS        PASS        PASS       
##  2 Per bas… PASS       PASS       PASS       PASS        PASS        PASS       
##  3 Per til… PASS       PASS       PASS       PASS        PASS        PASS       
##  4 Per seq… PASS       PASS       PASS       PASS        PASS        PASS       
##  5 Per bas… FAIL       FAIL       FAIL       FAIL        FAIL        FAIL       
##  6 Per seq… WARN       WARN       FAIL       WARN        WARN        FAIL       
##  7 Per bas… PASS       PASS       PASS       PASS        PASS        PASS       
##  8 Sequenc… PASS       PASS       PASS       PASS        PASS        PASS       
##  9 Sequenc… PASS       PASS       PASS       PASS        PASS        PASS       
## 10 Overrep… WARN       PASS       PASS       WARN        WARN        WARN       
## 11 Adapter… PASS       PASS       PASS       PASS        PASS        PASS       
## 12 Kmer Co… PASS       PASS       PASS       PASS        PASS        PASS

In our toy example, we have a further problem that the “group” column contains two pieces of data. This can be fixed with separate. By default separate splits on any non-alphanumeric characters, but different separator characters can be specified.

separate(longer, col=group, into=c("gender","age"))
## # A tibble: 8 × 4
##   country     gender age   cases
##   <chr>       <chr>  <chr> <dbl>
## 1 Australia   male   young     1
## 2 Australia   male   old       2
## 3 Australia   female young     3
## 4 Australia   female old       4
## 5 New Zealand male   young     5
## 6 New Zealand male   old       6
## 7 New Zealand female young     7
## 8 New Zealand female old       8

All of this would typically be written as a single pipline:

tidied <- untidy |>
    pivot_longer(cols=!country, names_to="group", values_to="cases") |>
    separate(group, into=c("gender","age"))

Finally, we mention that pivot_longer has features we haven’t explored, and it is actually possible to do this in one step. See the tidyr vignettes.

pivot_longer(
    untidy, cols=!country, 
    names_to=c("gender","age"), names_sep="-", values_to="cases") 
## # A tibble: 8 × 4
##   country     gender age   cases
##   <chr>       <chr>  <chr> <dbl>
## 1 Australia   male   young     1
## 2 Australia   male   old       2
## 3 Australia   female young     3
## 4 Australia   female old       4
## 5 New Zealand male   young     5
## 6 New Zealand male   old       6
## 7 New Zealand female young     7
## 8 New Zealand female old       8

Advanced: tidyr has a number of other useful data frame manipulations. For completeness, we mention nest. This creates a list column in a tibble, the list column containing one tibble per row. Yo, Hadley put tibbles in your tibble so you can dplyr while you dplyr. The inverse operation is unnest. unnest is rather like the bind_rows function we saw earlier. nest and unnest go well with creating or modifying list columns in a mutate with lapply (or the purrr package).

# Advanced
nested <- nest(tidied, data=c(gender, age, cases))
nested
nested$data
unnest(nested, data)

Challenge

You receive data on a set of points. The points are in two dimensions (dim), and each point has x and y coordinates. Unfortunately it looks like this:

df <- read_csv(
    "dim, E_1, E_2, M_1, M_2, M_3, M_4, M_5
     x,   2,   4,   1,   2,   3,   4,   5
     y,   4,   4,   2,   1,   1,   1,   2")
## Rows: 2 Columns: 8
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): dim
## dbl (7): E_1, E_2, M_1, M_2, M_3, M_4, M_5
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Tidy the data by pivoting longer all of the columns except dim. What does each row now represent?

  2. We want to plot the points as a scatter-plot, using either plot or ggplot. Pivot the long data wider so that this is possible. Now what do the rows represent?

  3. The data seems to be divided into “E” and “M” points. How could we make a column containing “E”s and “M”s?

An RNA-Seq example

We will now put this all together by looking at a small-scale gene expression data set. To simplify somewhat: The data was generated from a multiplex PCR targetting 44 genes. The PCR product is quantified by counting reads from high throughput sequencing. The experiment is of yeast released synchronously into cell cycling, so that we can see which genes are active at different points in the cycle. Several strains of yeast have been used, a wildtype and several gene knock-downs. Each has nine time points, covering two cell cycles (two hours).

This is much smaller than a typical RNA-Seq dataset. We are only looking at 44 genes, rather than all of the thousands of genes in the yeast genome.

Recall the diagram from the “R for Data Science” book. We will here focus on the “import”, “tidy”, “transform”, and “visualize” steps.

Importing and tidying

Hint: You can select from the top of a pipeline to partway through and press Ctrl-Enter or Command-Enter to see the value part-way through the pipeline.

# Use readr's read_csv function to load the file
counts_untidy <- read_csv("r-progtidy-files/read-counts.csv")
## Rows: 45 Columns: 41
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Feature
## dbl (40): WT:1, WT:2, WT:3, WT:4, WT:5, WT:6, WT:7, WT:8, WT:9, WT:10, SET1:...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
counts <- counts_untidy |>
    pivot_longer(cols=!Feature, names_to="sample", values_to="count") |>
    separate(sample, sep=":", into=c("strain","time"), convert=TRUE, remove=FALSE) |>
    mutate(
        sample = factor(sample, levels=unique(sample)),
        strain = factor(strain, levels=c("WT","SET1","RRP6","SET1-RRP6"))
    ) |>
    filter(time >= 2) |>
    select(sample, strain, time, gene=Feature, count)

summary(counts)
##      sample           strain         time        gene          
##  WT:2   :  45   WT       :405   Min.   : 2   Length:1620       
##  WT:3   :  45   SET1     :405   1st Qu.: 4   Class :character  
##  WT:4   :  45   RRP6     :405   Median : 6   Mode  :character  
##  WT:5   :  45   SET1-RRP6:405   Mean   : 6                     
##  WT:6   :  45                   3rd Qu.: 8                     
##  WT:7   :  45                   Max.   :10                     
##  (Other):1350                                                  
##      count       
##  Min.   :     0  
##  1st Qu.:   172  
##  Median :   666  
##  Mean   :  3612  
##  3rd Qu.:  2548  
##  Max.   :143592  
## 

Time point 1 was omitted because it isn’t actually part of the time series, it’s from cells in an unsynchronized state. If we wanted to be even tidier, the remaining time points could be recoded with the actual time within the experiment – they are at 15 minute intervals.

Transformation and normalization

ggplot(counts, aes(x=sample, y=count)) + 
    geom_boxplot() + 
    coord_flip()

We immediately see from a Tukey box plot that the data is far from normal – there are outliers many box-widths to the right of the box.

Perhaps a log transformation is in order.

ggplot(counts, aes(x=sample, y=log2(count))) + 
    geom_boxplot() + 
    coord_flip()
## Warning: Removed 14 rows containing non-finite values (`stat_boxplot()`).

ggplot(counts, aes(x=log2(count), group=sample)) + 
    geom_line(stat="density")
## Warning: Removed 14 rows containing non-finite values (`stat_density()`).

Log transformation has greatly improved things, although now there are more outliers left of the whiskers, and some zeros we will need to be careful of. We also see one of the samples has less reads than the others.

The gene SRP68 was included as a control housekeeping gene. Its expression level should be the same in all samples. We will divide counts in each sample by the count for SRP68 to correct for library size, then log-transform the data. We add a small constant when log-transforming so that zeros do not become negative infinity (as if adding 1 further read count to an average sample).

normalizer <- counts |>
    filter(gene == "SRP68") |>
    select(sample, norm=count)

moderation <- 1/mean(normalizer$norm)

counts_norm <- counts |>
    left_join(normalizer, by="sample") |>
    mutate(
        norm_count = count/norm, 
        log_norm_count = log2(norm_count+moderation)
    )

ggplot(counts_norm, aes(x=sample, y=log_norm_count)) + 
    geom_boxplot() + 
    coord_flip()

Note: Here all of the genes are potentially differentially expressed. For a full RNA-Seq data set we would probably assume most genes aren’t differentially expressed, so rather than nominating a normalizing gene we would normalize by total library size with, typically with a slight adjustment using the “TMM” method as calculated by calcNormFactors in edgeR:

# For a full sized RNA-Seq dataset:
library(edgeR)
mat <- counts_untidy |> 
    column_to_rownames("Feature") |> 
    as.matrix()
adjusted_lib_sizes <- colSums(mat) * calcNormFactors(mat)
normalizer_by_tmm <- tibble(
    sample=names(adjusted_lib_sizes), 
    norm=adjusted_lib_sizes)

(In practice, with a full size RNA-Seq dataset I would use the edgeR function cpm, which performs the same type of normalization and log transformation as above.)

Whatever your data is measuring, two common themes are:

  1. Measurement is usually normalized relative to something (housekeeping gene, overall library size, microscope image resolution, a ruler in a photograph, calibration run of a machine, …).

  2. If the measurement is by nature a positive number, consider log transformation. Two further indications this is appropriate are if you expect the effects of treatmeants to be multiplicative, or if the noise level scales with the magnitude of the measurement.

Visualization

The data is now suitably transformed and normalized for visualization. ggplot2 gives us a lot of freedom to juxtapose information from different columns.

Heatmaps using geom_tile and the use of facets allow the entire dataset to be viewed at once.

ggplot(counts_norm, aes(x=sample, y=gene, fill=log_norm_count)) + 
    geom_tile() + 
    scale_fill_viridis_c() + 
    theme_minimal() +
    theme(axis.text.x=element_text(           # Vertical text on x axis
              angle=90,vjust=0.5,hjust=1))              

Facetting the heatmap makes it easier to understand.

ggplot(counts_norm, aes(x=time, y=gene, fill=log_norm_count)) + 
    geom_tile() + 
    facet_grid(~ strain) + 
    scale_fill_viridis_c() + 
    theme_minimal()

We are interested in the interaction of strain and time, so we can instead set these as the x and y axes for each gene.

ggplot(counts_norm, aes(x=time, y=strain, fill=log_norm_count)) + 
    geom_tile() + 
    facet_wrap(~ gene) + 
    scale_fill_viridis_c() + 
    theme_minimal()

A famous paper by Cleveland and McGill (1984) ranks color last in terms of perceptual accuracy. Perhaps a different type of plot will be better?

A more traditional way of examining the interaction between two factors is a line graph called an “interaction plot”. We can also give each gene a separate scale.

ggplot(counts_norm, aes(x=time, y=log_norm_count, color=strain, group=strain)) + 
    geom_line() + 
    facet_wrap(~ gene, scale="free")

Exercises

  1. Which are the three most variable genes?

Hint: intToUtf8(utf8ToInt("xvh#jurxsbe|/#vxppdul}h/#vg/#dqg#duudqjh")-3)

  1. Different genes have different average expression levels, but what we are interested in is how they change over time. Further normalize the data by subtracting the average for each gene from log_norm_count.

sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.1    
##  [5] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
##  [9] ggplot2_3.4.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.10        bslib_0.4.2       compiler_4.2.3    pillar_1.9.0     
##  [5] jquerylib_0.1.4   tools_4.2.3       bit_4.0.5         digest_0.6.31    
##  [9] viridisLite_0.4.1 timechange_0.2.0  jsonlite_1.8.4    evaluate_0.20    
## [13] lifecycle_1.0.3   gtable_0.3.3      pkgconfig_2.0.3   rlang_1.1.0      
## [17] cli_3.6.1         parallel_4.2.3    yaml_2.3.7        xfun_0.38        
## [21] fastmap_1.1.1     withr_2.5.0       knitr_1.42        generics_0.1.3   
## [25] vctrs_0.6.1       sass_0.4.5        hms_1.1.3         bit64_4.0.5      
## [29] grid_4.2.3        tidyselect_1.2.0  glue_1.6.2        R6_2.5.1         
## [33] fansi_1.0.4       vroom_1.6.1       rmarkdown_2.21    farver_2.1.1     
## [37] tzdb_0.3.0        magrittr_2.0.3    scales_1.2.1      htmltools_0.5.5  
## [41] colorspace_2.1-0  labeling_0.4.2    utf8_1.2.3        stringi_1.7.12   
## [45] munsell_0.5.0     cachem_1.0.7      crayon_1.5.2

References

Cleveland, William S., and Robert McGill. 1984. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association 79 (387): 531–54. https://doi.org/10.1080/01621459.1984.10478080.