This lesson covers packages, primarily by Hadley Wickham, for tidying data and then working with it in tidy form, collectively known as the "tidyverse".
The packages we are using in this lesson are all from CRAN, so we can install them with install.packages. If you did this while setting up for this workshop, you won't need to do it again now.
# install.packages("tidyverse")
library(tidyverse) # Load all "tidyverse" libraries.
# OR
# library(readr) # Read tabular data.
# library(tidyr) # Data frame tidying functions.
# library(dplyr) # General data frame manipulation.
# library(ggplot2) # Flexible plotting.
These packages usually have useful documentation in the form of “vignettes”. These are readable on the CRAN website, or within R:
vignette()
vignette(package="dplyr")
vignette("dplyr", package="dplyr")
We will continue using the FastQC output data frame. If you’re starting fresh for this lesson, load it with:
bigtab <- read_csv("r-progtidy-files/fastqc.csv")
## Rows: 72 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): grade, test, file
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dplyr gives us a collection of convenient "verbs" for manipulating data frames. The name is a contraction of "Data frame apPLYeR", as some of these verbs have a similar flavour to apply/lapply/tapply/etc.
Each verb takes a data frame as input and returns a modified version of it. The idea is that complex operations can be performed by stringing together a series of simpler operations in a pipeline.
# input +--------+ +--------+ +--------+ result
# data |> | verb | |> | verb | |> | verb | -> data
# frame +--------+ +--------+ +--------+ frame
Some of these verbs are covered in our introductory R workshop, so some of this may be familiar already.
Because we used readr to read the data, bigtab is a "tibble", which is the Tidyverse's improved data frame.
bigtab
## # A tibble: 72 × 3
## grade test file
## <chr> <chr> <chr>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 FAIL Per base sequence content Day0.fastq
## 6 WARN Per sequence GC content Day0.fastq
## 7 PASS Per base N content Day0.fastq
## 8 PASS Sequence Length Distribution Day0.fastq
## 9 PASS Sequence Duplication Levels Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## # ℹ 62 more rows
You can also create tibbles explicitly with the tibble function. One convenient feature of tibbles is that only the first few rows are shown when you print them. If you do want to see the whole table, you can convert it with as.data.frame, or use the View function to view it in a tab.
as.data.frame(bigtab)
View(bigtab)
The n and width arguments to print can also be used to print more rows or columns respectively.
print(bigtab, n=100, width=1000)
Say we want to know all the tests that failed. filter provides a convenient way to get rows matching a query.
filter(bigtab, grade == "FAIL")
## # A tibble: 8 × 3
## grade test file
## <chr> <chr> <chr>
## 1 FAIL Per base sequence content Day0.fastq
## 2 FAIL Per base sequence content Day4.fastq
## 3 FAIL Per base sequence content Day7.fastq
## 4 FAIL Per sequence GC content Day7.fastq
## 5 FAIL Per base sequence content Day10.fastq
## 6 FAIL Per base sequence content Day15.fastq
## 7 FAIL Per base sequence content Day20.fastq
## 8 FAIL Per sequence GC content Day20.fastq
Something is magic here: we do not have grade in our environment, only within the data frame. Arguments to filter can use any column of the data frame as a variable. dplyr uses this idea heavily; arguments to dplyr verbs often behave in a special way.
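Multiple arguments to filter are combined with a logical "and": each condition must hold for a row to be kept. For example, to see the tests for Day0.fastq that did not pass (output not shown):
filter(bigtab, grade != "PASS", file == "Day0.fastq")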
Rather than filtering, we might instead want to sort the data so the most important rows are at the top. arrange sorts a data frame by one or more columns.
arrange(bigtab, grade)
## # A tibble: 72 × 3
## grade test file
## <chr> <chr> <chr>
## 1 FAIL Per base sequence content Day0.fastq
## 2 FAIL Per base sequence content Day4.fastq
## 3 FAIL Per base sequence content Day7.fastq
## 4 FAIL Per sequence GC content Day7.fastq
## 5 FAIL Per base sequence content Day10.fastq
## 6 FAIL Per base sequence content Day15.fastq
## 7 FAIL Per base sequence content Day20.fastq
## 8 FAIL Per sequence GC content Day20.fastq
## 9 PASS Basic Statistics Day0.fastq
## 10 PASS Per base sequence quality Day0.fastq
## # ℹ 62 more rows
# desc() can be used to reverse the sort order
arrange(bigtab, desc(grade))
## # A tibble: 72 × 3
## grade test file
## <chr> <chr> <chr>
## 1 WARN Per sequence GC content Day0.fastq
## 2 WARN Overrepresented sequences Day0.fastq
## 3 WARN Per sequence GC content Day4.fastq
## 4 WARN Per sequence GC content Day10.fastq
## 5 WARN Overrepresented sequences Day10.fastq
## 6 WARN Per sequence GC content Day15.fastq
## 7 WARN Overrepresented sequences Day15.fastq
## 8 WARN Overrepresented sequences Day20.fastq
## 9 PASS Basic Statistics Day0.fastq
## 10 PASS Per base sequence quality Day0.fastq
## # ℹ 62 more rows
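arrange can also sort by several columns at once, with later columns used to break ties in earlier ones. For example (output not shown):
arrange(bigtab, grade, file)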
Say we want to convert PASS/WARN/FAIL into a numeric score, so we can produce some numerical summaries. The scoring scheme will be:
fwp <- c("FAIL","WARN","PASS")
scoring <- tibble(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))
scoring
## # A tibble: 3 × 2
## grade score
## <fct> <dbl>
## 1 FAIL 0
## 2 WARN 0.5
## 3 PASS 1
dplyr has several join functions, including left_join, right_join, inner_join, full_join, and anti_join. The difference between these functions is what happens when there is a row in one data frame without a corresponding row in the other data frame. inner_join discards such rows. full_join always keeps them, filling in missing data with NA. left_join always keeps rows from the first data frame. right_join always keeps rows from the second data frame. anti_join is a bit different: it gives you the rows from the first data frame that aren't in the second data frame.
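To make the differences concrete, here is a small sketch with two toy tibbles (the names a, b, key, a_val, and b_val are made up for illustration):
a <- tibble(key=c("x","y"), a_val=c(1,2))
b <- tibble(key=c("y","z"), b_val=c(3,4))
inner_join(a, b, by="key") # only the shared key "y" survives
full_join(a, b, by="key")  # keys "x", "y", "z" all kept, NAs fill the gaps
anti_join(a, b, by="key")  # rows of a with no match in b: just "x"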
We can use a join to augment a data frame with some extra information. left_join is a good default choice for this, as it will never delete rows from the data frame that is being augmented.
scoretab <- left_join(bigtab, scoring, by="grade")
scoretab
## # A tibble: 72 × 4
## grade test file score
## <chr> <chr> <chr> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1
## 2 PASS Per base sequence quality Day0.fastq 1
## 3 PASS Per tile sequence quality Day0.fastq 1
## 4 PASS Per sequence quality scores Day0.fastq 1
## 5 FAIL Per base sequence content Day0.fastq 0
## 6 WARN Per sequence GC content Day0.fastq 0.5
## 7 PASS Per base N content Day0.fastq 1
## 8 PASS Sequence Length Distribution Day0.fastq 1
## 9 PASS Sequence Duplication Levels Day0.fastq 1
## 10 WARN Overrepresented sequences Day0.fastq 0.5
## # ℹ 62 more rows
The grade columns act as the keys of the two data frames, allowing corresponding rows in the data frames to be joined together.
One important thing that all the join functions do: if multiple rows have the same key in either data frame, all ways of combining the two sets of rows will be included in the result. So, here, rows from the scoring data frame have been copied many times in the output.
Joins are the key tool for integrating different types of data, based on some common key such as gene ID.
See also: match and merge in base R.
summarize lets us compute summaries of data.
summarize(scoretab, average_score=mean(score))
## # A tibble: 1 × 1
## average_score
## <dbl>
## 1 0.833
Here, average_score is used to name the new column. Any summary function could be used in place of mean, such as sum, median, or sd.
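For example, several summaries can be computed in one call (the column names here are our own invention; output not shown):
summarize(scoretab, total=sum(score), median_score=median(score), sd_score=sd(score))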
We really want to summarize the data grouped by file, so we can see if there are any particularly bad files. This is achieved using the group_by "adjective". group_by gives the data frame a grouping using one or more columns, which modifies the subsequent call to summarize. When used with group_by, summarize turns each group of rows into a single row:
group_by(scoretab, file)
## # A tibble: 72 × 4
## # Groups: file [6]
## grade test file score
## <chr> <chr> <chr> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1
## 2 PASS Per base sequence quality Day0.fastq 1
## 3 PASS Per tile sequence quality Day0.fastq 1
## 4 PASS Per sequence quality scores Day0.fastq 1
## 5 FAIL Per base sequence content Day0.fastq 0
## 6 WARN Per sequence GC content Day0.fastq 0.5
## 7 PASS Per base N content Day0.fastq 1
## 8 PASS Sequence Length Distribution Day0.fastq 1
## 9 PASS Sequence Duplication Levels Day0.fastq 1
## 10 WARN Overrepresented sequences Day0.fastq 0.5
## # ℹ 62 more rows
summarize(group_by(scoretab, file), average_score=mean(score))
## # A tibble: 6 × 2
## file average_score
## <chr> <dbl>
## 1 Day0.fastq 0.833
## 2 Day10.fastq 0.833
## 3 Day15.fastq 0.833
## 4 Day20.fastq 0.792
## 5 Day4.fastq 0.875
## 6 Day7.fastq 0.833
The special function n() can be used within summarize to get the number of rows. (This also works in mutate, but is most useful in summarize.)
summarize(group_by(scoretab, grade), count=n())
## # A tibble: 3 × 2
## grade count
## <chr> <int>
## 1 FAIL 8
## 2 PASS 56
## 3 WARN 8
See also: count and distinct, shortcut functions for simple uses of summarize.
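As a quick sketch of those shortcuts (output not shown):
count(scoretab, grade)    # tally rows for each grade, like group_by plus summarize with n()
distinct(scoretab, grade) # the unique values of grade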
See also: table and tapply in base R.
Tip: group_by also affects other verbs, such as mutate. This is more advanced than we will be going into today. Grouping can be removed with ungroup.
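As a small sketch of that more advanced behaviour (the column name file_average is our own invention): after group_by, mutate computes within each group, so every row receives its group's value, and ungroup then removes the grouping.
ungroup(mutate(group_by(scoretab, file), file_average=mean(score)))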
We often want to string together a series of dplyr functions. This is achieved using R's pipe operator, |>. This takes the value on the left and passes it as the first argument to the function call on the right. So the previous summarization could be written:
scoretab |> group_by(grade) |> summarize(count=n())
In older books and web pages you may see %>% used instead of |>. This is a predecessor to the now standardized R pipe, defined in the magrittr package.
|> isn't limited to dplyr functions. It's an alternative way of writing any R code.
rep(paste("hello", "world"), 5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"
"hello" |> paste("world") |> rep(5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"
Often for readability we will write a pipeline over several lines:
scoretab |>
group_by(grade) |>
summarize(count=n())
The idea of adding geoms in ggplot2 is rather like the dplyr pipe. The only reason pipes aren't used in ggplot2 is that it predates dplyr and the invention of the pipe.
Write a pipeline using |>s that starts with bigtab, joins the scoring table, and then calculates the average score for each test.
A couple more verbs will complete our core vocabulary. mutate lets us add or overwrite columns by computing a new value for them.
mutate(scoretab, doublescore = score*2)
Equivalently:
scoretab |>
mutate(doublescore = score*2)
## # A tibble: 72 × 5
## grade test file score doublescore
## <chr> <chr> <chr> <dbl> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1 2
## 2 PASS Per base sequence quality Day0.fastq 1 2
## 3 PASS Per tile sequence quality Day0.fastq 1 2
## 4 PASS Per sequence quality scores Day0.fastq 1 2
## 5 FAIL Per base sequence content Day0.fastq 0 0
## 6 WARN Per sequence GC content Day0.fastq 0.5 1
## 7 PASS Per base N content Day0.fastq 1 2
## 8 PASS Sequence Length Distribution Day0.fastq 1 2
## 9 PASS Sequence Duplication Levels Day0.fastq 1 2
## 10 WARN Overrepresented sequences Day0.fastq 0.5 1
## # ℹ 62 more rows
Even though this is called “mutate”, it is not literally modifying the input. Rather it is producing a copy of the data frame that has the modification.
The above is equivalent to:
scoretab2 <- scoretab
scoretab2$doublescore <- scoretab2$score * 2
dplyr's select function is for subsetting, renaming, and reordering columns. Old columns can be referred to by name or by number.
select(bigtab, test, grade)
## # A tibble: 72 × 2
## test grade
## <chr> <chr>
## 1 Basic Statistics PASS
## 2 Per base sequence quality PASS
## 3 Per tile sequence quality PASS
## 4 Per sequence quality scores PASS
## 5 Per base sequence content FAIL
## 6 Per sequence GC content WARN
## 7 Per base N content PASS
## 8 Sequence Length Distribution PASS
## 9 Sequence Duplication Levels PASS
## 10 Overrepresented sequences WARN
## # ℹ 62 more rows
select(bigtab, 2, 1)
## # A tibble: 72 × 2
## test grade
## <chr> <chr>
## 1 Basic Statistics PASS
## 2 Per base sequence quality PASS
## 3 Per tile sequence quality PASS
## 4 Per sequence quality scores PASS
## 5 Per base sequence content FAIL
## 6 Per sequence GC content WARN
## 7 Per base N content PASS
## 8 Sequence Length Distribution PASS
## 9 Sequence Duplication Levels PASS
## 10 Overrepresented sequences WARN
## # ℹ 62 more rows
select(bigtab, foo=file, bar=test, baz=grade)
## # A tibble: 72 × 3
## foo bar baz
## <chr> <chr> <chr>
## 1 Day0.fastq Basic Statistics PASS
## 2 Day0.fastq Per base sequence quality PASS
## 3 Day0.fastq Per tile sequence quality PASS
## 4 Day0.fastq Per sequence quality scores PASS
## 5 Day0.fastq Per base sequence content FAIL
## 6 Day0.fastq Per sequence GC content WARN
## 7 Day0.fastq Per base N content PASS
## 8 Day0.fastq Sequence Length Distribution PASS
## 9 Day0.fastq Sequence Duplication Levels PASS
## 10 Day0.fastq Overrepresented sequences WARN
## # ℹ 62 more rows
select has a special syntax for more complicated column selections; see the help page ?dplyr::select. For example, you can remove a specific column like this:
select(bigtab, !file)
## # A tibble: 72 × 2
## grade test
## <chr> <chr>
## 1 PASS Basic Statistics
## 2 PASS Per base sequence quality
## 3 PASS Per tile sequence quality
## 4 PASS Per sequence quality scores
## 5 FAIL Per base sequence content
## 6 WARN Per sequence GC content
## 7 PASS Per base N content
## 8 PASS Sequence Length Distribution
## 9 PASS Sequence Duplication Levels
## 10 WARN Overrepresented sequences
## # ℹ 62 more rows
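A few more sketches of this selection syntax (output not shown; these helpers are part of the tidyselect system used by select):
select(bigtab, starts_with("f"))    # columns whose names start with "f"
select(bigtab, where(is.character)) # columns for which a predicate is TRUE
select(bigtab, test:file)           # a range of adjacent columns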
With filter, arrange, mutate, select, group_by, summarize, and joins, we have a core vocabulary for manipulating data frames in pipelines.
tidyr is the Tidyverse package for getting data frames into tidy form. In a tidy data frame, each row is a single unit of observation, each column is a single piece of information, and each value has its own cell.
I'm not convinced tidyr is a great package, but the ideas behind it are very important. dplyr uses magic arguments to produce beautiful R code, but in tidyr it gets a bit confusing. Anyway, let's work through a toy example:
untidy <- read_csv(
"country, male-young, male-old, female-young, female-old
Australia, 1, 2, 3, 4
New Zealand, 5, 6, 7, 8")
## Rows: 2 Columns: 5
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (4): male-young, male-old, female-young, female-old
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
untidy
## # A tibble: 2 × 5
## country `male-young` `male-old` `female-young` `female-old`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 1 2 3 4
## 2 New Zealand 5 6 7 8
In this example, the first problem is that rows are not distinct units of observation; there are actually four observations per row. This is fixed using pivot_longer. (This operation has in the past also been called gather or melt.)
longer <- pivot_longer(untidy, cols=!country, names_to="group", values_to="cases")
longer
## # A tibble: 8 × 3
## country group cases
## <chr> <chr> <dbl>
## 1 Australia male-young 1
## 2 Australia male-old 2
## 3 Australia female-young 3
## 4 Australia female-old 4
## 5 New Zealand male-young 5
## 6 New Zealand male-old 6
## 7 New Zealand female-young 7
## 8 New Zealand female-old 8
We give a data frame, a specification of columns to use (cols=), a column to put the column names in (names_to=), and a column to put the values in (values_to=). The column specification behaves like select. Here we've asked to use all columns except country. We could equivalently have put cols=2:5, as shown below.
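That equivalent call, producing the same result as above, would be:
pivot_longer(untidy, cols=2:5, names_to="group", values_to="cases")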
pivot_wider is the opposite operation, spreading two columns into many columns. This generally makes data less tidy, but it is sometimes necessary. A common application is producing a scatter plot, where the x and y axes need to be two different columns even though they measure the same type of thing. Data may also be easier to look at in a table in spread form.
pivot_wider(longer, names_from=group, values_from=cases)
## # A tibble: 2 × 5
## country `male-young` `male-old` `female-young` `female-old`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 1 2 3 4
## 2 New Zealand 5 6 7 8
pivot_wider(bigtab, names_from=file, values_from=grade)
## # A tibble: 12 × 7
## test Day0.fastq Day4.fastq Day7.fastq Day10.fastq Day15.fastq Day20.fastq
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Basic S… PASS PASS PASS PASS PASS PASS
## 2 Per bas… PASS PASS PASS PASS PASS PASS
## 3 Per til… PASS PASS PASS PASS PASS PASS
## 4 Per seq… PASS PASS PASS PASS PASS PASS
## 5 Per bas… FAIL FAIL FAIL FAIL FAIL FAIL
## 6 Per seq… WARN WARN FAIL WARN WARN FAIL
## 7 Per bas… PASS PASS PASS PASS PASS PASS
## 8 Sequenc… PASS PASS PASS PASS PASS PASS
## 9 Sequenc… PASS PASS PASS PASS PASS PASS
## 10 Overrep… WARN PASS PASS WARN WARN WARN
## 11 Adapter… PASS PASS PASS PASS PASS PASS
## 12 Kmer Co… PASS PASS PASS PASS PASS PASS
In our toy example, we have a further problem: the "group" column contains two pieces of data. This can be fixed with separate. By default separate splits on any non-alphanumeric characters, but different separator characters can be specified.
separate(longer, col=group, into=c("gender","age"))
## # A tibble: 8 × 4
## country gender age cases
## <chr> <chr> <chr> <dbl>
## 1 Australia male young 1
## 2 Australia male old 2
## 3 Australia female young 3
## 4 Australia female old 4
## 5 New Zealand male young 5
## 6 New Zealand male old 6
## 7 New Zealand female young 7
## 8 New Zealand female old 8
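For example, to split explicitly on the dash rather than relying on the default, the sep= argument can be given (producing the same result as above):
separate(longer, col=group, into=c("gender","age"), sep="-")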
All of this would typically be written as a single pipeline:
tidied <- untidy |>
pivot_longer(cols=!country, names_to="group", values_to="cases") |>
separate(group, into=c("gender","age"))
Finally, we mention that pivot_longer has features we haven't explored, and it is actually possible to do this in one step. See the tidyr vignettes.
pivot_longer(
untidy, cols=!country,
names_to=c("gender","age"), names_sep="-", values_to="cases")
## # A tibble: 8 × 4
## country gender age cases
## <chr> <chr> <chr> <dbl>
## 1 Australia male young 1
## 2 Australia male old 2
## 3 Australia female young 3
## 4 Australia female old 4
## 5 New Zealand male young 5
## 6 New Zealand male old 6
## 7 New Zealand female young 7
## 8 New Zealand female old 8
Advanced: tidyr has a number of other useful data frame manipulations. For completeness, we mention nest. This creates a list column in a tibble, with the list column containing one tibble per row. Yo, Hadley put tibbles in your tibble so you can dplyr while you dplyr. The inverse operation is unnest. unnest is rather like the bind_rows function we saw earlier. nest and unnest go well with creating or modifying list columns in a mutate with lapply (or the purrr package).
# Advanced
nested <- nest(tidied, data=c(gender, age, cases)) # one row per country, plus a list column
nested
nested$data          # a list of tibbles, one per row of nested
unnest(nested, data) # back to the original tidied form
You receive data on a set of points. The points are in two dimensions (dim), and each point has x and y coordinates. Unfortunately it looks like this:
df <- read_csv(
"dim, E_1, E_2, M_1, M_2, M_3, M_4, M_5
x, 2, 4, 1, 2, 3, 4, 5
y, 4, 4, 2, 1, 1, 1, 2")
## Rows: 2 Columns: 8
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): dim
## dbl (7): E_1, E_2, M_1, M_2, M_3, M_4, M_5
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Tidy the data by pivoting longer all of the columns except dim. What does each row now represent?
We want to plot the points as a scatter-plot, using either plot or ggplot. Pivot the long data wider so that this is possible. Now what do the rows represent?
The data seems to be divided into "E" and "M" points. How could we make a column containing "E"s and "M"s?
We will now put this all together by looking at a small-scale gene expression data set. To simplify somewhat: the data was generated from a multiplex PCR targeting 44 genes. The PCR product is quantified by counting reads from high throughput sequencing. The experiment is of yeast released synchronously into cell cycling, so that we can see which genes are active at different points in the cycle. Several strains of yeast have been used: a wildtype and several gene knock-downs. Each has nine time points, covering two cell cycles (two hours).
This is much smaller than a typical RNA-Seq dataset. We are only looking at 44 genes, rather than all of the thousands of genes in the yeast genome.
Recall the diagram from the “R for Data Science” book. We will here focus on the “import”, “tidy”, “transform”, and “visualize” steps.
Hint: You can select code from the top of a pipeline to partway through and press Ctrl-Enter or Command-Enter to see the value at that point in the pipeline.
# Use readr's read_csv function to load the file
counts_untidy <- read_csv("r-progtidy-files/read-counts.csv")
## Rows: 45 Columns: 41
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Feature
## dbl (40): WT:1, WT:2, WT:3, WT:4, WT:5, WT:6, WT:7, WT:8, WT:9, WT:10, SET1:...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
counts <- counts_untidy |>
pivot_longer(cols=!Feature, names_to="sample", values_to="count") |>
separate(sample, sep=":", into=c("strain","time"), convert=TRUE, remove=FALSE) |>
mutate(
sample = factor(sample, levels=unique(sample)),
strain = factor(strain, levels=c("WT","SET1","RRP6","SET1-RRP6"))
) |>
filter(time >= 2) |>
select(sample, strain, time, gene=Feature, count)
summary(counts)
## sample strain time gene
## WT:2 : 45 WT :405 Min. : 2 Length:1620
## WT:3 : 45 SET1 :405 1st Qu.: 4 Class :character
## WT:4 : 45 RRP6 :405 Median : 6 Mode :character
## WT:5 : 45 SET1-RRP6:405 Mean : 6
## WT:6 : 45 3rd Qu.: 8
## WT:7 : 45 Max. :10
## (Other):1350
## count
## Min. : 0
## 1st Qu.: 172
## Median : 666
## Mean : 3612
## 3rd Qu.: 2548
## Max. :143592
##
Time point 1 was omitted because it isn't actually part of the time series; it's from cells in an unsynchronized state. If we wanted to be even tidier, the remaining time points could be recoded with the actual time within the experiment – they are at 15 minute intervals.
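A minimal sketch of that recoding, assuming time point 2 corresponds to 0 minutes (an assumption on our part, not something recorded in the data):
# Assumes time point 2 is the 0 minute mark, with points 15 minutes apart
mutate(counts, minutes=(time-2)*15)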
ggplot(counts, aes(x=sample, y=count)) +
geom_boxplot() +
coord_flip()
We immediately see from a Tukey box plot that the data is far from normal – there are outliers many box-widths to the right of the box.
Perhaps a log transformation is in order.
ggplot(counts, aes(x=sample, y=log2(count))) +
geom_boxplot() +
coord_flip()
## Warning: Removed 14 rows containing non-finite values (`stat_boxplot()`).
ggplot(counts, aes(x=log2(count), group=sample)) +
geom_line(stat="density")
## Warning: Removed 14 rows containing non-finite values (`stat_density()`).
Log transformation has greatly improved things, although now there are more outliers left of the whiskers, and some zeros we will need to be careful of. We also see that one of the samples has fewer reads than the others.
The gene SRP68 was included as a control housekeeping gene. Its expression level should be the same in all samples. We will divide counts in each sample by the count for SRP68 to correct for library size, then log-transform the data. We add a small constant when log-transforming so that zeros do not become negative infinity (as if adding 1 further read count to an average sample).
normalizer <- counts |>
filter(gene == "SRP68") |>
select(sample, norm=count)
moderation <- 1/mean(normalizer$norm)
counts_norm <- counts |>
left_join(normalizer, by="sample") |>
mutate(
norm_count = count/norm,
log_norm_count = log2(norm_count+moderation)
)
ggplot(counts_norm, aes(x=sample, y=log_norm_count)) +
geom_boxplot() +
coord_flip()
Note: Here all of the genes are potentially differentially expressed. For a full RNA-Seq data set we would probably assume most genes aren't differentially expressed, so rather than nominating a normalizing gene we would normalize by total library size, typically with a slight adjustment using the "TMM" method as calculated by calcNormFactors in edgeR:
# For a full sized RNA-Seq dataset:
library(edgeR)
mat <- counts_untidy |>
column_to_rownames("Feature") |>
as.matrix()
adjusted_lib_sizes <- colSums(mat) * calcNormFactors(mat)
normalizer_by_tmm <- tibble(
sample=names(adjusted_lib_sizes),
norm=adjusted_lib_sizes)
(In practice, with a full size RNA-Seq dataset I would use the edgeR function cpm, which performs the same type of normalization and log transformation as above.)
Whatever your data is measuring, two common themes are:
Measurement is usually normalized relative to something (housekeeping gene, overall library size, microscope image resolution, a ruler in a photograph, calibration run of a machine, …).
If the measurement is by nature a positive number, consider log transformation. Two further indications this is appropriate are if you expect the effects of treatments to be multiplicative, or if the noise level scales with the magnitude of the measurement.
The data is now suitably transformed and normalized for visualization. ggplot2 gives us a lot of freedom to juxtapose information from different columns. Heatmaps using geom_tile and the use of facets allow the entire dataset to be viewed at once.
ggplot(counts_norm, aes(x=sample, y=gene, fill=log_norm_count)) +
geom_tile() +
scale_fill_viridis_c() +
theme_minimal() +
theme(axis.text.x=element_text( # Vertical text on x axis
angle=90,vjust=0.5,hjust=1))
Facetting the heatmap makes it easier to understand.
ggplot(counts_norm, aes(x=time, y=gene, fill=log_norm_count)) +
geom_tile() +
facet_grid(~ strain) +
scale_fill_viridis_c() +
theme_minimal()
We are interested in the interaction of strain and time, so we can instead set these as the x and y axes for each gene.
ggplot(counts_norm, aes(x=time, y=strain, fill=log_norm_count)) +
geom_tile() +
facet_wrap(~ gene) +
scale_fill_viridis_c() +
theme_minimal()
A famous paper by Cleveland and McGill (1984) ranks color last in terms of perceptual accuracy. Perhaps a different type of plot will be better?
A more traditional way of examining the interaction between two factors is a line graph called an “interaction plot”. We can also give each gene a separate scale.
ggplot(counts_norm, aes(x=time, y=log_norm_count, color=strain, group=strain)) +
geom_line() +
facet_wrap(~ gene, scale="free")
Hint: running intToUtf8(utf8ToInt("xvh#jurxsbe|/#vxppdul}h/#vg/#dqg#duudqjh")-3) will decode a hint. Work with the log_norm_count column.
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.1
## [5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [9] ggplot2_3.4.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] highr_0.10 bslib_0.4.2 compiler_4.2.3 pillar_1.9.0
## [5] jquerylib_0.1.4 tools_4.2.3 bit_4.0.5 digest_0.6.31
## [9] viridisLite_0.4.1 timechange_0.2.0 jsonlite_1.8.4 evaluate_0.20
## [13] lifecycle_1.0.3 gtable_0.3.3 pkgconfig_2.0.3 rlang_1.1.0
## [17] cli_3.6.1 parallel_4.2.3 yaml_2.3.7 xfun_0.38
## [21] fastmap_1.1.1 withr_2.5.0 knitr_1.42 generics_0.1.3
## [25] vctrs_0.6.1 sass_0.4.5 hms_1.1.3 bit64_4.0.5
## [29] grid_4.2.3 tidyselect_1.2.0 glue_1.6.2 R6_2.5.1
## [33] fansi_1.0.4 vroom_1.6.1 rmarkdown_2.21 farver_2.1.1
## [37] tzdb_0.3.0 magrittr_2.0.3 scales_1.2.1 htmltools_0.5.5
## [41] colorspace_2.1-0 labeling_0.4.2 utf8_1.2.3 stringi_1.7.12
## [45] munsell_0.5.0 cachem_1.0.7 crayon_1.5.2
References
Cleveland, W. S. and McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554.