Load packages
library(MASS) # ginv -- coefficient estimation
library(splines) # ns, bs -- spline curves
library(multcomp) # glht -- linear hypotheses
library(emmeans) # estimated marginal means
library(edgeR) # cpm, etc -- RNA-Seq normalization
library(limma) # lmFit, etc -- fitting many models
library(tidyverse) # working with data frames, plotting
Much of what we will be using is built into R without loading any
packages.
Vectors and
matrices
Vector
operations
a <- c(3,4)
b <- c(5,6)
length(a)
## [1] 2
R performs operations elementwise:
a + b
## [1] 8 10
a * b
## [1] 15 24
We will be using the dot product a lot. This is:
sum(a*b)
## [1] 39
t(a) %*% b
## [,1]
## [1,] 39
The geometric length of a vector is (by Pythagoras, aka
Euclidean distance):
sqrt(sum(a*a))
## [1] 5
Matrix
operations
We can create a matrix with matrix
, rbind
(row bind), or cbind
(column bind).
matrix(c(1,2,3,4), nrow=2, ncol=2)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
rbind(c(1,3), c(2,4))
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
cbind(c(1,2), c(3,4))
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
X <- rbind(
c(1,0),
c(1,0),
c(1,1),
c(1,1))
X
## [,1] [,2]
## [1,] 1 0
## [2,] 1 0
## [3,] 1 1
## [4,] 1 1
class(X)
## [1] "matrix" "array"
The matrix transpose is obtained with t
.
t(X)
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 1
## [2,] 0 0 1 1
Matrix multiplication is performed with %*%
. The dot
product of each row of the left hand side matrix and each column of the
right hand side matrix is calculated. %*%
treats a vector
as either a single column or single row matrix as will make sense for
matrix multiplication. Actually all we need today is to multiply a
matrix by a vector, in which case we get the dot product of each row of
the matrix with the vector.
X %*% a
## [,1]
## [1,] 3
## [2,] 3
## [3,] 7
## [4,] 7
as.vector(X %*% a)
## [1] 3 3 7 7
Challenge - use a dot product to calculate
The following dot product is an elaborate way to retrieve
x[2]
:
x <- c(10,20,30,40)
weights <- c(0,1,0,0) # <-- modify this line
sum(weights*x)
Modify weights
in the above to calculate different
quantities:
A. x[3]-x[2]
B. The mean of all four values.
Single numerical
predictor
The age (year) and height (cm) of 10 people has been measured. We
want a model that can predict height based on age.
people <- read_csv(
"age, height
10, 131
14, 147
16, 161
9, 136
16, 170
15, 160
15, 153
21, 187
9, 145
21, 195")
ggplot(people, aes(x=age, y=height)) + geom_point()
fit <- lm(height ~ age, data=people)
fit
##
## Call:
## lm(formula = height ~ age, data = people)
##
## Coefficients:
## (Intercept) age
## 92.612 4.513
Coefficients are extracted with coef
:
coef(fit)
## (Intercept) age
## 92.611502 4.512911
The residual standard deviation is extracted with
sigma
:
sigma(fit)
## [1] 7.263536
Behind the scenes a matrix of predictors has been produced from the
mysterious notation ~ age
. We can examine it
explicitly:
model.matrix(fit)
model.matrix
can be used without first calling
lm
.
model.matrix(~ age, data=people)
## (Intercept) age
## 1 1 10
## 2 1 14
## 3 1 16
## 4 1 9
## 5 1 16
## 6 1 15
## 7 1 15
## 8 1 21
## 9 1 9
## 10 1 21
## attr(,"assign")
## [1] 0 1
n=10 observations minus p=2 columns in the model matrix leaves 8
residual degrees of freedom:
df.residual(fit)
## [1] 8
Prediction
predict
predicts. By default it produces predictions on
the original dataset.
predict(fit)
## 1 2 3 4 5 6 7 8
## 137.7406 155.7923 164.8181 133.2277 164.8181 160.3052 160.3052 187.3826
## 9 10
## 133.2277 187.3826
predict(fit, interval="confidence")
## fit lwr upr
## 1 137.7406 129.8100 145.6712
## 2 155.7923 150.4399 161.1446
## 3 164.8181 159.2250 170.4111
## 4 133.2277 124.3009 142.1545
## 5 164.8181 159.2250 170.4111
## 6 160.3052 154.9836 165.6267
## 7 160.3052 154.9836 165.6267
## 8 187.3826 177.6105 197.1547
## 9 133.2277 124.3009 142.1545
## 10 187.3826 177.6105 197.1547
(Note: This is a 95% confidence interval for uncertainty in the
model. Individuals additionally vary with standard
deviation sigma.)
We can also calculate predictions manually.
# Prediction for a 15-year old
x <- c(1, 15)
beta <- coef(fit)
sum(x * beta)
## [1] 160.3052
# Prediction for all original data
X <- model.matrix(fit)
as.vector( X %*% beta )
## [1] 137.7406 155.7923 164.8181 133.2277 164.8181 160.3052 160.3052 187.3826
## [9] 133.2277 187.3826
predict
can be used with new data.
new_people <- tibble(age=5:25)
predict(fit, new_people)
## 1 2 3 4 5 6 7 8
## 115.1761 119.6890 124.2019 128.7148 133.2277 137.7406 142.2535 146.7664
## 9 10 11 12 13 14 15 16
## 151.2793 155.7923 160.3052 164.8181 169.3310 173.8439 178.3568 182.8697
## 17 18 19 20 21
## 187.3826 191.8955 196.4085 200.9214 205.4343
new_predictions <- cbind(
new_people,
predict(fit, new_people, interval="confidence"))
ggplot() +
geom_ribbon(aes(x=age, ymin=lwr, ymax=upr), data=new_predictions, fill="grey") +
geom_line(aes(x=age, y=fit), data=new_predictions, color="blue") +
geom_point(aes(x=age, y=height), data=people) +
labs(y="height (cm)", x="age (year)",
subtitle="Ribbon shows 95% confidence interval of the model")
If you have ever used geom_smooth
, it should now be a
little less mysterious.
ggplot(people, aes(x=age, y=height)) + geom_smooth(method="lm") + geom_point()
## `geom_smooth()` using formula = 'y ~ x'
Residuals
The residuals are the differences between the actual and predicted
values.
residuals(fit)
## 1 2 3 4 5 6 7
## -6.7406103 -8.7922535 -3.8180751 2.7723005 5.1819249 -0.3051643 -7.3051643
## 8 9 10
## -0.3826291 11.7723005 7.6173709
There should be no remaining relationship between predictions and the
residuals (or between any individual predictors and the residual).
plot(predict(fit), residuals(fit))
A Q-Q (quantile-quantile) plot sorts the residuals and compares them
to what would be expected from a normal distribution.
qqnorm(residuals(fit))
qqline(residuals(fit))
Ideally points would lie close to the line, but deviations are not a
disaster. Our coefficient estimates will tend toward normally
distributed errors even if the data does not, due to the Central Limit
Theorem. Wild outliers should be investigated, as they may have a large
effect on the model. We will see further examples of things to look for
in a Q-Q plot in section 6.
plot(fit)
produces a series of more sophisticated
diagnostic plots.
plot(fit)
Single factor
predictor, two levels
Consider a simple experiment where some outcome is measured for an
untreated and a treated group. This can be viewed as a one-way ANalysis
Of VAriance (ANOVA) experiment. (This is one of two senses in which the
term ANOVA will be used today.)
outcomes <- read_csv(
"group, outcome
untreated, 4.98
untreated, 5.17
untreated, 5.66
untreated, 4.87
treated, 8.07
treated, 11.02
treated, 9.91")
outcomes$group <- factor(outcomes$group, c("untreated", "treated"))
outfit <- lm(outcome ~ group, data=outcomes)
outfit
##
## Call:
## lm(formula = outcome ~ group, data = outcomes)
##
## Coefficients:
## (Intercept) grouptreated
## 5.170 4.497
df.residual(outfit)
## [1] 5
sigma(outfit)
## [1] 0.9804353
model.matrix(outfit)
## (Intercept) grouptreated
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 1
## 6 1 1
## 7 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
How coefficients are
estimated
Coefficients can be estimated from responses by multiplying by the
“Moore-Penrose generalized inverse” of X. It can be useful to examine
this matrix to work out exactly what a fit is doing. Each row shows how
the corresponding coefficient is estimated.
X <- model.matrix(outfit)
y <- outcomes$outcome
round(ginv(X), 3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.25 0.25 0.25 0.25 0.000 0.000 0.000
## [2,] -0.25 -0.25 -0.25 -0.25 0.333 0.333 0.333
ginv(X) %*% y
## [,1]
## [1,] 5.170000
## [2,] 4.496667
Here we can see the first coefficient is the average of the
“untreated” samples, and the second is the average of the “treated”
samples minus that average of the “untreated” samples.
( y
contains noise, assumed to be identically normally
distributed for each observation. Transformation of this noise by
ginv(X)
tells us the distribution of errors in the
coefficients (see vcov()
). This can be further propagated
to give the distribution of errors in predictions, and in other linear
combinations of coefficients. )
Class exercise - the meanings of coefficients
We now consider the formula outcome ~ 0 + group
.
Examine the model matrix that will be used:
model.matrix(~ 0 + group, data=outcomes)
What column has been removed because 0 +
was
used?
R has responded to this by being a bit clever when representing
the factor. What column has been added?
The mean of the untreated group is 5.2, the mean of the treated
group is 9.7, and the difference between them is 4.5. Without using
lm
, what values should the coefficients have to best fit
the data?
Now perform the actual linear model fit:
outfit2 <- lm(outcome ~ 0 + group, data=outcomes)
- Looking at the
residuals
and sigma
, does
the new model fit the data better or worse than the original?
Testing a
hypothesis
Besides data with categorical predictors, the term ANOVA is used to
refer to the use of the F test. Significance is judged based on
comparing the Residual Sums of Squares of two models. We fit a model
representing a null hypothesis. This model formula must nest within our
original model formula: any prediction it can make must also be possible
to be made by the original model formula. We compare the models using
the anova
function.
outfit0 <- lm(outcome ~ 1, data=outcomes)
anova(outfit0, outfit)
## Analysis of Variance Table
##
## Model 1: outcome ~ 1
## Model 2: outcome ~ group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6 39.469
## 2 5 4.806 1 34.663 36.06 0.001839 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Warning: This is not the only way to use the
anova( )
function, but I think it is the safest way. Once
we start using multiple predictors, the meaning of the output from
anova
with a single model is likely to be not quite what
you want, read the documentation carefully.
drop1(fit, test="F")
should usually be used instead.
summary( )
also outputs p-values. Too many p-values,
summary( )
doesn’t respect the hierarchy of terms in the
model. The p-value for dropping the intercept is nonsense. The p-values
are based on a t statistic, where F=t^2.
summary(outfit)
##
## Call:
## lm(formula = outcome ~ group, data = outcomes)
##
## Residuals:
## 1 2 3 4 5 6 7
## -1.900e-01 7.992e-16 4.900e-01 -3.000e-01 -1.597e+00 1.353e+00 2.433e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1700 0.4902 10.546 0.000132 ***
## grouptreated 4.4967 0.7488 6.005 0.001839 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9804 on 5 degrees of freedom
## Multiple R-squared: 0.8782, Adjusted R-squared: 0.8539
## F-statistic: 36.06 on 1 and 5 DF, p-value: 0.001839
confint( )
tells us not only that the difference between
groups is non-zero but places a confidence interval on the difference.
If the p-value were 0.05, the confidence interval would just touch zero.
Whenever we reject a hypothesis that a single coefficient is zero, we
may also conclude that we know its sign.
confint(outfit)
## 2.5 % 97.5 %
## (Intercept) 3.909855 6.430145
## grouptreated 2.571764 6.421569
These results exactly match those of a t-test.
t.test(outcomes$outcome[5:7], outcomes$outcome[1:4], var.equal=TRUE)
##
## Two Sample t-test
##
## data: outcomes$outcome[5:7] and outcomes$outcome[1:4]
## t = 6.005, df = 5, p-value = 0.001839
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.571764 6.421569
## sample estimates:
## mean of x mean of y
## 9.666667 5.170000
Challenge - does height change with age?
Return to the people
dataset.
Can we reject the hypothesis that height is unrelated to
age?
Compare the result to the outcome of a correlation test using
cor.test( )
.
What is the 95% confidence interval on the slope, in cm per
year?
Multiple factors, many
levels
Particle sizes of PVC plastic produced by a machine are measured. The
machine is operated by three different people, and eight different
batches of resin are used. Two measurements are made for each
combination of these two experimental factors.
(This example is adapted from a data-set in the faraway
package.)
pvc <- read_csv("r-linear-files/pvc.csv")
## Rows: 48 Columns: 3
## ── Column specification ──────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): operator, resin
## dbl (1): psize
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pvc$operator <- factor(pvc$operator)
pvc$resin <- factor(pvc$resin)
ggplot(pvc, aes(x=resin, y=psize)) + geom_point() + facet_grid(~operator)
Main effects
pvcfit1 <- lm(psize ~ operator + resin, data=pvc)
summary(pvcfit1)
##
## Call:
## lm(formula = psize ~ operator + resin, data = pvc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9500 -0.6125 -0.0167 0.4063 3.6833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.2396 0.5226 69.345 < 2e-16 ***
## operatorBob -0.2625 0.4048 -0.648 0.52059
## operatorCarl -1.5063 0.4048 -3.721 0.00064 ***
## resinR2 -1.0333 0.6610 -1.563 0.12630
## resinR3 -5.8000 0.6610 -8.774 1.13e-10 ***
## resinR4 -6.1833 0.6610 -9.354 2.11e-11 ***
## resinR5 -4.8000 0.6610 -7.261 1.09e-08 ***
## resinR6 -5.4500 0.6610 -8.245 5.46e-10 ***
## resinR7 -2.9167 0.6610 -4.412 8.16e-05 ***
## resinR8 -0.1833 0.6610 -0.277 0.78302
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.145 on 38 degrees of freedom
## Multiple R-squared: 0.8595, Adjusted R-squared: 0.8262
## F-statistic: 25.82 on 9 and 38 DF, p-value: 1.474e-13
confint(pvcfit1)
## 2.5 % 97.5 %
## (Intercept) 35.181635 37.2975319
## operatorBob -1.081983 0.5569834
## operatorCarl -2.325733 -0.6867666
## resinR2 -2.371544 0.3048775
## resinR3 -7.138211 -4.4617892
## resinR4 -7.521544 -4.8451225
## resinR5 -6.138211 -3.4617892
## resinR6 -6.788211 -4.1117892
## resinR7 -4.254877 -1.5784559
## resinR8 -1.521544 1.1548775
This model assumes the influence of the two factors is additive, the
model only contains the “main effects”. The meanings of the coefficients
are:
- “(Intercept)” is the particle size for Alice and R1
- “operatorBob” is the step in particle size going from Alice to
Bob
- “operatorCarl” is the step in particle size going from Alice to
Carl
- “resinR2” is the step in particle size going from R1 to R2
- “resinR3” is the step in particle size going from R1 to R3
- (etc)
We can use anova( )
to test if there is evidence either
of these main effects is important. For example, to test if there is
evidence that the operator is important to the outcome we can test
pvcfit1
against a model in which operator
is
dropped:
pvcfit0 <- lm(psize ~ resin, data=pvc)
anova(pvcfit0, pvcfit1)
## Analysis of Variance Table
##
## Model 1: psize ~ resin
## Model 2: psize ~ operator + resin
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 40 70.533
## 2 38 49.815 2 20.718 7.902 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The drop1( )
function with test="F"
can be
used as a shorthand to test dropping each term that can be dropped from
the model formula.
drop1(pvcfit1, test="F")
## Single term deletions
##
## Model:
## psize ~ operator + resin
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 49.82 21.782
## operator 2 20.718 70.53 34.474 7.902 0.00135 **
## resin 7 283.946 333.76 99.083 30.943 8.111e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interactions
We can ask if there is any interaction between the two factors. For
example Carl might produce particularly small particles with R5. An
additive model doesn’t allow for this.
pvcfit2 <- lm(psize ~ operator + resin + operator:resin, data=pvc)
# or
pvcfit2 <- lm(psize ~ operator*resin, data=pvc)
pvcfit2
##
## Call:
## lm(formula = psize ~ operator * resin, data = pvc)
##
## Coefficients:
## (Intercept) operatorBob operatorCarl
## 36.25 -0.85 -0.95
## resinR2 resinR3 resinR4
## -1.10 -5.55 -6.55
## resinR5 resinR6 resinR7
## -4.40 -6.05 -3.35
## resinR8 operatorBob:resinR2 operatorCarl:resinR2
## 0.55 1.05 -0.85
## operatorBob:resinR3 operatorCarl:resinR3 operatorBob:resinR4
## -0.20 -0.55 1.20
## operatorCarl:resinR4 operatorBob:resinR5 operatorCarl:resinR5
## -0.10 0.40 -1.60
## operatorBob:resinR6 operatorCarl:resinR6 operatorBob:resinR7
## 1.30 0.50 0.45
## operatorCarl:resinR7 operatorBob:resinR8 operatorCarl:resinR8
## 0.85 0.50 -2.70
This model allows for interactions between the two factors. There are
enough predictors in the model matrix that each combination of levels
can take on a distinct value. So we now have
- “(Intercept)” is particle size for Alice and R1
- “operatorBob” is the step in particle size from Alice to Bob, for
R1
- “operatorCarl” is the step in particle size from Alice to Carl, for
R1
- “resinR2” is the step in particle size from R1 to R2, for Alice
- (etc)
- “operatorBob:resinR2” is the particle size for Bob and R2, relative
to
(Intercept)+operatorBob+resinR2
.
- (etc)
anova(pvcfit1, pvcfit2)
## Analysis of Variance Table
##
## Model 1: psize ~ operator + resin
## Model 2: psize ~ operator * resin
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 49.815
## 2 24 35.480 14 14.335 0.6926 0.7599
We do not reject the main effects model. Our data does not provide
evidence that the interaction model is needed. Until fresh data demands
that we need an interaction model, we will proceed with the main effects
model.
Note: Ideally we would have checked for evidence of an interaction
effect before examining the main effects model in the previous
section.
Contrasts and
confidence intervals
anova( )
lets us test if a particular factor or
interaction is needed at all, and summary( )
allows us to
see if any levels of a factor differ from the first level. However we
may wish to perform different comparisons of the levels of a factor –
this is called a “contrast”. We might also want to test some more
complicated combination of coefficients such as a difference between two
hypothetical individuals. In general this is called a “linear
hypothesis” or a “general linear hypothesis”.
Say we want to compare Bob and Carl’s particle sizes. We will use the
pvcfit1
model.
coef(pvcfit1)
## (Intercept) operatorBob operatorCarl resinR2 resinR3 resinR4
## 36.2395833 -0.2625000 -1.5062500 -1.0333333 -5.8000000 -6.1833333
## resinR5 resinR6 resinR7 resinR8
## -4.8000000 -5.4500000 -2.9166667 -0.1833333
K <- rbind(Carl_vs_Bob = c(0, -1,1, 0,0,0,0,0,0,0))
K %*% coef(pvcfit1)
## [,1]
## Carl_vs_Bob -1.24375
This is the estimated difference in particle size between Carl and
Bob, but can we trust it? The glht
function from
multcomp
can tell us. GLHT stands for General Linear
Hypothesis Test.
library(multcomp)
result <- glht(pvcfit1, K)
result
##
## General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate
## Carl_vs_Bob == 0 -1.244
summary(result)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Carl_vs_Bob == 0 -1.2437 0.4048 -3.072 0.00391 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(result)
##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Quantile = 2.0244
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## Carl_vs_Bob == 0 -1.2437 -2.0632 -0.4243
glht
can test multiple hypotheses at once. By default it
applies a multiple testing correction when doing so. This is a
generalization of Tukey’s Honestly Significant Differences.
K <- rbind(
Bob_vs_Alice = c(0, 1,0, 0,0,0,0,0,0,0),
Carl_vs_Alice = c(0, 0,1, 0,0,0,0,0,0,0),
Carl_vs_Bob = c(0, -1,1, 0,0,0,0,0,0,0))
result <- glht(pvcfit1, K)
summary(result)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Bob_vs_Alice == 0 -0.2625 0.4048 -0.648 0.79431
## Carl_vs_Alice == 0 -1.5063 0.4048 -3.721 0.00177 **
## Carl_vs_Bob == 0 -1.2437 0.4048 -3.072 0.01056 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(result)
##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Quantile = 2.4397
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## Bob_vs_Alice == 0 -0.2625 -1.2501 0.7251
## Carl_vs_Alice == 0 -1.5063 -2.4938 -0.5187
## Carl_vs_Bob == 0 -1.2437 -2.2313 -0.2562
plot(result)
We can also turn off the multiple testing correction.
summary(result, test=adjusted("none"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Bob_vs_Alice == 0 -0.2625 0.4048 -0.648 0.52059
## Carl_vs_Alice == 0 -1.5063 0.4048 -3.721 0.00064 ***
## Carl_vs_Bob == 0 -1.2437 0.4048 -3.072 0.00391 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
A reasonable compromise between these extremes is Benjamini and
Hochberg’s False Discovery Rate (FDR) correction.
summary(result, test=adjusted("fdr"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Bob_vs_Alice == 0 -0.2625 0.4048 -0.648 0.52059
## Carl_vs_Alice == 0 -1.5063 0.4048 -3.721 0.00192 **
## Carl_vs_Bob == 0 -1.2437 0.4048 -3.072 0.00587 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- fdr method)
Finally, we can ask if any of the linear combinations is
non-zero, i.e. whether the model with all three constraints applied can
be rejected. This is equivalent to the anova( )
tests we
have done earlier. (Note that while we have three constraints, the
degrees of freedom reduction is 2, as any 2 of the constraints are
sufficient. This makes me uneasy as it is reliant on numerical accuracy,
better to just use any two of the constraints.)
summary(result, test=Ftest())
##
## General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate
## Bob_vs_Alice == 0 -0.2625
## Carl_vs_Alice == 0 -1.5063
## Carl_vs_Bob == 0 -1.2437
##
## Global Test:
## F DF1 DF2 Pr(>F)
## 1 7.902 2 38 0.00135
pvcfit0 <- lm(psize ~ resin, data=pvc)
anova(pvcfit0, pvcfit1)
## Analysis of Variance Table
##
## Model 1: psize ~ resin
## Model 2: psize ~ operator + resin
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 40 70.533
## 2 38 49.815 2 20.718 7.902 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This demonstrates that the two methods of testing hypotheses–with the
ANOVA test and with linear hypotheses–are equivalent.
Heteroscedasticity
ggplot(pvc, aes(x=resin, y=residuals(pvcfit1))) +
geom_point() + geom_hline(yintercept=0) + facet_grid(~operator)
Our assumption that the residual noise is uniformly normally
distributed may not hold. Carl’s data seems to have greater standard
deviation than Alice or Bob’s. When comparing Alice and Bob’s results,
including Carl’s data in the model may alter the outcome.
Challenge - examine contrasts
Using the pvcfit1
model, construct linear hypotheses to
see if the effect of:
- R8 is different to R4
- R2 is different to R1
Easier ways to
specify contrasts
So far we have been manually constructing linear hypotheses. The
multcomp
package automates this for some common situations.
To compare all pairs of levels within a factor, use the mcp
function, giving the factor to test as the name of an argument and
specifying to test “Tukey” contrasts:
result <- glht(pvcfit1, mcp(operator="Tukey"))
To compare the first level of a factor to all other levels, specify
to test “Dunnett” contrasts:
result <- glht(pvcfit1, mcp(operator="Dunnett"))
The linear hypotheses actually used can be inspected with
result$linfct
.
The emmeans
package also automates many common
comparisons.
library(emmeans)
# Mean of the predictions for a set of typical individuals.
# Not the mean of your input data!
operator_means <- emmeans(pvcfit1, ~ operator)
operator_means
## operator emmean SE df lower.CL upper.CL
## Alice 32.9 0.286 38 32.4 33.5
## Bob 32.7 0.286 38 32.1 33.3
## Carl 31.4 0.286 38 30.9 32.0
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
# Differences in the mean of predictions between groups.
operator_pairwise <- emmeans(pvcfit1, pairwise ~ operator)
operator_pairwise
## $emmeans
## operator emmean SE df lower.CL upper.CL
## Alice 32.9 0.286 38 32.4 33.5
## Bob 32.7 0.286 38 32.1 33.3
## Carl 31.4 0.286 38 30.9 32.0
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## Alice - Bob 0.263 0.405 38 0.648 0.7944
## Alice - Carl 1.506 0.405 38 3.721 0.0018
## Bob - Carl 1.244 0.405 38 3.072 0.0107
##
## Results are averaged over the levels of: resin
## P value adjustment: tukey method for comparing a family of 3 estimates
confint(operator_pairwise$contrasts)
## contrast estimate SE df lower.CL upper.CL
## Alice - Bob 0.263 0.405 38 -0.725 1.25
## Alice - Carl 1.506 0.405 38 0.519 2.49
## Bob - Carl 1.244 0.405 38 0.257 2.23
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
operator_vs_control <- emmeans(pvcfit1, trt.vs.ctrl ~ operator)
operator_vs_control
## $emmeans
## operator emmean SE df lower.CL upper.CL
## Alice 32.9 0.286 38 32.4 33.5
## Bob 32.7 0.286 38 32.1 33.3
## Carl 31.4 0.286 38 30.9 32.0
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## Bob - Alice -0.263 0.405 38 -0.648 0.7349
## Carl - Alice -1.506 0.405 38 -3.721 0.0013
##
## Results are averaged over the levels of: resin
## P value adjustment: dunnettx method for 2 tests
confint(operator_vs_control$contrasts)
## contrast estimate SE df lower.CL upper.CL
## Bob - Alice -0.263 0.405 38 -1.20 0.672
## Carl - Alice -1.506 0.405 38 -2.44 -0.571
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
## Conf-level adjustment: dunnettx method for 2 estimates
emmeans
does not calculate means from your data
directly. Instead it calculates the mean of the predictions for a set of
typical individuals (the “reference grid”). If individuals in your
groups are affected by known variables, and you include these variables
as predictors in your model, emmeans
will naturally adjust
for these to provide fair comparisons between groups.
By default emmeans
will use a reference grid with one
individual for each combination of the levels of the factors in your
model. For any numeric predictor, the average of the input data is used.
It is up to you to decide if this is sensible! This approach provides
reasonable means even in models with interaction terms.
emmeans
is a super cool package, and has great
documentation. In practice, this is usually our preferred option.
Gene expression
example
Tooth growth in mouse embryos is studied using RNA-Seq. The RNA
expression levels of several genes are examined in the cells that form
the upper and lower first molars, in eight individual mouse embryos that
have been dissected after different times of embryo development. The
measurements are in terms of “Reads Per Million”, essentially the
fraction of RNA in each sample belonging to each gene, times 1
million.
(This data was extracted from ARCHS4 (https://amp.pharm.mssm.edu/archs4/). In the Gene
Expression Omnibus it is entry GSE76316. The sample descriptions in GEO
seem to be out of order, but reading the associated paper and the genes
they talk about I think I have the correct order of
samples!)
teeth <- read_csv("r-linear-files/teeth.csv")
## Rows: 16 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sample, mouse, tooth
## dbl (5): day, gene_ace, gene_smoc1, gene_pou3f3, gene_wnt2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
teeth$tooth <- factor(teeth$tooth, levels=c("lower","upper"))
teeth$mouse <- factor(teeth$mouse)
It will be convenient to have a quick way to examine different genes
and different models with this data.
# A convenience to examine different model fits
more_data <- expand.grid(
day=seq(14.3,18.2,by=0.01),
tooth=factor(c("lower","upper"), levels=c("lower","upper")))
look <- function(y, fit=NULL) {
p <- ggplot(teeth,aes(x=day,group=tooth))
if (!is.null(fit)) {
more_ci <- cbind(
more_data,
predict(fit, more_data, interval="confidence"))
p <- p +
geom_ribbon(aes(ymin=lwr,ymax=upr), data=more_ci, alpha=0.1) +
geom_line(aes(y=fit,color=tooth), data=more_ci)
}
p + geom_point(aes(y=y,color=tooth)) +
labs(y=deparse(substitute(y)))
}
# Try it out
look(teeth$gene_ace)
We could treat day
as a categorical variable, as in the
previous section. However let us treat it as numerical, and see where
that leads.
Transformation
Ace gene
acefit <- lm(gene_ace ~ tooth + day, data=teeth)
look(teeth$gene_ace, acefit)
Two problems:
- The actual data appears to be curved, our straight lines are not a
good fit.
- The predictions fall below zero, a physical impossibility.
In this case, log transformation of the data will solve both these
problems.
log2_acefit <- lm( log2(gene_ace) ~ tooth + day, data=teeth)
look(log2(teeth$gene_ace), log2_acefit)
Various transformations of y are possible. Log transformation is
commonly used in the context of gene expression. Square root
transformation can also be appropriate with nicely behaved count data
(technically, if the errors follow a Poisson distribution). This gene
expression data is ultimately count based, but is overdispersed compared
to the Poisson distribution so square root transformation isn’t
appropriate in this case. The Box-Cox transformations provide a spectrum
of further options.
Pou3f3 gene
In the case of the Pou3f3 gene, the log transformation is even more
important. It looks like gene expression changes at different rates in
the upper and lower molars, that is there is a significant interaction
between tooth
and day
.
pou3f3fit0 <- lm(gene_pou3f3 ~ tooth + day, data=teeth)
look(teeth$gene_pou3f3, pou3f3fit0)
pou3f3fit1 <- lm(gene_pou3f3 ~ tooth * day, data=teeth)
look(teeth$gene_pou3f3, pou3f3fit1)
anova(pou3f3fit0, pou3f3fit1)
## Analysis of Variance Table
##
## Model 1: gene_pou3f3 ~ tooth + day
## Model 2: gene_pou3f3 ~ tooth * day
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 13 4849.7
## 2 12 415.2 1 4434.5 128.15 9.234e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(pou3f3fit1)["toothupper:day",]
## 2.5 % 97.5 %
## -34.65685 -23.46937
The slopes of the lines confidently differ by at least 23.5 RPM per
day.
Examining the residuals reveals a further problem: larger expression
values are associated with larger residuals.
look(residuals(pou3f3fit1))
plot(predict(pou3f3fit1), residuals(pou3f3fit1))
qqnorm(residuals(pou3f3fit1))
qqline(residuals(pou3f3fit1))
Log transformation both removes the interaction and makes the
residuals more uniform (except for one outlier).
log2_pou3f3fit0 <- lm(log2(gene_pou3f3) ~ tooth + day, data=teeth)
log2_pou3f3fit1 <- lm(log2(gene_pou3f3) ~ tooth * day, data=teeth)
anova(log2_pou3f3fit0, log2_pou3f3fit1)
## Analysis of Variance Table
##
## Model 1: log2(gene_pou3f3) ~ tooth + day
## Model 2: log2(gene_pou3f3) ~ tooth * day
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 13 0.56714
## 2 12 0.56579 1 0.0013576 0.0288 0.8681
confint(log2_pou3f3fit1)["toothupper:day",]
## 2.5 % 97.5 %
## -0.2225600 0.1903981
The difference of the slopes of log2 gene expression between lower
and upper molars is confidently between -0.22 and 0.19.
look(log2(teeth$gene_pou3f3), log2_pou3f3fit0)
qqnorm(residuals(log2_pou3f3fit0))
qqline(residuals(log2_pou3f3fit0))
Curve fitting
Smoc1 gene
log2_smoc1fit <- lm(log2(gene_smoc1) ~ tooth + day, data=teeth)
look(log2(teeth$gene_smoc1), log2_smoc1fit)
In this case, log transformation does not remove the curve. If you
think this is a problem for linear models, you are mistaken!
With a little feature engineering we can fit a quadratic curve.
Calculations can be included in the formula if wrapped in
I( )
:
curved_fit <- lm(log2(gene_smoc1) ~ tooth + day + I(day^2), data=teeth)
look(log2(teeth$gene_smoc1), curved_fit)
Another way to do this would be to add the column to the data
frame:
teeth$day_squared <- teeth$day^2
curved_fit2 <- lm(log2(gene_smoc1) ~ tooth + day + day_squared, data=teeth)
Finally, the poly( )
function can be used in a formula
to fit polynomials of arbitrary degree. poly
will encode
day
slightly differently, but produces an equivalent
fit.
curved_fit3 <- lm(log2(gene_smoc1) ~ tooth + poly(day,2), data=teeth)
sigma(curved_fit)
## [1] 0.1630425
sigma(curved_fit2)
## [1] 0.1630425
sigma(curved_fit3)
## [1] 0.1630425
poly( )
can also be used to fit higher order
polynomials, but these tend to become wobbly and extrapolate poorly. A
better option may be to use the ns( )
or bs( )
functions in the splines
package, which can be used to fit
piecewise “B-splines”. In particular ns( )
(natural spline)
is appealing because it extrapolates beyond the ends only with straight
lines. If the data is cyclic (for example cell cycle or circadian time
series), sine and cosine terms can be used to fit some number of
harmonics from a Fourier series.
library(splines)
spline_fit <- lm(log2(gene_smoc1) ~ tooth * ns(day,3), data=teeth)
look(log2(teeth$gene_smoc1), spline_fit)
Day is confounded
with mouse
There may be individual differences between mice. We would like to
take this into our account in a model. In general it is common to
include batch effect terms in a model, to correctly model the data (and
obtain smaller p-values), even if they are not directly of interest.
badfit <- lm(log2(gene_ace) ~ tooth + day + mouse, data=teeth)
summary(badfit)
##
## Call:
## lm(formula = log2(gene_ace) ~ tooth + day + mouse, data = teeth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20562 -0.02335 0.00000 0.02335 0.20562
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.115920 0.658430 -19.920 2.01e-07 ***
## toothupper -0.280467 0.070399 -3.984 0.0053 **
## day 0.992214 0.040228 24.665 4.59e-08 ***
## mouseind2 0.100073 0.131897 0.759 0.4728
## mouseind3 -0.198260 0.125612 -1.578 0.1585
## mouseind4 -0.122056 0.122349 -0.998 0.3517
## mouseind5 -0.203226 0.122349 -1.661 0.1407
## mouseind6 -0.009452 0.125612 -0.075 0.9421
## mouseind7 -0.042441 0.131897 -0.322 0.7570
## mouseind8 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1408 on 7 degrees of freedom
## Multiple R-squared: 0.9934, Adjusted R-squared: 0.9859
## F-statistic: 131.9 on 8 and 7 DF, p-value: 6.131e-07
In this case this is not possible, and R has arbitrarily dropped a
predictor from the model. As a different mouse produced data for each
different day, mouse
is confounded with day
.
The model matrix suffers from multicollinearity: the day
column can be constructed as a linear combination of the intercept
column and the mouse
columns. There is no single best
choice of coefficients.
summary( lm(day ~ mouse, data=teeth) )
## Warning in summary.lm(lm(day ~ mouse, data = teeth)): essentially perfect fit:
## summary may be unreliable
##
## Call:
## lm(formula = day ~ mouse, data = teeth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.766e-16 -9.130e-18 0.000e+00 9.130e-18 4.766e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.450e+01 2.214e-16 6.550e+16 <2e-16 ***
## mouseind2 5.000e-01 3.131e-16 1.597e+15 <2e-16 ***
## mouseind3 1.000e+00 3.131e-16 3.194e+15 <2e-16 ***
## mouseind4 1.500e+00 3.131e-16 4.791e+15 <2e-16 ***
## mouseind5 2.000e+00 3.131e-16 6.388e+15 <2e-16 ***
## mouseind6 2.500e+00 3.131e-16 7.985e+15 <2e-16 ***
## mouseind7 3.000e+00 3.131e-16 9.582e+15 <2e-16 ***
## mouseind8 3.500e+00 3.131e-16 1.118e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.131e-16 on 8 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.06e+31 on 7 and 8 DF, p-value: < 2.2e-16
Another example of confounding would be an experiment in which each
treatment is done in a separate batch.
Even if predictors are not perfectly multicollinear, correlation
between predictors can make their coefficient estimates inaccurate. One
way to check for this is to attempt to predict each of the predictors
with a linear model that uses the remaining predictors (see “Variance
Inflation Factor”).
A possible solution to this problem would be to use a “mixed model”,
but this is beyond the scope of today’s workshop.
Challenge - Wnt2 gene (20 minutes)
Look at the expression of gene Wnt2 in column
gene_wnt2
.
Try some different model formulas.
Justify a particular model by rejecting simpler alternatives
using anova( )
.
Some notes:
I haven’t given you a way to compare different transformations of
the response variable gene_wnt2
. This is a tricky problem!
Here I ask you to eyeball the data or trust me when I say log
transformation is appropriate.
An alternative approach is to rank models by the Akaike
Information Criterion (AIC) or similar. AIC lets you compare models even
if they are not nested (but you still can’t compare different
transformations of the response variable).
Mechanical step-wise comparison of models based on many possible
variables raises problems similar to multiple testing, and will not
properly control statistical significance. Use significance tests (or
AIC) in moderation, taking into account your understanding of the
domain. If there are nuisance variables that you are not sure if you
need to adjust for, it is ok to leave them in a model whether or not
they appear significant.
Testing many genes with
limma
In this section we look at fitting the same matrix of predictors X to
many different sets of responses y. We will use the package
limma
from Bioconductor. This is a very brief
demonstration, and there is much more to this package. See the RNAseq123
tutorial at https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html
and the usersguide.pdf at https://bioconductor.org/packages/release/bioc/html/limma.html
library(edgeR)
library(limma)
Load the data
Actually in the teeth dataset, the expression level of all genes was
measured!
counts_df <- read_csv("r-linear-files/teeth-read-counts.csv")
counts <- as.matrix( select(counts_df, -gene) )
rownames(counts) <- counts_df$gene
dim(counts)
## [1] 32544 16
counts[1:5,]
## ind1lower ind1upper ind2lower ind2upper ind3lower ind3upper
## 0610007P14Rik 1721 1846 1708 1877 1443 1534
## 0610009B22Rik 512 466 554 504 501 498
## 0610009L18Rik 68 71 121 129 74 124
## 0610009O20Rik 2583 2797 2840 2975 2675 2690
## 0610010F05Rik 1079 1246 1262 992 1232 992
## ind4lower ind4upper ind5lower ind5upper ind6lower ind6upper
## 0610007P14Rik 1389 1439 1699 1881 1697 1716
## 0610009B22Rik 478 427 569 614 577 568
## 0610009L18Rik 83 87 86 125 99 135
## 0610009O20Rik 2429 2313 2693 3093 2854 3066
## 0610010F05Rik 1011 977 1175 1384 1235 1221
## ind7lower ind7upper ind8lower ind8upper
## 0610007P14Rik 1531 1625 1716 1373
## 0610009B22Rik 587 531 611 453
## 0610009L18Rik 130 132 114 92
## 0610009O20Rik 2487 2904 2713 2455
## 0610010F05Rik 1042 998 1079 882
Model matrix
The column names match our teeth
data frame.
colnames(counts)
## [1] "ind1lower" "ind1upper" "ind2lower" "ind2upper" "ind3lower" "ind3upper"
## [7] "ind4lower" "ind4upper" "ind5lower" "ind5upper" "ind6lower" "ind6upper"
## [13] "ind7lower" "ind7upper" "ind8lower" "ind8upper"
teeth$sample
## [1] "ind1lower" "ind1upper" "ind2lower" "ind2upper" "ind3lower" "ind3upper"
## [7] "ind4lower" "ind4upper" "ind5lower" "ind5upper" "ind6lower" "ind6upper"
## [13] "ind7lower" "ind7upper" "ind8lower" "ind8upper"
We will use limma to fit a linear model to each gene. The same model
formula will be used in each case. limma doesn’t automatically convert a
formula into a model matrix, so we have to do this step manually. Here I
am using a model formula that treats the upper and lower teeth as
following a different linear trend over time.
X <- model.matrix(~ tooth * day, data=teeth)
X
## (Intercept) toothupper day toothupper:day
## 1 1 0 14.5 0.0
## 2 1 1 14.5 14.5
## 3 1 0 15.0 0.0
## 4 1 1 15.0 15.0
## 5 1 0 15.5 0.0
## 6 1 1 15.5 15.5
## 7 1 0 16.0 0.0
## 8 1 1 16.0 16.0
## 9 1 0 16.5 0.0
## 10 1 1 16.5 16.5
## 11 1 0 17.0 0.0
## 12 1 1 17.0 17.0
## 13 1 0 17.5 0.0
## 14 1 1 17.5 17.5
## 15 1 0 18.0 0.0
## 16 1 1 18.0 18.0
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$tooth
## [1] "contr.treatment"
limma
refers to this matrix as the “design”.
Remove low expression
genes
There is little chance of detecting differential expression in genes
with very low read counts. Including these genes will require a larger
False Discovery Rate correction, and also confuses limma’s Empirical
Bayes parameter estimation. edgeR
provides a function
filterByExpr
that performs their recommended level of
filtering. The filtering takes into account the complexity of the model
being fitted, so the “design” matrix is also needed.
keep <- filterByExpr(counts, design=X)
table(keep)
## keep
## FALSE TRUE
## 15333 17211
counts_kept <- counts[keep, ]
dim(counts_kept)
## [1] 17211 16
Fitting a model to
and testing each gene
We are now ready to fit a linear model to each gene.
fit <- lmFit(log2_cpms, X)
class(fit)
## [1] "MArrayLM"
## attr(,"package")
## [1] "limma"
fit$coefficients[1:5,]
## (Intercept) toothupper day toothupper:day
## 0610007P14Rik 5.807656 1.3158722 -0.03179865 -0.08061826
## 0610009B22Rik 3.136273 0.5999070 0.03696200 -0.04605341
## 0610009L18Rik -0.902155 1.4890594 0.13042420 -0.08089679
## 0610009O20Rik 6.611072 0.2355852 -0.03671146 -0.01261433
## 0610010F05Rik 5.819858 0.2543831 -0.06338421 -0.02250952
Significance testing in limma is by the use of linear hypotheses
(which limma refers to as “contrasts”). A difference between
glht
and limma’s contrasts.fit
is that limma
expects the linear combinations to be in columns rather than rows.
We will first look for genes where the slope over time is not flat,
averaging the lower and upper teeth.
# Lower slope: c(0,0,1,0)
# Upper slope: c(0,0,1,1)
K <- rbind( average_slope=c(0,0,1,0.5) )
cfit <- contrasts.fit(fit, t(K)) #linear combinations in columns!
efit <- eBayes(cfit, trend=TRUE)
The call to eBayes
does Empirical Bayes squeezing of the
residual variance for each gene (see Appendix B). This is a bit of magic
that allows limma to work well with small numbers of samples.
topTable(efit)
## logFC AveExpr t P.Value adj.P.Val B
## Rnf144b 0.6705081 4.483834 33.64061 8.221793e-16 1.415053e-11 25.97540
## Tfap2b -0.6416163 7.468730 -28.77129 8.766956e-15 3.581088e-11 23.92465
## Six2 -0.5576619 6.978666 -28.60904 9.548454e-15 3.581088e-11 23.84856
## Vldlr 0.4147078 6.814521 28.51996 1.000874e-14 3.581088e-11 23.80655
## Prom2 0.9165496 3.676863 28.27714 1.138751e-14 3.581088e-11 23.69118
## Ace 0.9858071 2.813200 28.06009 1.279153e-14 3.581088e-11 23.58699
## Stk32a 1.2259001 3.708815 27.81963 1.456488e-14 3.581088e-11 23.47034
## Bambi 0.5612789 6.104506 26.74115 2.643012e-14 5.686111e-11 22.93116
## Msx2 0.8126541 6.253647 25.67587 4.872568e-14 9.317974e-11 22.37141
## Cox4i2 0.5686825 3.809406 24.67488 8.854784e-14 1.523997e-10 21.81903
The column adj.P.Val
contains FDR adjusted p-values.
all_results <- topTable(efit, n=Inf)
significant <- all_results$adj.P.Val <= 0.05
table(significant)
## significant
## FALSE TRUE
## 10423 6788
ggplot(all_results, aes(x=AveExpr, y=logFC)) +
geom_point(size=0.1) +
geom_point(data=all_results[significant,], size=0.5, color="red")
Relation to lm( ) and
glht( )
Let’s look at a specific gene.
rnf144b <- log2_cpms["Rnf144b",]
rnf144b_fit <- lm(rnf144b ~ tooth * day, data=teeth)
look(rnf144b, rnf144b_fit)
We can use the same linear hypothesis with glht
. The
estimate is the same, but limma has gained some power by shrinking the
variance toward the trend line, so limma’s p-value is smaller.
summary( glht(rnf144b_fit, K) )
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = rnf144b ~ tooth * day, data = teeth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## average_slope == 0 0.67051 0.01875 35.77 1.46e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Confidence
intervals
Confidence Intervals should also be of interest. However note that
these are not adjusted for multiple testing (see Appendix B).
topTable(efit, confint=0.95)
## logFC CI.L CI.R AveExpr t P.Value
## Rnf144b 0.6705081 0.6281126 0.7129035 4.483834 33.64061 8.221793e-16
## Tfap2b -0.6416163 -0.6890509 -0.5941817 7.468730 -28.77129 8.766956e-15
## Six2 -0.5576619 -0.5991235 -0.5162002 6.978666 -28.60904 9.548454e-15
## Vldlr 0.4147078 0.3837783 0.4456372 6.814521 28.51996 1.000874e-14
## Prom2 0.9165496 0.8476051 0.9854941 3.676863 28.27714 1.138751e-14
## Ace 0.9858071 0.9110794 1.0605348 2.813200 28.06009 1.279153e-14
## Stk32a 1.2259001 1.1321692 1.3196310 3.708815 27.81963 1.456488e-14
## Bambi 0.5612789 0.5166334 0.6059244 6.104506 26.74115 2.643012e-14
## Msx2 0.8126541 0.7453317 0.8799765 6.253647 25.67587 4.872568e-14
## Cox4i2 0.5686825 0.5196602 0.6177048 3.809406 24.67488 8.854784e-14
## adj.P.Val B
## Rnf144b 1.415053e-11 25.97540
## Tfap2b 3.581088e-11 23.92465
## Six2 3.581088e-11 23.84856
## Vldlr 3.581088e-11 23.80655
## Prom2 3.581088e-11 23.69118
## Ace 3.581088e-11 23.58699
## Stk32a 3.581088e-11 23.47034
## Bambi 5.686111e-11 22.93116
## Msx2 9.317974e-11 22.37141
## Cox4i2 1.523997e-10 21.81903
F test
limma can also test several simultaneous constraints on linear
combinations of coefficients. Suppose we want to find any
deviation from a constant expression level. We can check for this
with:
K2 <- rbind(
c(0,1,0,0),
c(0,0,1,0),
c(0,0,0,1))
cfit2 <- contrasts.fit(fit, t(K2))
efit2 <- eBayes(cfit2, trend=TRUE)
topTable(efit2)
## Coef1 Coef2 Coef3 AveExpr F P.Value
## Pou3f3 3.9677657 -0.3367111 -0.01422066 5.1309354 508.0331 1.388933e-15
## Rnf144b -1.4817168 0.6363724 0.06827130 4.4838343 400.3615 8.469656e-15
## Six2 -0.8370468 -0.6082833 0.10124287 6.9786656 384.2097 1.157124e-14
## Tfap2b -4.3550546 -0.7909806 0.29872853 7.4687297 322.7085 4.330436e-14
## Prom2 0.5804985 0.9612842 -0.08946912 3.6768631 313.2648 5.419522e-14
## Vldlr 2.1955450 0.4751431 -0.12087066 6.8145206 292.9660 8.985449e-14
## Nkx2-3 -12.4481924 -0.4507673 0.20535681 0.4715549 280.1517 1.258886e-13
## Ace -0.5231634 0.9784323 0.01474955 2.8132003 266.6050 1.828639e-13
## Stk32a 0.8473950 1.2638212 -0.07584210 3.7088148 263.0716 2.021843e-13
## Bambi 1.5414027 0.6171353 -0.11171292 6.1045062 251.5382 2.832907e-13
## adj.P.Val
## Pou3f3 2.390492e-11
## Rnf144b 6.638423e-11
## Six2 6.638423e-11
## Tfap2b 1.863278e-10
## Prom2 1.865508e-10
## Vldlr 2.577476e-10
## Nkx2-3 3.095240e-10
## Ace 3.866438e-10
## Stk32a 3.866438e-10
## Bambi 4.875716e-10
A shortcut here would be to skip the contrasts.fit
step
and specify a set of coefficients directly to
topTable( )
.
Further exercises
Construct and use linear combinations to find genes that:
A. Differ in slope between lower and upper molars.
B. Differ in expression on day 16 between the lower and upper molars.
(Hint: this can be viewed as the difference in predictions between two
individual samples.)
C. Construct a pair of linear combinations that when used together in
an F test find genes with non-zero slope in either or both the lower and
upper molars.
Construct a model matrix in which the change over time is
quadratic, and some linear combinations to interrogate it.
Appendix A - mixed
models
In this workshop we have looked at linear models with fixed effects
only. A linear mixed model contains both fixed and random effects.
Random effects are appropriate for individuals drawn from a population.
The model we obtain will include how much variation there is in this
population.
Mixed models would be a whole further workshop, but I will attempt a
rough outline here.
In R, mixed models are most commonly fitted using the
lmer
function in lme4
. A point of confusion
you will encounter is that there are a variety of ways to perform
significance tests. The authors of lme4
duck out of this
debate entirely and only provide t statistics. However packages such as
lmerTest
, emmeans
and pbkrtest
can provide p-values and CIs from an lmer
model.
My recommendation of a safe choice, from my reading, is to fit a
model using the REML method (the default for lmer
) and then
test it using the Kenward-Roger method. The Satterthwaite method is also
decent, usually producing nearly identical results with less
computation. Other methods may produce p-values that are too small with
small data sets. This includes the multcomp
package that we
used earlier, unfortunately. For more information see:
Luke, S.G. (2017). Evaluating significance in linear mixed-effects
models in R. Behav Res 49, 1494–1502 https://doi.org/10.3758/s13428-016-0809-y
The differences between methods disappear when your data is large
enough, but “large enough” might mean not just many rows in your data
frame but also many levels for your random effects.
I will briefly demonstrate lmer
with the PVC particle
dataset, now treating resin batches as a random effect.
library(lmerTest) # lmerTest builds on lme4, can produce p-values
library(emmeans) # emmeans can provide Kenward-Roger CIs
pvc <- read_csv("r-linear-files/pvc.csv")
pvc$operator <- factor(pvc$operator)
pvc$resin <- factor(pvc$resin)
pvcfit_mixed <- lmer(psize ~ operator + (1|resin), data=pvc, REML=TRUE)
pvcfit_mixed
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: psize ~ operator + (1 | resin)
## Data: pvc
## REML criterion at convergence: 172.2304
## Random effects:
## Groups Name Std.Dev.
## resin (Intercept) 2.558
## Residual 1.145
## Number of obs: 48, groups: resin, 8
## Fixed Effects:
## (Intercept) operatorBob operatorCarl
## 32.9438 -0.2625 -1.5063
We use new ( | )
notation to specify random effects.
Right of the |
you give the factor specifying individuals
from a population. Left of the |
you give predictors whose
coefficients may vary between individuals. In the example above we are
saying that the intercept may vary between individuals.
The mean about which individuals vary must also be included as a
fixed effect in the model! Here an intercept term is implicitly included
in the model as usual. However if your random effect specified a slope
or some-such it would also need to be given as a fixed effect. Otherwise
you would be saying the slope varies around zero.
A standard deviation has been estimated for the effect of different
batches of resin on particle size, as well as the usual residual
standard deviation. This is an extra source of random variation around
the prediction from the fixed effect part of the model.
drop1
from lmerTest
can be used to test
whether fixed effects are statistically significant.
drop1(pvcfit_mixed, ddf="Kenward-Roger")
## Single term deletions using Kenward-Roger's method:
##
## Model:
## psize ~ operator + (1 | resin)
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## operator 20.718 10.359 2 38 7.902 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ranova
from lmerTest
can be used to test
whether random effects are statistically significant.
ranova(pvcfit_mixed)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## psize ~ operator + (1 | resin)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 5 -86.115 182.23
## (1 | resin) 4 -113.096 234.19 53.961 1 2.045e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans
can be used on the fixed effects of the
model.
result <- emmeans(pvcfit_mixed, pairwise ~ operator, lmer.df="Kenward-Roger")
result$emmeans
## operator emmean SE df lower.CL upper.CL
## Alice 32.9 0.949 7.93 30.8 35.1
## Bob 32.7 0.949 7.93 30.5 34.9
## Carl 31.4 0.949 7.93 29.2 33.6
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
confint(result$contrasts)
## contrast estimate SE df lower.CL upper.CL
## Alice - Bob 0.263 0.405 38 -0.725 1.25
## Alice - Carl 1.506 0.405 38 0.519 2.49
## Bob - Carl 1.244 0.405 38 0.257 2.23
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
For comparison, here are similar results from the fixed effects
model:
pvcfit_fixed <- lm(psize ~ operator + resin, data=pvc)
drop1(pvcfit_fixed, test="F")
## Single term deletions
##
## Model:
## psize ~ operator + resin
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 49.82 21.782
## operator 2 20.718 70.53 34.474 7.902 0.00135 **
## resin 7 283.946 333.76 99.083 30.943 8.111e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
result_fixed <- emmeans(pvcfit_fixed, pairwise ~ operator)
result_fixed$emmeans
## operator emmean SE df lower.CL upper.CL
## Alice 32.9 0.286 38 32.4 33.5
## Bob 32.7 0.286 38 32.1 33.3
## Carl 31.4 0.286 38 30.9 32.0
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
confint(result_fixed$contrasts)
## contrast estimate SE df lower.CL upper.CL
## Alice - Bob 0.263 0.405 38 -0.725 1.25
## Alice - Carl 1.506 0.405 38 0.519 2.49
## Bob - Carl 1.244 0.405 38 0.257 2.23
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
The simpler fixed effects model can be a useful check that you are
fitting the mixed effects model you intend. The results will usually be
similar, and any differences may be instructive. A fixed effect model
might be chosen either by ignoring random effects or including them as
fixed effects. In this example we see:
The pairwise comparisons are identical.
CIs for the means for each operator are wider for the mixed
model. This is because they are an estimate of the mean for the whole
population of resins. For the fixed effects model, we get the CI for the
mean of exactly the eight resins present in the data.
Appendix B - more on
limma
Empirical Bayes
variance squeezing
In limma, Empirical Bayes squeezing of the residual variance acts as
though we have some number of extra “prior” observations of the
variance. These are also counted as extra degrees of freedom in F tests.
The “prior” observations act to squeeze the estimated residual variance
toward a trend line that is a function of the average expression
level.
efit <- eBayes(cfit, trend=TRUE)
efit$df.prior
## [1] 3.36434
efit$df.residual
## [1] 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
## [ reached getOption("max.print") -- omitted 17190 entries ]
efit$df.total
## [1] 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434
## [9] 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434
## [17] 15.36434 15.36434 15.36434 15.36434 15.36434
## [ reached getOption("max.print") -- omitted 17190 entries ]
plotSA(efit)
points(efit$Amean, efit$s2.post^0.25, col="red", cex=0.2)
The total effective degrees of freedom is the “prior” degrees of
freedom plus the normal residual degrees of freedom. As can be seen in
the plot, compared to the residual variance (black dots), this produces
a posterior residual variance (efit$s2.post
, red dots) that
is squeezed toward the trend line.
It’s worthwhile checking df.prior
when using limma, as a
low value may indicate a problem with a data-set.
Top confident effect
sizes
A little self-promotion: With limma we are able to find genes where
our effect of interest is significantly different from zero. However we
may make many thousands of discoveries, too many to easily follow up,
and some of the effects discovered may be tiny. I have developed a
method to rank genes by confident effect size, with multiple testing
correction, implemented in the package topconfects. This method builds
on the treat
method provided by limma.
This is performed with the limma_confects
function.
limma_confects
incorporates the Empirical Bayes variance
squeezing step, so remember to specify trend=TRUE
if using
trend based variance squeezing.
library(topconfects)
result <- limma_confects(cfit, "average_slope", trend=TRUE, fdr=0.05)
result
## $table
## confect effect AveExpr name
## 1 -1.411 -2.655 1.555 Hbb-y
## 2 1.356 2.011 4.139 Mmp13
## 3 1.318 2.302 -1.784 Scgn
## 4 1.290 1.796 2.480 Ecel1
## 5 1.248 1.843 -1.503 Sdr16c6
## 6 1.233 2.095 1.197 Tbx22
## 7 -1.201 -1.701 0.688 Hba-x
## 8 1.171 2.015 3.501 Igfbpl1
## 9 1.075 1.706 5.724 Vwde
## 10 1.075 1.694 3.209 Adgrf4
## ...
## 6788 of 17211 non-zero log2 fold change at FDR 0.05
## Prior df 3.4
Results are ranked by “confect”, confident effect size. If you select
genes with abs(confect) >=
some value, those genes will
have a magnitude of effect greater than that threshold, with controlled
FDR. (Exactly the same set of genes is found using treat
with this threshold.)
Compared to the “most significant” gene, the top gene by topconfects
has a larger slope (but lower overall average expression).
hbby <- log2_cpms["Hbb-y",]
hbby_fit <- lm(hbby ~ tooth * day, data=teeth)
look(hbby, hbby_fit)
# Highlight genes with slope confidently larger than 0.2 per day.
ggplot(result$table, aes(x=AveExpr, y=effect)) +
geom_point(size=0.1) +
geom_point(
data=filter(result$table, abs(confect) >= 0.2),
size=0.5, color="red")
See also “False Coverage-statement Rate” for a generally applicable
approach to multiple-testing corrected confidence intervals.
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## time zone: Australia/Melbourne
## tzcode source: internal
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] topconfects_1.18.0 lmerTest_3.1-3 lme4_1.1-35.3 Matrix_1.6-5
## [5] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [13] ggplot2_3.5.0 tidyverse_2.0.0 edgeR_4.0.16 limma_3.58.1
## [17] emmeans_1.10.1 multcomp_1.4-25 TH.data_1.1-2 survival_3.5-8
## [21] mvtnorm_1.2-4 MASS_7.3-60.0.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 xfun_0.43 bslib_0.7.0
## [4] lattice_0.22-6 numDeriv_2016.8-1.1 tzdb_0.4.0
## [7] vctrs_0.6.5 tools_4.3.3 generics_0.1.3
## [10] pbkrtest_0.5.2 parallel_4.3.3 sandwich_3.1-0
## [13] fansi_1.0.6 highr_0.10 pkgconfig_2.0.3
## [16] assertthat_0.2.1 lifecycle_1.0.4 farver_2.1.1
## [19] compiler_4.3.3 statmod_1.5.0 munsell_0.5.1
## [22] codetools_0.2-20 htmltools_0.5.8.1 sass_0.4.9
## [25] yaml_2.3.8 nloptr_2.0.3 pillar_1.9.0
## [28] crayon_1.5.2 jquerylib_0.1.4 cachem_1.0.8
## [31] boot_1.3-30 nlme_3.1-164 tidyselect_1.2.1
## [34] locfit_1.5-9.9 digest_0.6.35 stringi_1.8.3
## [37] labeling_0.4.3 fastmap_1.1.1 grid_4.3.3
## [40] colorspace_2.1-0 cli_3.6.2 magrittr_2.0.3
## [43] utf8_1.2.4 broom_1.0.5 withr_3.0.0
## [46] backports_1.4.1 scales_1.3.0 bit64_4.0.5
## [49] estimability_1.5 timechange_0.3.0 rmarkdown_2.26
## [52] bit_4.0.5 zoo_1.8-12 hms_1.1.3
## [55] coda_0.19-4.1 evaluate_0.23 knitr_1.46
## [58] mgcv_1.9-1 rlang_1.1.3 Rcpp_1.0.12
## [61] xtable_1.8-4 glue_1.7.0 minqa_1.2.6
## [64] vroom_1.6.5 jsonlite_1.8.8 R6_2.5.1
Home