1 Load packages

library(MASS)       # ginv -- coefficient estimation
library(splines)    # ns, bs -- spline curves
library(multcomp)   # glht -- linear hypotheses
library(emmeans)    # estimated marginal means
library(edgeR)      # cpm, etc -- RNA-Seq normalization
library(limma)      # lmFit, etc -- fitting many models
library(tidyverse)  # working with data frames, plotting

Much of what we will be using is built into R without loading any packages.

2 Vectors and matrices

2.1 Vector operations

a <- c(3,4)
b <- c(5,6)

length(a)
## [1] 2

R performs operations elementwise:

a + b
## [1]  8 10
a * b
## [1] 15 24

We will be using the dot product a lot. It can be calculated in either of these two ways:

sum(a*b)
## [1] 39
t(a) %*% b
##      [,1]
## [1,]   39

The geometric length of a vector (its Euclidean length, by Pythagoras) is:

sqrt(sum(a*a))
## [1] 5

2.2 Matrix operations

We can create a matrix with matrix, rbind (row bind), or cbind (column bind).

matrix(c(1,2,3,4), nrow=2, ncol=2)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
rbind(c(1,3), c(2,4))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
cbind(c(1,2), c(3,4))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
X <- rbind(
    c(1,0),
    c(1,0),
    c(1,1),
    c(1,1))
X
##      [,1] [,2]
## [1,]    1    0
## [2,]    1    0
## [3,]    1    1
## [4,]    1    1
class(X)
## [1] "matrix" "array"

The matrix transpose is obtained with t.

t(X)
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    0    0    1    1

Matrix multiplication is performed with %*%. The result contains the dot product of each row of the left-hand matrix with each column of the right-hand matrix. %*% treats a vector as either a single-column or single-row matrix, whichever makes sense for the multiplication. Actually all we need today is to multiply a matrix by a vector, in which case we get the dot product of each row of the matrix with the vector.

X %*% a
##      [,1]
## [1,]    3
## [2,]    3
## [3,]    7
## [4,]    7
as.vector(X %*% a)
## [1] 3 3 7 7

2.3 Challenge - use a dot product to calculate

The following dot product is an elaborate way to retrieve x[2]:

x <- c(10,20,30,40)
weights <- c(0,1,0,0)       # <-- modify this line
sum(weights*x)

Modify weights in the above to calculate different quantities:

A. x[3]-x[2]

B. The mean of all four values.

3 Single numerical predictor

The age (years) and height (cm) of 10 people have been measured. We want a model that can predict height based on age.

people <- read_csv(
    "age, height
      10,    131
      14,    147
      16,    161
       9,    136
      16,    170
      15,    160
      15,    153
      21,    187
       9,    145
      21,    195")

ggplot(people, aes(x=age, y=height)) + geom_point()

fit <- lm(height ~ age, data=people)

fit
## 
## Call:
## lm(formula = height ~ age, data = people)
## 
## Coefficients:
## (Intercept)          age  
##      92.612        4.513

Coefficients are extracted with coef:

coef(fit)
## (Intercept)         age 
##   92.611502    4.512911

The residual standard deviation is extracted with sigma:

sigma(fit)
## [1] 7.263536

Behind the scenes a matrix of predictors has been produced from the mysterious notation ~ age. We can examine it explicitly:

model.matrix(fit)

model.matrix can be used without first calling lm.

model.matrix(~ age, data=people)
##    (Intercept) age
## 1            1  10
## 2            1  14
## 3            1  16
## 4            1   9
## 5            1  16
## 6            1  15
## 7            1  15
## 8            1  21
## 9            1   9
## 10           1  21
## attr(,"assign")
## [1] 0 1

n=10 observations minus p=2 columns in the model matrix leaves 8 residual degrees of freedom:

df.residual(fit)
## [1] 8

3.1 Prediction

predict predicts. By default it produces predictions on the original dataset.

predict(fit)
##        1        2        3        4        5        6        7        8 
## 137.7406 155.7923 164.8181 133.2277 164.8181 160.3052 160.3052 187.3826 
##        9       10 
## 133.2277 187.3826
predict(fit, interval="confidence")
##         fit      lwr      upr
## 1  137.7406 129.8100 145.6712
## 2  155.7923 150.4399 161.1446
## 3  164.8181 159.2250 170.4111
## 4  133.2277 124.3009 142.1545
## 5  164.8181 159.2250 170.4111
## 6  160.3052 154.9836 165.6267
## 7  160.3052 154.9836 165.6267
## 8  187.3826 177.6105 197.1547
## 9  133.2277 124.3009 142.1545
## 10 187.3826 177.6105 197.1547

(Note: This is a 95% confidence interval for uncertainty in the model. Individuals additionally vary with standard deviation sigma.)
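
predict can also produce a prediction interval, which describes where individual observations are expected to fall and so is wider than the confidence interval because it includes the residual variation sigma. A quick sketch:

# Interval for individual observations rather than for the mean
predict(fit, interval="prediction")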

We can also calculate predictions manually.

# Prediction for a 15-year old
x <- c(1, 15)
beta <- coef(fit)
sum(x * beta)
## [1] 160.3052
# Prediction for all original data
X <- model.matrix(fit)
as.vector( X %*% beta )
##  [1] 137.7406 155.7923 164.8181 133.2277 164.8181 160.3052 160.3052 187.3826
##  [9] 133.2277 187.3826

predict can be used with new data.

new_people <- tibble(age=5:25)
predict(fit, new_people)
##        1        2        3        4        5        6        7        8 
## 115.1761 119.6890 124.2019 128.7148 133.2277 137.7406 142.2535 146.7664 
##        9       10       11       12       13       14       15       16 
## 151.2793 155.7923 160.3052 164.8181 169.3310 173.8439 178.3568 182.8697 
##       17       18       19       20       21 
## 187.3826 191.8955 196.4085 200.9214 205.4343
new_predictions <- cbind(
    new_people, 
    predict(fit, new_people, interval="confidence"))

ggplot() +
    geom_ribbon(aes(x=age, ymin=lwr, ymax=upr), data=new_predictions, fill="grey") +
    geom_line(aes(x=age, y=fit), data=new_predictions, color="blue") +
    geom_point(aes(x=age, y=height), data=people) +
    labs(y="height (cm)", x="age (year)", 
         subtitle="Ribbon shows 95% confidence interval of the model")

If you have ever used geom_smooth, it should now be a little less mysterious.

ggplot(people, aes(x=age, y=height)) + geom_smooth(method="lm") + geom_point()
## `geom_smooth()` using formula = 'y ~ x'

3.2 Residuals

The residuals are the differences between the actual and predicted values.

residuals(fit)
##          1          2          3          4          5          6          7 
## -6.7406103 -8.7922535 -3.8180751  2.7723005  5.1819249 -0.3051643 -7.3051643 
##          8          9         10 
## -0.3826291 11.7723005  7.6173709

There should be no remaining relationship between the predictions and the residuals (or between any individual predictor and the residuals).

plot(predict(fit), residuals(fit))

A Q-Q (quantile-quantile) plot sorts the residuals and compares them to what would be expected from a normal distribution.

qqnorm(residuals(fit))
qqline(residuals(fit))

Ideally the points would lie close to the line, but deviations are not a disaster. Due to the Central Limit Theorem, our coefficient estimates will tend to have normally distributed errors even if the data does not. Wild outliers should be investigated, as they may have a large effect on the model. We will see further examples of things to look for in a Q-Q plot in section 6.

plot(fit) produces a series of more sophisticated diagnostic plots.

plot(fit)

4 Single factor predictor, two levels

Consider a simple experiment where some outcome is measured for an untreated and a treated group. This can be viewed as a one-way ANalysis Of VAriance (ANOVA) experiment. (This is one of two senses in which the term ANOVA will be used today.)

outcomes <- read_csv(
       "group, outcome
    untreated,  4.98
    untreated,  5.17
    untreated,  5.66
    untreated,  4.87
      treated,  8.07
      treated, 11.02
      treated,  9.91")

outcomes$group <- factor(outcomes$group, c("untreated", "treated"))
outfit <- lm(outcome ~ group, data=outcomes)
outfit
## 
## Call:
## lm(formula = outcome ~ group, data = outcomes)
## 
## Coefficients:
##  (Intercept)  grouptreated  
##        5.170         4.497
df.residual(outfit)
## [1] 5
sigma(outfit)
## [1] 0.9804353
model.matrix(outfit)
##   (Intercept) grouptreated
## 1           1            0
## 2           1            0
## 3           1            0
## 4           1            0
## 5           1            1
## 6           1            1
## 7           1            1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

4.1 How coefficients are estimated

Coefficients can be estimated from responses by multiplying by the “Moore-Penrose generalized inverse” of X. It can be useful to examine this matrix to work out exactly what a fit is doing. Each row shows how the corresponding coefficient is estimated.

X <- model.matrix(outfit)
y <- outcomes$outcome
round(ginv(X), 3)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
## [1,]  0.25  0.25  0.25  0.25 0.000 0.000 0.000
## [2,] -0.25 -0.25 -0.25 -0.25 0.333 0.333 0.333
ginv(X) %*% y
##          [,1]
## [1,] 5.170000
## [2,] 4.496667

Here we can see the first coefficient is the average of the “untreated” samples, and the second is the average of the “treated” samples minus the average of the “untreated” samples.

( y contains noise, assumed to be identically normally distributed for each observation. Transformation of this noise by ginv(X) tells us the distribution of errors in the coefficients (see vcov()). This can be further propagated to give the distribution of errors in predictions, and in other linear combinations of coefficients. )
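
As a concrete check of this idea, the covariance matrix of the coefficient estimates reported by vcov() should match sigma squared times ginv(X) times its transpose. A small sketch using the objects defined above:

# Propagate the residual variance through ginv(X) ...
sigma(outfit)^2 * ginv(X) %*% t(ginv(X))
# ... which should match the covariance matrix of the coefficients
vcov(outfit)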

4.2 Class exercise - the meanings of coefficients

We now consider the formula outcome ~ 0 + group.

Examine the model matrix that will be used:

model.matrix(~ 0 + group, data=outcomes)
  1. What column has been removed because 0 + was used?

  2. R has responded to this by being a bit clever when representing the factor. What column has been added?

  3. The mean of the untreated group is 5.2, the mean of the treated group is 9.7, and the difference between them is 4.5. Without using lm, what values should the coefficients have to best fit the data?

Now perform the actual linear model fit:

outfit2 <- lm(outcome ~ 0 + group, data=outcomes)
  1. Looking at the residuals and sigma, does the new model fit the data better or worse than the original?

4.3 Testing a hypothesis

Besides describing data with categorical predictors, the term ANOVA is also used to refer to the use of the F test. Significance is judged by comparing the Residual Sums of Squares of two models. We fit a model representing a null hypothesis. This model formula must nest within our original model formula: any prediction it can make must also be one the original model formula could make. We compare the models using the anova function.

outfit0 <- lm(outcome ~ 1, data=outcomes)

anova(outfit0, outfit)
## Analysis of Variance Table
## 
## Model 1: outcome ~ 1
## Model 2: outcome ~ group
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
## 1      6 39.469                               
## 2      5  4.806  1    34.663 36.06 0.001839 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Warning: This is not the only way to use the anova( ) function, but I think it is the safest way. Once we start using multiple predictors, the meaning of the output from anova with a single model is likely to be not quite what you want; read the documentation carefully. drop1(fit, test="F") should usually be used instead.
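
For the current model, drop1 looks like this, and should reproduce the same F test as the anova comparison above:

drop1(outfit, test="F")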

summary( ) also outputs p-values. However it outputs too many p-values: summary( ) doesn’t respect the hierarchy of terms in the model, and the p-value for dropping the intercept is nonsense. The p-values are based on a t statistic, where F=t^2.

summary(outfit)
## 
## Call:
## lm(formula = outcome ~ group, data = outcomes)
## 
## Residuals:
##          1          2          3          4          5          6          7 
## -1.900e-01  7.992e-16  4.900e-01 -3.000e-01 -1.597e+00  1.353e+00  2.433e-01 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.1700     0.4902  10.546 0.000132 ***
## grouptreated   4.4967     0.7488   6.005 0.001839 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9804 on 5 degrees of freedom
## Multiple R-squared:  0.8782, Adjusted R-squared:  0.8539 
## F-statistic: 36.06 on 1 and 5 DF,  p-value: 0.001839

confint( ) tells us not only that the difference between groups is non-zero but places a confidence interval on the difference. If the p-value were 0.05, the confidence interval would just touch zero. Whenever we reject a hypothesis that a single coefficient is zero, we may also conclude that we know its sign.

confint(outfit)
##                 2.5 %   97.5 %
## (Intercept)  3.909855 6.430145
## grouptreated 2.571764 6.421569

These results exactly match those of a t-test.

t.test(outcomes$outcome[5:7], outcomes$outcome[1:4], var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  outcomes$outcome[5:7] and outcomes$outcome[1:4]
## t = 6.005, df = 5, p-value = 0.001839
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.571764 6.421569
## sample estimates:
## mean of x mean of y 
##  9.666667  5.170000

4.4 Challenge - does height change with age?

Return to the people dataset.

  1. Can we reject the hypothesis that height is unrelated to age?

  2. Compare the result to the outcome of a correlation test using cor.test( ).

  3. What is the 95% confidence interval on the slope, in cm per year?

5 Multiple factors, many levels

Particle sizes of PVC plastic produced by a machine are measured. The machine is operated by three different people, and eight different batches of resin are used. Two measurements are made for each combination of these two experimental factors.

(This example is adapted from a data-set in the faraway package.)

pvc <- read_csv("r-linear-files/pvc.csv")
## Rows: 48 Columns: 3
## ── Column specification ──────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): operator, resin
## dbl (1): psize
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pvc$operator <- factor(pvc$operator)
pvc$resin <- factor(pvc$resin)

ggplot(pvc, aes(x=resin, y=psize)) + geom_point() + facet_grid(~operator)

5.1 Main effects

pvcfit1 <- lm(psize ~ operator + resin, data=pvc)
summary(pvcfit1)
## 
## Call:
## lm(formula = psize ~ operator + resin, data = pvc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9500 -0.6125 -0.0167  0.4063  3.6833 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.2396     0.5226  69.345  < 2e-16 ***
## operatorBob   -0.2625     0.4048  -0.648  0.52059    
## operatorCarl  -1.5063     0.4048  -3.721  0.00064 ***
## resinR2       -1.0333     0.6610  -1.563  0.12630    
## resinR3       -5.8000     0.6610  -8.774 1.13e-10 ***
## resinR4       -6.1833     0.6610  -9.354 2.11e-11 ***
## resinR5       -4.8000     0.6610  -7.261 1.09e-08 ***
## resinR6       -5.4500     0.6610  -8.245 5.46e-10 ***
## resinR7       -2.9167     0.6610  -4.412 8.16e-05 ***
## resinR8       -0.1833     0.6610  -0.277  0.78302    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.145 on 38 degrees of freedom
## Multiple R-squared:  0.8595, Adjusted R-squared:  0.8262 
## F-statistic: 25.82 on 9 and 38 DF,  p-value: 1.474e-13
confint(pvcfit1)
##                  2.5 %     97.5 %
## (Intercept)  35.181635 37.2975319
## operatorBob  -1.081983  0.5569834
## operatorCarl -2.325733 -0.6867666
## resinR2      -2.371544  0.3048775
## resinR3      -7.138211 -4.4617892
## resinR4      -7.521544 -4.8451225
## resinR5      -6.138211 -3.4617892
## resinR6      -6.788211 -4.1117892
## resinR7      -4.254877 -1.5784559
## resinR8      -1.521544  1.1548775

This model assumes the influence of the two factors is additive; the model only contains the “main effects”. The meanings of the coefficients are:

  • “(Intercept)” is the particle size for Alice and R1
  • “operatorBob” is the step in particle size going from Alice to Bob
  • “operatorCarl” is the step in particle size going from Alice to Carl
  • “resinR2” is the step in particle size going from R1 to R2
  • “resinR3” is the step in particle size going from R1 to R3
  • (etc)

We can use anova( ) to test if there is evidence that either of these main effects is important. For example, to test if there is evidence that the operator is important to the outcome we can test pvcfit1 against a model in which operator is dropped:

pvcfit0 <- lm(psize ~ resin, data=pvc)
anova(pvcfit0, pvcfit1)
## Analysis of Variance Table
## 
## Model 1: psize ~ resin
## Model 2: psize ~ operator + resin
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)   
## 1     40 70.533                              
## 2     38 49.815  2    20.718 7.902 0.00135 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The drop1( ) function with test="F" can be used as a shorthand to test dropping each term that can be dropped from the model formula.

drop1(pvcfit1, test="F")
## Single term deletions
## 
## Model:
## psize ~ operator + resin
##          Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>                 49.82 21.782                      
## operator  2    20.718  70.53 34.474   7.902   0.00135 ** 
## resin     7   283.946 333.76 99.083  30.943 8.111e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.2 Interactions

We can ask if there is any interaction between the two factors. For example Carl might produce particularly small particles with R5. An additive model doesn’t allow for this.

pvcfit2 <- lm(psize ~ operator + resin + operator:resin, data=pvc)
# or
pvcfit2 <- lm(psize ~ operator*resin, data=pvc)

pvcfit2
## 
## Call:
## lm(formula = psize ~ operator * resin, data = pvc)
## 
## Coefficients:
##          (Intercept)           operatorBob          operatorCarl  
##                36.25                 -0.85                 -0.95  
##              resinR2               resinR3               resinR4  
##                -1.10                 -5.55                 -6.55  
##              resinR5               resinR6               resinR7  
##                -4.40                 -6.05                 -3.35  
##              resinR8   operatorBob:resinR2  operatorCarl:resinR2  
##                 0.55                  1.05                 -0.85  
##  operatorBob:resinR3  operatorCarl:resinR3   operatorBob:resinR4  
##                -0.20                 -0.55                  1.20  
## operatorCarl:resinR4   operatorBob:resinR5  operatorCarl:resinR5  
##                -0.10                  0.40                 -1.60  
##  operatorBob:resinR6  operatorCarl:resinR6   operatorBob:resinR7  
##                 1.30                  0.50                  0.45  
## operatorCarl:resinR7   operatorBob:resinR8  operatorCarl:resinR8  
##                 0.85                  0.50                 -2.70

This model allows for interactions between the two factors. There are enough predictors in the model matrix that each combination of levels can take on a distinct value. So we now have

  • “(Intercept)” is particle size for Alice and R1
  • “operatorBob” is the step in particle size from Alice to Bob, for R1
  • “operatorCarl” is the step in particle size from Alice to Carl, for R1
  • “resinR2” is the step in particle size from R1 to R2, for Alice
  • (etc)
  • “operatorBob:resinR2” is the particle size for Bob and R2, relative to (Intercept)+operatorBob+resinR2.
  • (etc)
anova(pvcfit1, pvcfit2)
## Analysis of Variance Table
## 
## Model 1: psize ~ operator + resin
## Model 2: psize ~ operator * resin
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     38 49.815                           
## 2     24 35.480 14    14.335 0.6926 0.7599

We do not reject the main effects model. Our data does not provide evidence that the interaction model is needed. Until fresh data demands an interaction model, we will proceed with the main effects model.

Note: Ideally we would have checked for evidence of an interaction effect before examining the main effects model in the previous section.

5.3 Contrasts and confidence intervals

anova( ) lets us test if a particular factor or interaction is needed at all, and summary( ) allows us to see if any levels of a factor differ from the first level. However we may wish to perform different comparisons of the levels of a factor – this is called a “contrast”. We might also want to test some more complicated combination of coefficients such as a difference between two hypothetical individuals. In general this is called a “linear hypothesis” or a “general linear hypothesis”.

Say we want to compare Bob and Carl’s particle sizes. We will use the pvcfit1 model.

coef(pvcfit1)
##  (Intercept)  operatorBob operatorCarl      resinR2      resinR3      resinR4 
##   36.2395833   -0.2625000   -1.5062500   -1.0333333   -5.8000000   -6.1833333 
##      resinR5      resinR6      resinR7      resinR8 
##   -4.8000000   -5.4500000   -2.9166667   -0.1833333
K <- rbind(Carl_vs_Bob = c(0, -1,1, 0,0,0,0,0,0,0))

K %*% coef(pvcfit1)
##                 [,1]
## Carl_vs_Bob -1.24375

This is the estimated difference in particle size between Carl and Bob, but can we trust it? The glht function from multcomp can tell us. GLHT stands for General Linear Hypothesis Test.

library(multcomp)

result <- glht(pvcfit1, K)
result
## 
##   General Linear Hypotheses
## 
## Linear Hypotheses:
##                  Estimate
## Carl_vs_Bob == 0   -1.244
summary(result)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)   
## Carl_vs_Bob == 0  -1.2437     0.4048  -3.072  0.00391 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(result)
## 
##   Simultaneous Confidence Intervals
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Quantile = 2.0244
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                  Estimate lwr     upr    
## Carl_vs_Bob == 0 -1.2437  -2.0632 -0.4243

glht can test multiple hypotheses at once. By default it applies a multiple testing correction when doing so. This is a generalization of Tukey’s Honestly Significant Differences.

K <- rbind(
    Bob_vs_Alice  = c(0,  1,0, 0,0,0,0,0,0,0),
    Carl_vs_Alice = c(0,  0,1, 0,0,0,0,0,0,0),
    Carl_vs_Bob   = c(0, -1,1, 0,0,0,0,0,0,0))
result <- glht(pvcfit1, K)
summary(result)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Linear Hypotheses:
##                    Estimate Std. Error t value Pr(>|t|)   
## Bob_vs_Alice == 0   -0.2625     0.4048  -0.648  0.79431   
## Carl_vs_Alice == 0  -1.5063     0.4048  -3.721  0.00177 **
## Carl_vs_Bob == 0    -1.2437     0.4048  -3.072  0.01056 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(result)
## 
##   Simultaneous Confidence Intervals
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Quantile = 2.4397
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                    Estimate lwr     upr    
## Bob_vs_Alice == 0  -0.2625  -1.2501  0.7251
## Carl_vs_Alice == 0 -1.5063  -2.4938 -0.5187
## Carl_vs_Bob == 0   -1.2437  -2.2313 -0.2562
plot(result)

We can also turn off the multiple testing correction.

summary(result, test=adjusted("none"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Linear Hypotheses:
##                    Estimate Std. Error t value Pr(>|t|)    
## Bob_vs_Alice == 0   -0.2625     0.4048  -0.648  0.52059    
## Carl_vs_Alice == 0  -1.5063     0.4048  -3.721  0.00064 ***
## Carl_vs_Bob == 0    -1.2437     0.4048  -3.072  0.00391 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

A reasonable compromise between these extremes is Benjamini and Hochberg’s False Discovery Rate (FDR) correction.

summary(result, test=adjusted("fdr"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Linear Hypotheses:
##                    Estimate Std. Error t value Pr(>|t|)   
## Bob_vs_Alice == 0   -0.2625     0.4048  -0.648  0.52059   
## Carl_vs_Alice == 0  -1.5063     0.4048  -3.721  0.00192 **
## Carl_vs_Bob == 0    -1.2437     0.4048  -3.072  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- fdr method)

Finally, we can ask if any of the linear combinations is non-zero, i.e. whether the model with all three constraints applied can be rejected. This is equivalent to the anova( ) tests we have done earlier. (Note that while we have three constraints, the reduction in degrees of freedom is 2, as any two of the constraints are sufficient. This makes me uneasy, as it relies on numerical accuracy; it is safer to just use any two of the constraints, as sketched below.)

summary(result, test=Ftest())
## 
##   General Linear Hypotheses
## 
## Linear Hypotheses:
##                    Estimate
## Bob_vs_Alice == 0   -0.2625
## Carl_vs_Alice == 0  -1.5063
## Carl_vs_Bob == 0    -1.2437
## 
## Global Test:
##       F DF1 DF2  Pr(>F)
## 1 7.902   2  38 0.00135
pvcfit0 <- lm(psize ~ resin, data=pvc)
anova(pvcfit0, pvcfit1)
## Analysis of Variance Table
## 
## Model 1: psize ~ resin
## Model 2: psize ~ operator + resin
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)   
## 1     40 70.533                              
## 2     38 49.815  2    20.718 7.902 0.00135 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This demonstrates that the two methods of testing hypotheses, the ANOVA test and general linear hypotheses, are equivalent.
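
As a sketch of the earlier point about redundant constraints, using only two of the three rows of K should produce the same global F test without relying on rank detection:

# Any two of the three constraints span the same hypothesis space
summary( glht(pvcfit1, K[1:2,]), test=Ftest() )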

5.4 Heteroscedasticity

ggplot(pvc, aes(x=resin, y=residuals(pvcfit1))) + 
    geom_point() + geom_hline(yintercept=0) + facet_grid(~operator)

Our assumption that the residual noise is identically normally distributed for every observation may not hold. Carl’s data seems to have a greater standard deviation than Alice’s or Bob’s. When comparing Alice and Bob’s results, including Carl’s data in the model may alter the outcome.

5.5 Challenge - examine contrasts

Using the pvcfit1 model, construct linear hypotheses to see if the effect of:

  1. R8 is different to R4
  2. R2 is different to R1

5.6 Easier ways to specify contrasts

So far we have been manually constructing linear hypotheses. The multcomp package automates this for some common situations. To compare all pairs of levels within a factor, use the mcp function, giving the factor to test as the name of an argument and specifying to test “Tukey” contrasts:

result <- glht(pvcfit1, mcp(operator="Tukey"))

To compare the first level of a factor to all other levels, specify to test “Dunnett” contrasts:

result <- glht(pvcfit1, mcp(operator="Dunnett"))

The linear hypotheses actually used can be inspected with result$linfct.

The emmeans package also automates many common comparisons.

library(emmeans)

# Mean of the predictions for a set of typical individuals. 
# Not the mean of your input data!
operator_means <- emmeans(pvcfit1, ~ operator)
operator_means
##  operator emmean    SE df lower.CL upper.CL
##  Alice      32.9 0.286 38     32.4     33.5
##  Bob        32.7 0.286 38     32.1     33.3
##  Carl       31.4 0.286 38     30.9     32.0
## 
## Results are averaged over the levels of: resin 
## Confidence level used: 0.95
# Differences in the mean of predictions between groups.
operator_pairwise <- emmeans(pvcfit1, pairwise ~ operator)
operator_pairwise
## $emmeans
##  operator emmean    SE df lower.CL upper.CL
##  Alice      32.9 0.286 38     32.4     33.5
##  Bob        32.7 0.286 38     32.1     33.3
##  Carl       31.4 0.286 38     30.9     32.0
## 
## Results are averaged over the levels of: resin 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast     estimate    SE df t.ratio p.value
##  Alice - Bob     0.263 0.405 38   0.648  0.7944
##  Alice - Carl    1.506 0.405 38   3.721  0.0018
##  Bob - Carl      1.244 0.405 38   3.072  0.0107
## 
## Results are averaged over the levels of: resin 
## P value adjustment: tukey method for comparing a family of 3 estimates
confint(operator_pairwise$contrasts)
##  contrast     estimate    SE df lower.CL upper.CL
##  Alice - Bob     0.263 0.405 38   -0.725     1.25
##  Alice - Carl    1.506 0.405 38    0.519     2.49
##  Bob - Carl      1.244 0.405 38    0.257     2.23
## 
## Results are averaged over the levels of: resin 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
operator_vs_control <- emmeans(pvcfit1, trt.vs.ctrl ~ operator)
operator_vs_control
## $emmeans
##  operator emmean    SE df lower.CL upper.CL
##  Alice      32.9 0.286 38     32.4     33.5
##  Bob        32.7 0.286 38     32.1     33.3
##  Carl       31.4 0.286 38     30.9     32.0
## 
## Results are averaged over the levels of: resin 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast     estimate    SE df t.ratio p.value
##  Bob - Alice    -0.263 0.405 38  -0.648  0.7349
##  Carl - Alice   -1.506 0.405 38  -3.721  0.0013
## 
## Results are averaged over the levels of: resin 
## P value adjustment: dunnettx method for 2 tests
confint(operator_vs_control$contrasts)
##  contrast     estimate    SE df lower.CL upper.CL
##  Bob - Alice    -0.263 0.405 38    -1.20    0.672
##  Carl - Alice   -1.506 0.405 38    -2.44   -0.571
## 
## Results are averaged over the levels of: resin 
## Confidence level used: 0.95 
## Conf-level adjustment: dunnettx method for 2 estimates

emmeans does not calculate means from your data directly. Instead it calculates the mean of the predictions for a set of typical individuals (the “reference grid”). If individuals in your groups are affected by known variables, and you include these variables as predictors in your model, emmeans will naturally adjust for these to provide fair comparisons between groups.

By default emmeans will use a reference grid with one individual for each combination of the levels of the factors in your model. For any numeric predictor, the average of the input data is used. It is up to you to decide if this is sensible! This approach provides reasonable means even in models with interaction terms.
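
The reference grid being used can be inspected directly with emmeans’s ref_grid function, which is a worthwhile sanity check:

# The set of typical individuals that emmeans averages over
ref_grid(pvcfit1)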

emmeans is a super cool package, and has great documentation. In practice, this is usually our preferred option.

6 Gene expression example

Tooth growth in mouse embryos is studied using RNA-Seq. The RNA expression levels of several genes are examined in the cells that form the upper and lower first molars, in eight individual mouse embryos that have been dissected after different times of embryo development. The measurements are in terms of “Reads Per Million”, essentially the fraction of RNA in each sample belonging to each gene, times 1 million.

(This data was extracted from ARCHS4 (https://amp.pharm.mssm.edu/archs4/). In the Gene Expression Omnibus it is entry GSE76316. The sample descriptions in GEO seem to be out of order, but reading the associated paper and the genes they talk about I think I have the correct order of samples!)

teeth <- read_csv("r-linear-files/teeth.csv")
## Rows: 16 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sample, mouse, tooth
## dbl (5): day, gene_ace, gene_smoc1, gene_pou3f3, gene_wnt2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
teeth$tooth <- factor(teeth$tooth, levels=c("lower","upper"))
teeth$mouse <- factor(teeth$mouse)

It will be convenient to have a quick way to examine different genes and different models with this data.

# A convenience to examine different model fits
more_data <- expand.grid(
    day=seq(14.3,18.2,by=0.01),
    tooth=factor(c("lower","upper"), levels=c("lower","upper")))

look <- function(y, fit=NULL) {
    p <- ggplot(teeth,aes(x=day,group=tooth))
    if (!is.null(fit)) {
        more_ci <- cbind(
            more_data, 
            predict(fit, more_data, interval="confidence"))
        p <- p + 
            geom_ribbon(aes(ymin=lwr,ymax=upr), data=more_ci, alpha=0.1) + 
            geom_line(aes(y=fit,color=tooth), data=more_ci)
    }
    p + geom_point(aes(y=y,color=tooth)) +
        labs(y=deparse(substitute(y)))
}

# Try it out
look(teeth$gene_ace)

We could treat day as a categorical variable, as in the previous section. However let us treat it as numerical, and see where that leads.

6.1 Transformation

6.1.1 Ace gene

acefit <- lm(gene_ace ~ tooth + day, data=teeth)

look(teeth$gene_ace, acefit)

Two problems:

  1. The actual data appears to be curved, our straight lines are not a good fit.
  2. The predictions fall below zero, a physical impossibility.

In this case, log transformation of the data will solve both these problems.

log2_acefit <- lm( log2(gene_ace) ~ tooth + day, data=teeth)

look(log2(teeth$gene_ace), log2_acefit)

Various transformations of y are possible. Log transformation is commonly used in the context of gene expression. Square root transformation can also be appropriate with nicely behaved count data (technically, if the errors follow a Poisson distribution). This gene expression data is ultimately count based, but is overdispersed compared to the Poisson distribution so square root transformation isn’t appropriate in this case. The Box-Cox transformations provide a spectrum of further options.
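
The boxcox function in MASS (already loaded) can help choose among these options by profiling the likelihood over a range of power transformations; a peak near lambda = 0 supports log transformation. A quick sketch using the untransformed ace model:

# Profile likelihood across Box-Cox transformations of the response
boxcox(gene_ace ~ tooth + day, data=teeth)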

6.1.2 Pou3f3 gene

In the case of the Pou3f3 gene, the log transformation is even more important. It looks like gene expression changes at different rates in the upper and lower molars; that is, there is a significant interaction between tooth and day.

pou3f3fit0 <- lm(gene_pou3f3 ~ tooth + day, data=teeth)
look(teeth$gene_pou3f3, pou3f3fit0)

pou3f3fit1 <- lm(gene_pou3f3 ~ tooth * day, data=teeth)
look(teeth$gene_pou3f3, pou3f3fit1)

anova(pou3f3fit0, pou3f3fit1)
## Analysis of Variance Table
## 
## Model 1: gene_pou3f3 ~ tooth + day
## Model 2: gene_pou3f3 ~ tooth * day
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     13 4849.7                                  
## 2     12  415.2  1    4434.5 128.15 9.234e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(pou3f3fit1)["toothupper:day",]
##     2.5 %    97.5 % 
## -34.65685 -23.46937

The slopes of the lines confidently differ by at least 23.5 RPM per day.

Examining the residuals reveals a further problem: larger expression values are associated with larger residuals.

look(residuals(pou3f3fit1))

plot(predict(pou3f3fit1), residuals(pou3f3fit1))

qqnorm(residuals(pou3f3fit1))
qqline(residuals(pou3f3fit1))

Log transformation both removes the interaction and makes the residuals more uniform (except for one outlier).

log2_pou3f3fit0 <- lm(log2(gene_pou3f3) ~ tooth + day, data=teeth)
log2_pou3f3fit1 <- lm(log2(gene_pou3f3) ~ tooth * day, data=teeth)

anova(log2_pou3f3fit0, log2_pou3f3fit1)
## Analysis of Variance Table
## 
## Model 1: log2(gene_pou3f3) ~ tooth + day
## Model 2: log2(gene_pou3f3) ~ tooth * day
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     13 0.56714                           
## 2     12 0.56579  1 0.0013576 0.0288 0.8681
confint(log2_pou3f3fit1)["toothupper:day",]
##      2.5 %     97.5 % 
## -0.2225600  0.1903981

The difference of the slopes of log2 gene expression between lower and upper molars is confidently between -0.22 and 0.19.

look(log2(teeth$gene_pou3f3), log2_pou3f3fit0)

qqnorm(residuals(log2_pou3f3fit0))
qqline(residuals(log2_pou3f3fit0))

6.2 Curve fitting

6.2.1 Smoc1 gene

log2_smoc1fit <- lm(log2(gene_smoc1) ~ tooth + day, data=teeth)

look(log2(teeth$gene_smoc1), log2_smoc1fit)

In this case, log transformation does not remove the curve. If you think this is a problem for linear models, you are mistaken! With a little feature engineering we can fit a quadratic curve. Calculations can be included in the formula if wrapped in I( ):

curved_fit <- lm(log2(gene_smoc1) ~ tooth + day + I(day^2), data=teeth)
look(log2(teeth$gene_smoc1), curved_fit)

Another way to do this would be to add the column to the data frame:

teeth$day_squared <- teeth$day^2
curved_fit2 <- lm(log2(gene_smoc1) ~ tooth + day + day_squared, data=teeth)

Finally, the poly( ) function can be used in a formula to fit polynomials of arbitrary degree. poly will encode day slightly differently, but produces an equivalent fit.

curved_fit3 <- lm(log2(gene_smoc1) ~ tooth + poly(day,2), data=teeth)
sigma(curved_fit)
## [1] 0.1630425
sigma(curved_fit2)
## [1] 0.1630425
sigma(curved_fit3)
## [1] 0.1630425

poly( ) can also be used to fit higher order polynomials, but these tend to become wobbly and extrapolate poorly. A better option may be to use the ns( ) or bs( ) functions in the splines package, which can be used to fit piecewise “B-splines”. In particular ns( ) (natural spline) is appealing because it extrapolates beyond the ends only with straight lines. If the data is cyclic (for example cell cycle or circadian time series), sine and cosine terms can be used to fit some number of harmonics from a Fourier series.

library(splines)
spline_fit <- lm(log2(gene_smoc1) ~ tooth * ns(day,3), data=teeth)

look(log2(teeth$gene_smoc1), spline_fit)
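
As an aside, for cyclic data the Fourier-series idea mentioned above might look something like the sketch below. The data frame and its columns are purely hypothetical, simulated here only so that the code runs:

# Hypothetical circadian time series: one harmonic of a 24-hour cycle
set.seed(1)
circadian <- data.frame(hour = rep(0:23, 2))
circadian$expression <- 5 + sin(2*pi*circadian$hour/24) + rnorm(48, sd=0.3)

cyclic_fit <- lm(expression ~ sin(2*pi*hour/24) + cos(2*pi*hour/24), data=circadian)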

6.3 Day is confounded with mouse

There may be individual differences between mice. We would like to take this into account in our model. In general it is common to include batch effect terms in a model in order to model the data correctly (and obtain smaller p-values), even if they are not directly of interest.

badfit <- lm(log2(gene_ace) ~ tooth + day + mouse, data=teeth)
summary(badfit)
## 
## Call:
## lm(formula = log2(gene_ace) ~ tooth + day + mouse, data = teeth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20562 -0.02335  0.00000  0.02335  0.20562 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.115920   0.658430 -19.920 2.01e-07 ***
## toothupper   -0.280467   0.070399  -3.984   0.0053 ** 
## day           0.992214   0.040228  24.665 4.59e-08 ***
## mouseind2     0.100073   0.131897   0.759   0.4728    
## mouseind3    -0.198260   0.125612  -1.578   0.1585    
## mouseind4    -0.122056   0.122349  -0.998   0.3517    
## mouseind5    -0.203226   0.122349  -1.661   0.1407    
## mouseind6    -0.009452   0.125612  -0.075   0.9421    
## mouseind7    -0.042441   0.131897  -0.322   0.7570    
## mouseind8           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1408 on 7 degrees of freedom
## Multiple R-squared:  0.9934, Adjusted R-squared:  0.9859 
## F-statistic: 131.9 on 8 and 7 DF,  p-value: 6.131e-07

In this case this is not possible, and R has arbitrarily dropped a predictor from the model. As a different mouse produced data for each different day, mouse is confounded with day. The model matrix suffers from multicollinearity: the day column can be constructed as a linear combination of the intercept column and the mouse columns. There is no single best choice of coefficients.

summary( lm(day ~ mouse, data=teeth) )
## Warning in summary.lm(lm(day ~ mouse, data = teeth)): essentially perfect fit:
## summary may be unreliable
## 
## Call:
## lm(formula = day ~ mouse, data = teeth)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.766e-16 -9.130e-18  0.000e+00  9.130e-18  4.766e-16 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 1.450e+01  2.214e-16 6.550e+16   <2e-16 ***
## mouseind2   5.000e-01  3.131e-16 1.597e+15   <2e-16 ***
## mouseind3   1.000e+00  3.131e-16 3.194e+15   <2e-16 ***
## mouseind4   1.500e+00  3.131e-16 4.791e+15   <2e-16 ***
## mouseind5   2.000e+00  3.131e-16 6.388e+15   <2e-16 ***
## mouseind6   2.500e+00  3.131e-16 7.985e+15   <2e-16 ***
## mouseind7   3.000e+00  3.131e-16 9.582e+15   <2e-16 ***
## mouseind8   3.500e+00  3.131e-16 1.118e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.131e-16 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.06e+31 on 7 and 8 DF,  p-value: < 2.2e-16

Another example of confounding would be an experiment in which each treatment is done in a separate batch.

Even if predictors are not perfectly multicollinear, correlation between predictors can make their coefficient estimates inaccurate. One way to check for this is to attempt to predict each of the predictors with a linear model that uses the remaining predictors (see “Variance Inflation Factor”).
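
A rough sketch of that check for the day predictor, here using only tooth as the other predictor (in this balanced design we expect a Variance Inflation Factor of essentially 1; the vif function in the car package automates this, if you have it installed):

# How well can day be predicted from the remaining predictors?
r2 <- summary( lm(day ~ tooth, data=teeth) )$r.squared
# Variance Inflation Factor for day
1 / (1 - r2)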

A possible solution to this problem would be to use a “mixed model”, but this is beyond the scope of today’s workshop.

6.4 Challenge - Wnt2 gene (20 minutes)

Look at the expression of gene Wnt2 in column gene_wnt2.

  1. Try some different model formulas.

  2. Justify a particular model by rejecting simpler alternatives using anova( ).

Some notes:

  • I haven’t given you a way to compare different transformations of the response variable gene_wnt2. This is a tricky problem! Here I ask you to eyeball the data or trust me when I say log transformation is appropriate.

  • An alternative approach is to rank models by the Akaike Information Criterion (AIC) or similar; a small example follows these notes. AIC lets you compare models even if they are not nested (but you still can’t compare different transformations of the response variable).

  • Mechanical step-wise comparison of models based on many possible variables raises problems similar to multiple testing, and will not properly control statistical significance. Use significance tests (or AIC) in moderation, taking into account your understanding of the domain. If there are nuisance variables that you are not sure if you need to adjust for, it is ok to leave them in a model whether or not they appear significant.
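
As a small example of the AIC approach, the three smoc1 curve fits from earlier share the same response and so can be ranked directly; lower AIC is better:

AIC(log2_smoc1fit, curved_fit, spline_fit)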

7 Testing many genes with limma

In this section we look at fitting the same matrix of predictors X to many different sets of responses y. We will use the package limma from Bioconductor. This is a very brief demonstration, and there is much more to this package. See the RNAseq123 tutorial at https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html and the usersguide.pdf at https://bioconductor.org/packages/release/bioc/html/limma.html

library(edgeR)
library(limma)

7.1 Load the data

Actually in the teeth dataset, the expression levels of all genes were measured!

counts_df <- read_csv("r-linear-files/teeth-read-counts.csv")
counts <- as.matrix( select(counts_df, -gene) )
rownames(counts) <- counts_df$gene

dim(counts)
## [1] 32544    16
counts[1:5,]
##               ind1lower ind1upper ind2lower ind2upper ind3lower ind3upper
## 0610007P14Rik      1721      1846      1708      1877      1443      1534
## 0610009B22Rik       512       466       554       504       501       498
## 0610009L18Rik        68        71       121       129        74       124
## 0610009O20Rik      2583      2797      2840      2975      2675      2690
## 0610010F05Rik      1079      1246      1262       992      1232       992
##               ind4lower ind4upper ind5lower ind5upper ind6lower ind6upper
## 0610007P14Rik      1389      1439      1699      1881      1697      1716
## 0610009B22Rik       478       427       569       614       577       568
## 0610009L18Rik        83        87        86       125        99       135
## 0610009O20Rik      2429      2313      2693      3093      2854      3066
## 0610010F05Rik      1011       977      1175      1384      1235      1221
##               ind7lower ind7upper ind8lower ind8upper
## 0610007P14Rik      1531      1625      1716      1373
## 0610009B22Rik       587       531       611       453
## 0610009L18Rik       130       132       114        92
## 0610009O20Rik      2487      2904      2713      2455
## 0610010F05Rik      1042       998      1079       882

7.2 Model matrix

The column names match our teeth data frame.

colnames(counts)
##  [1] "ind1lower" "ind1upper" "ind2lower" "ind2upper" "ind3lower" "ind3upper"
##  [7] "ind4lower" "ind4upper" "ind5lower" "ind5upper" "ind6lower" "ind6upper"
## [13] "ind7lower" "ind7upper" "ind8lower" "ind8upper"
teeth$sample
##  [1] "ind1lower" "ind1upper" "ind2lower" "ind2upper" "ind3lower" "ind3upper"
##  [7] "ind4lower" "ind4upper" "ind5lower" "ind5upper" "ind6lower" "ind6upper"
## [13] "ind7lower" "ind7upper" "ind8lower" "ind8upper"

We will use limma to fit a linear model to each gene. The same model formula will be used in each case. limma doesn’t automatically convert a formula into a model matrix, so we have to do this step manually. Here I am using a model formula that treats the upper and lower teeth as following a different linear trend over time.

X <- model.matrix(~ tooth * day, data=teeth)
X
##    (Intercept) toothupper  day toothupper:day
## 1            1          0 14.5            0.0
## 2            1          1 14.5           14.5
## 3            1          0 15.0            0.0
## 4            1          1 15.0           15.0
## 5            1          0 15.5            0.0
## 6            1          1 15.5           15.5
## 7            1          0 16.0            0.0
## 8            1          1 16.0           16.0
## 9            1          0 16.5            0.0
## 10           1          1 16.5           16.5
## 11           1          0 17.0            0.0
## 12           1          1 17.0           17.0
## 13           1          0 17.5            0.0
## 14           1          1 17.5           17.5
## 15           1          0 18.0            0.0
## 16           1          1 18.0           18.0
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$tooth
## [1] "contr.treatment"

limma refers to this matrix as the “design”.

7.3 Remove low expression genes

There is little chance of detecting differential expression in genes with very low read counts. Including these genes will require a larger False Discovery Rate correction, and also confuses limma’s Empirical Bayes parameter estimation. edgeR provides a function filterByExpr that performs their recommended level of filtering. The filtering takes into account the complexity of the model being fitted, so the “design” matrix is also needed.

keep <- filterByExpr(counts, design=X)
table(keep)
## keep
## FALSE  TRUE 
## 15333 17211
counts_kept <- counts[keep, ]
dim(counts_kept)
## [1] 17211    16

7.4 Normalize, log transform

Different samples may have produced different numbers of reads. We now normalize this away by converting the read counts to Reads Per Million, and log2 transform the results. There are some subtleties here which we breeze over lightly: “TMM” normalization is used as a small adjustment to the total number of reads in each sample. A small constant “prior count” is added to the counts to avoid calculating log2(0). The edgeR and limma manuals describe these steps in more detail.

dgelist <- calcNormFactors(DGEList(counts_kept))
dgelist$samples
##           group lib.size norm.factors
## ind1lower     1 37892888    0.9951385
## ind1upper     1 39776241    1.0006489
## ind2lower     1 43124853    0.9938861
## ind2upper     1 40966589    0.9996592
## ind3lower     1 41125302    0.9968494
## ind3upper     1 39605386    1.0118981
## ind4lower     1 37753999    0.9993671
## ind4upper     1 36824277    1.0073017
## ind5lower     1 41660207    1.0034294
## ind5upper     1 49706630    1.0005100
## ind6lower     1 44193014    1.0041810
## ind6upper     1 45913656    1.0068377
## ind7lower     1 39910979    1.0010187
## ind7upper     1 42871935    1.0073428
## ind8lower     1 44654050    0.9822320
## ind8upper     1 40770709    0.9901066
log2_cpms <- cpm(dgelist, log=TRUE, prior.count=1)

log2_cpms[1:5,]
##               ind1lower ind1upper ind2lower ind2upper ind3lower ind3upper
## 0610007P14Rik 5.5129652 5.5361621  5.317361  5.519084 5.1384370  5.259298
## 0610009B22Rik 3.7657231 3.5523692  3.694818  3.624206 3.6140952  3.638097
## 0610009L18Rik 0.8697107 0.8542553  1.509475  1.666292 0.8711053  1.640654
## 0610009O20Rik 6.0985127 6.1353861  6.050597  6.183266 6.0284518  6.069217
## 0610010F05Rik 4.8398640 4.9694229  4.881071  4.599741 4.9105363  4.630906
##               ind4lower ind4upper ind5lower ind5upper ind6lower ind6upper
## 0610007P14Rik  5.203127  5.278662  5.345784  5.242097  5.257910  5.215084
## 0610009B22Rik  3.665953  3.528015  3.769289  3.628794  3.703322  3.621879
## 0610009L18Rik  1.153048  1.244548  1.057459  1.343391  1.173027  1.557940
## 0610009O20Rik  6.009040  5.963023  6.009999  5.959246  6.007542  6.051981
## 0610010F05Rik  4.745215  4.720447  4.814139  4.799770  4.799773  4.724475
##               ind7lower ind7upper ind8lower ind8upper
## 0610007P14Rik  5.260979  5.234627  5.290864  5.089016
## 0610009B22Rik  3.879388  3.622865  3.802657  3.491332
## 0610009L18Rik  1.712786  1.623161  1.391294  1.203557
## 0610009O20Rik  5.960564  6.071824  5.951378  5.926959
## 0610010F05Rik  4.706284  4.531877  4.622031  4.451102

7.5 Fitting a model to and testing each gene

We are now ready to fit a linear model to each gene.

fit <- lmFit(log2_cpms, X)

class(fit)
## [1] "MArrayLM"
## attr(,"package")
## [1] "limma"
fit$coefficients[1:5,]
##               (Intercept) toothupper         day toothupper:day
## 0610007P14Rik    5.807656  1.3158722 -0.03179865    -0.08061826
## 0610009B22Rik    3.136273  0.5999070  0.03696200    -0.04605341
## 0610009L18Rik   -0.902155  1.4890594  0.13042420    -0.08089679
## 0610009O20Rik    6.611072  0.2355852 -0.03671146    -0.01261433
## 0610010F05Rik    5.819858  0.2543831 -0.06338421    -0.02250952

Significance testing in limma is done using linear hypotheses (which limma refers to as “contrasts”). A difference between glht and limma’s contrasts.fit is that limma expects the linear combinations to be in columns rather than rows.

We will first look for genes where the slope over time is not flat, averaging the lower and upper teeth.

# Lower slope: c(0,0,1,0)
# Upper slope: c(0,0,1,1)

K <- rbind( average_slope=c(0,0,1,0.5) )
cfit <- contrasts.fit(fit, t(K))         #linear combinations in columns!
efit <- eBayes(cfit, trend=TRUE)

The call to eBayes does Empirical Bayes squeezing of the residual variance for each gene (see Appendix B). This is a bit of magic that allows limma to work well with small numbers of samples.
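
limma’s plotSA function plots a measure of each gene’s residual variability against its average log expression, together with the fitted trend used by the squeezing, and is a useful sanity check:

# Mean-variance trend used by eBayes(trend=TRUE)
plotSA(efit)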

topTable(efit)
##              logFC  AveExpr         t      P.Value    adj.P.Val        B
## Rnf144b  0.6705081 4.483834  33.64061 8.221793e-16 1.415053e-11 25.97540
## Tfap2b  -0.6416163 7.468730 -28.77129 8.766956e-15 3.581088e-11 23.92465
## Six2    -0.5576619 6.978666 -28.60904 9.548454e-15 3.581088e-11 23.84856
## Vldlr    0.4147078 6.814521  28.51996 1.000874e-14 3.581088e-11 23.80655
## Prom2    0.9165496 3.676863  28.27714 1.138751e-14 3.581088e-11 23.69118
## Ace      0.9858071 2.813200  28.06009 1.279153e-14 3.581088e-11 23.58699
## Stk32a   1.2259001 3.708815  27.81963 1.456488e-14 3.581088e-11 23.47034
## Bambi    0.5612789 6.104506  26.74115 2.643012e-14 5.686111e-11 22.93116
## Msx2     0.8126541 6.253647  25.67587 4.872568e-14 9.317974e-11 22.37141
## Cox4i2   0.5686825 3.809406  24.67488 8.854784e-14 1.523997e-10 21.81903

The column adj.P.Val contains FDR adjusted p-values.

all_results <- topTable(efit, n=Inf)

significant <- all_results$adj.P.Val <= 0.05
table(significant)
## significant
## FALSE  TRUE 
## 10423  6788
ggplot(all_results, aes(x=AveExpr, y=logFC)) + 
    geom_point(size=0.1) +
    geom_point(data=all_results[significant,], size=0.5, color="red")

7.6 Relation to lm( ) and glht( )

Let’s look at a specific gene.

rnf144b <- log2_cpms["Rnf144b",]
rnf144b_fit <- lm(rnf144b ~ tooth * day, data=teeth)
look(rnf144b, rnf144b_fit)

We can use the same linear hypothesis with glht. The estimate is the same, but limma has gained some power by shrinking the variance toward the trend line, so limma’s p-value is smaller.

summary( glht(rnf144b_fit, K) )
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = rnf144b ~ tooth * day, data = teeth)
## 
## Linear Hypotheses:
##                    Estimate Std. Error t value Pr(>|t|)    
## average_slope == 0  0.67051    0.01875   35.77 1.46e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

7.7 Confidence intervals

Confidence Intervals should also be of interest. However note that these are not adjusted for multiple testing (see Appendix B).

topTable(efit, confint=0.95)
##              logFC       CI.L       CI.R  AveExpr         t      P.Value
## Rnf144b  0.6705081  0.6281126  0.7129035 4.483834  33.64061 8.221793e-16
## Tfap2b  -0.6416163 -0.6890509 -0.5941817 7.468730 -28.77129 8.766956e-15
## Six2    -0.5576619 -0.5991235 -0.5162002 6.978666 -28.60904 9.548454e-15
## Vldlr    0.4147078  0.3837783  0.4456372 6.814521  28.51996 1.000874e-14
## Prom2    0.9165496  0.8476051  0.9854941 3.676863  28.27714 1.138751e-14
## Ace      0.9858071  0.9110794  1.0605348 2.813200  28.06009 1.279153e-14
## Stk32a   1.2259001  1.1321692  1.3196310 3.708815  27.81963 1.456488e-14
## Bambi    0.5612789  0.5166334  0.6059244 6.104506  26.74115 2.643012e-14
## Msx2     0.8126541  0.7453317  0.8799765 6.253647  25.67587 4.872568e-14
## Cox4i2   0.5686825  0.5196602  0.6177048 3.809406  24.67488 8.854784e-14
##            adj.P.Val        B
## Rnf144b 1.415053e-11 25.97540
## Tfap2b  3.581088e-11 23.92465
## Six2    3.581088e-11 23.84856
## Vldlr   3.581088e-11 23.80655
## Prom2   3.581088e-11 23.69118
## Ace     3.581088e-11 23.58699
## Stk32a  3.581088e-11 23.47034
## Bambi   5.686111e-11 22.93116
## Msx2    9.317974e-11 22.37141
## Cox4i2  1.523997e-10 21.81903

7.8 F test

limma can also test several simultaneous constraints on linear combinations of coefficients. Suppose we want to find any deviation from a constant expression level. We can check for this with:

K2 <- rbind(
    c(0,1,0,0),
    c(0,0,1,0),
    c(0,0,0,1))

cfit2 <- contrasts.fit(fit, t(K2))
efit2 <- eBayes(cfit2, trend=TRUE)
topTable(efit2)
##               Coef1      Coef2       Coef3   AveExpr        F      P.Value
## Pou3f3    3.9677657 -0.3367111 -0.01422066 5.1309354 508.0331 1.388933e-15
## Rnf144b  -1.4817168  0.6363724  0.06827130 4.4838343 400.3615 8.469656e-15
## Six2     -0.8370468 -0.6082833  0.10124287 6.9786656 384.2097 1.157124e-14
## Tfap2b   -4.3550546 -0.7909806  0.29872853 7.4687297 322.7085 4.330436e-14
## Prom2     0.5804985  0.9612842 -0.08946912 3.6768631 313.2648 5.419522e-14
## Vldlr     2.1955450  0.4751431 -0.12087066 6.8145206 292.9660 8.985449e-14
## Nkx2-3  -12.4481924 -0.4507673  0.20535681 0.4715549 280.1517 1.258886e-13
## Ace      -0.5231634  0.9784323  0.01474955 2.8132003 266.6050 1.828639e-13
## Stk32a    0.8473950  1.2638212 -0.07584210 3.7088148 263.0716 2.021843e-13
## Bambi     1.5414027  0.6171353 -0.11171292 6.1045062 251.5382 2.832907e-13
##            adj.P.Val
## Pou3f3  2.390492e-11
## Rnf144b 6.638423e-11
## Six2    6.638423e-11
## Tfap2b  1.863278e-10
## Prom2   1.865508e-10
## Vldlr   2.577476e-10
## Nkx2-3  3.095240e-10
## Ace     3.866438e-10
## Stk32a  3.866438e-10
## Bambi   4.875716e-10

A shortcut here would be to skip the contrasts.fit step and specify a set of coefficients directly to topTable( ).
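
That shortcut might look something like this, naming the coefficients to test directly (the eBayes step is still required):

efit_direct <- eBayes(fit, trend=TRUE)
topTable(efit_direct, coef=c("toothupper", "day", "toothupper:day"))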

7.9 Further exercises

  1. Construct and use linear combinations to find genes that:

    A. Differ in slope between lower and upper molars.

    B. Differ in expression on day 16 between the lower and upper molars. (Hint: this can be viewed as the difference in predictions between two individual samples.)

    C. Construct a pair of linear combinations that when used together in an F test find genes with non-zero slope in either or both the lower and upper molars.

  2. Construct a model matrix in which the change over time is quadratic, and some linear combinations to interrogate it.

8 Appendix A - mixed models

In this workshop we have looked at linear models with fixed effects only. A linear mixed model contains both fixed and random effects. Random effects are appropriate for individuals drawn from a population. The model we obtain will include how much variation there is in this population.

Mixed models would be a whole further workshop, but I will attempt a rough outline here.

In R, mixed models are most commonly fitted using the lmer function from lme4. A point of confusion you will encounter is that there are a variety of ways to perform significance tests. The authors of lme4 duck out of this debate entirely and provide only t statistics. However, packages such as lmerTest, emmeans, and pbkrtest can provide p-values and CIs from an lmer model.

My recommendation for a safe choice, from my reading, is to fit the model using the REML method (the default for lmer) and then test it using the Kenward-Roger method. The Satterthwaite method is also decent, usually producing nearly identical results with less computation. Other methods may produce p-values that are too small with small datasets; unfortunately this includes the multcomp package that we used earlier. For more information see:

Luke, S.G. (2017). Evaluating significance in linear mixed-effects models in R. Behav Res 49, 1494–1502 https://doi.org/10.3758/s13428-016-0809-y

The differences between methods disappear when your data is large enough, but “large enough” might mean not just many rows in your data frame but also many levels for your random effects.

I will briefly demonstrate lmer with the PVC particle dataset, now treating resin batches as a random effect.

library(lmerTest)     # lmerTest builds on lme4, can produce p-values
library(emmeans)      # emmeans can provide Kenward-Roger CIs

pvc <- read_csv("r-linear-files/pvc.csv")
pvc$operator <- factor(pvc$operator)
pvc$resin <- factor(pvc$resin)

pvcfit_mixed <- lmer(psize ~ operator + (1|resin), data=pvc, REML=TRUE)
pvcfit_mixed
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: psize ~ operator + (1 | resin)
##    Data: pvc
## REML criterion at convergence: 172.2304
## Random effects:
##  Groups   Name        Std.Dev.
##  resin    (Intercept) 2.558   
##  Residual             1.145   
## Number of obs: 48, groups:  resin, 8
## Fixed Effects:
##  (Intercept)   operatorBob  operatorCarl  
##      32.9438       -0.2625       -1.5063

We use the new ( | ) notation to specify random effects. To the right of the | you give the factor identifying individuals drawn from a population. To the left of the | you give the predictors whose coefficients may vary between individuals. In the example above we are saying that the intercept may vary between individuals (here, batches of resin).

The mean about which individuals vary must also be included as a fixed effect in the model! Here an intercept term is implicitly included in the model as usual. However, if your random effect specifies a slope or some such, it also needs to be given as a fixed effect; otherwise you would be saying the slope varies around zero.
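To make this concrete, here is a minimal sketch of a random-slope model using simulated data (the names sim, response, time, and subject are invented purely for illustration):

set.seed(1)
sim <- tibble(
    subject = factor(rep(1:8, each=6)),
    time = rep(0:5, times=8))

# Each subject gets its own intercept and slope, drawn from a population.
subject_intercept <- rnorm(8, mean=10, sd=2)
subject_slope <- rnorm(8, mean=1.5, sd=0.5)
idx <- as.integer(sim$subject)
sim$response <- subject_intercept[idx] + subject_slope[idx]*sim$time +
    rnorm(nrow(sim), sd=1)

# time appears both as a fixed effect (the population-average slope) and to
# the left of the |, so each subject's slope varies around that average
# rather than around zero.
lmer(response ~ time + (time | subject), data=sim)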

A standard deviation has been estimated for the effect of different batches of resin on particle size, as well as the usual residual standard deviation. This is an extra source of random variation around the prediction from the fixed effect part of the model.
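As a small aside, these standard deviations can also be extracted programmatically using VarCorr from lme4:

# Random effect and residual standard deviations, as printed above.
# A brand new observation from a brand new batch of resin should vary
# around the fixed-effect prediction with standard deviation of roughly
# sqrt(2.558^2 + 1.145^2), i.e. about 2.8.
VarCorr(pvcfit_mixed)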

drop1 from lmerTest can be used to test whether fixed effects are statistically significant.

drop1(pvcfit_mixed, ddf="Kenward-Roger")
## Single term deletions using Kenward-Roger's method:
## 
## Model:
## psize ~ operator + (1 | resin)
##          Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)   
## operator 20.718  10.359     2    38   7.902 0.00135 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
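As noted earlier, the Satterthwaite method should give a very similar result here, and can be requested in the same way (a quick check; output not shown):

# Same F test using Satterthwaite degrees of freedom (lmerTest's default).
drop1(pvcfit_mixed, ddf="Satterthwaite")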

ranova from lmerTest can be used to test whether random effects are statistically significant.

ranova(pvcfit_mixed)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## psize ~ operator + (1 | resin)
##             npar   logLik    AIC    LRT Df Pr(>Chisq)    
## <none>         5  -86.115 182.23                         
## (1 | resin)    4 -113.096 234.19 53.961  1  2.045e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

emmeans can be used on the fixed effects of the model.

result <- emmeans(pvcfit_mixed, pairwise ~ operator, lmer.df="Kenward-Roger")
result$emmeans
##  operator emmean    SE   df lower.CL upper.CL
##  Alice      32.9 0.949 7.93     30.8     35.1
##  Bob        32.7 0.949 7.93     30.5     34.9
##  Carl       31.4 0.949 7.93     29.2     33.6
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95
confint(result$contrasts)
##  contrast     estimate    SE df lower.CL upper.CL
##  Alice - Bob     0.263 0.405 38   -0.725     1.25
##  Alice - Carl    1.506 0.405 38    0.519     2.49
##  Bob - Carl      1.244 0.405 38    0.257     2.23
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates

For comparison, here are similar results from the fixed effects model:

pvcfit_fixed <- lm(psize ~ operator + resin, data=pvc)

drop1(pvcfit_fixed, test="F")
## Single term deletions
## 
## Model:
## psize ~ operator + resin
##          Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>                 49.82 21.782                      
## operator  2    20.718  70.53 34.474   7.902   0.00135 ** 
## resin     7   283.946 333.76 99.083  30.943 8.111e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
result_fixed <- emmeans(pvcfit_fixed, pairwise ~ operator)
result_fixed$emmeans
##  operator emmean    SE df lower.CL upper.CL
##  Alice      32.9 0.286 38     32.4     33.5
##  Bob        32.7 0.286 38     32.1     33.3
##  Carl       31.4 0.286 38     30.9     32.0
## 
## Results are averaged over the levels of: resin 
## Confidence level used: 0.95
confint(result_fixed$contrasts)
##  contrast     estimate    SE df lower.CL upper.CL
##  Alice - Bob     0.263 0.405 38   -0.725     1.25
##  Alice - Carl    1.506 0.405 38    0.519     2.49
##  Bob - Carl      1.244 0.405 38    0.257     2.23
## 
## Results are averaged over the levels of: resin 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates

The simpler fixed effects model can be a useful check that you are fitting the mixed effects model you intend. The results will usually be similar, and any differences may be instructive. A fixed effects model can be obtained either by ignoring the random effects entirely or by including them as fixed effects (as we have done here with resin). In this example we see:

  • The pairwise comparisons are identical.

  • CIs for the means for each operator are wider for the mixed model. This is because they are an estimate of the mean for the whole population of resins. For the fixed effects model, we get the CI for the mean of exactly the eight resins present in the data.

9 Appendix B - more on limma

9.1 Empirical Bayes variance squeezing

In limma, Empirical Bayes squeezing of the residual variance acts as though we have some number of extra “prior” observations of the variance. These are also counted as extra degrees of freedom in F tests. The “prior” observations act to squeeze the estimated residual variance toward a trend line that is a function of the average expression level.

efit <- eBayes(cfit, trend=TRUE)

efit$df.prior
## [1] 3.36434
efit$df.residual
##  [1] 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
##  [ reached getOption("max.print") -- omitted 17190 entries ]
efit$df.total
##  [1] 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434
##  [9] 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434
## [17] 15.36434 15.36434 15.36434 15.36434 15.36434
##  [ reached getOption("max.print") -- omitted 17190 entries ]
plotSA(efit)
points(efit$Amean, efit$s2.post^0.25, col="red", cex=0.2)

The total effective degrees of freedom is the “prior” degrees of freedom plus the normal residual degrees of freedom. As can be seen in the plot, compared to the residual variance (black dots), this produces a posterior residual variance (efit$s2.post, red dots) that is squeezed toward the trend line.
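The squeezing itself is just a degrees-of-freedom weighted average. A sketch of the calculation, which should reproduce efit$s2.post when the prior df is finite (as it is here):

# Weighted average of each gene's own residual variance (sigma^2) and the
# trend-line "prior" variance (s2.prior), weighted by degrees of freedom.
s2_manual <-
    (efit$df.prior * efit$s2.prior + efit$df.residual * efit$sigma^2) /
    (efit$df.prior + efit$df.residual)

all.equal(unname(s2_manual), unname(efit$s2.post))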

It’s worthwhile checking df.prior when using limma, as a low value may indicate a problem with the dataset.

9.2 Top confident effect sizes

A little self-promotion: with limma we are able to find genes where our effect of interest is significantly different from zero. However, we may make many thousands of discoveries, too many to easily follow up, and some of the effects discovered may be tiny. I have developed a method to rank genes by confident effect size, with multiple testing correction, implemented in the package topconfects. This method builds on the treat method provided by limma.

This is performed with the limma_confects function. limma_confects incorporates the Empirical Bayes variance squeezing step, so remember to specify trend=TRUE if using trend-based variance squeezing.

library(topconfects)

result <- limma_confects(cfit, "average_slope", trend=TRUE, fdr=0.05)
result
## $table
##    confect effect AveExpr name   
## 1  -1.411  -2.655  1.555  Hbb-y  
## 2   1.356   2.011  4.139  Mmp13  
## 3   1.318   2.302 -1.784  Scgn   
## 4   1.290   1.796  2.480  Ecel1  
## 5   1.248   1.843 -1.503  Sdr16c6
## 6   1.233   2.095  1.197  Tbx22  
## 7  -1.201  -1.701  0.688  Hba-x  
## 8   1.171   2.015  3.501  Igfbpl1
## 9   1.075   1.706  5.724  Vwde   
## 10  1.075   1.694  3.209  Adgrf4 
## ...
## 6788 of 17211 non-zero log2 fold change at FDR 0.05
## Prior df 3.4

Results are ranked by “confect”, confident effect size. If you select genes with abs(confect) >= some value, those genes will have a magnitude of effect greater than that threshold, with controlled FDR. (Exactly the same set of genes is found using treat with this threshold.)
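As a sketch, this equivalence could be checked for a threshold of 1 as follows (assuming, as above, that cfit contains the "average_slope" coefficient and that limma_confects was run with fdr=0.05):

# treat tests each gene against the threshold rather than against zero.
tfit <- treat(cfit, lfc=1, trend=TRUE)
treat_hits <- rownames(topTreat(tfit, coef="average_slope", n=Inf, p.value=0.05))

# Genes whose confident effect size is at least 1 in magnitude.
confect_hits <- filter(result$table, abs(confect) >= 1)$name

setequal(treat_hits, confect_hits)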

Compared to the “most significant” gene, the top gene by topconfects has a larger slope (but lower overall average expression).

hbby <- log2_cpms["Hbb-y",]
hbby_fit <- lm(hbby ~ tooth * day, data=teeth)
look(hbby, hbby_fit)

# Highlight genes with slope confidently larger than 0.2 per day.
ggplot(result$table, aes(x=AveExpr, y=effect)) +
    geom_point(size=0.1) +
    geom_point(
        data=filter(result$table, abs(confect) >= 0.2), 
        size=0.5, color="red")

See also “False Coverage-statement Rate” for a generally applicable approach to multiple-testing corrected confidence intervals.


sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## time zone: Australia/Melbourne
## tzcode source: internal
## 
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] topconfects_1.18.0 lmerTest_3.1-3     lme4_1.1-35.3      Matrix_1.6-5      
##  [5] lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4       
##  [9] purrr_1.0.2        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1      
## [13] ggplot2_3.5.0      tidyverse_2.0.0    edgeR_4.0.16       limma_3.58.1      
## [17] emmeans_1.10.1     multcomp_1.4-25    TH.data_1.1-2      survival_3.5-8    
## [21] mvtnorm_1.2-4      MASS_7.3-60.0.1   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4        xfun_0.43           bslib_0.7.0        
##  [4] lattice_0.22-6      numDeriv_2016.8-1.1 tzdb_0.4.0         
##  [7] vctrs_0.6.5         tools_4.3.3         generics_0.1.3     
## [10] pbkrtest_0.5.2      parallel_4.3.3      sandwich_3.1-0     
## [13] fansi_1.0.6         highr_0.10          pkgconfig_2.0.3    
## [16] assertthat_0.2.1    lifecycle_1.0.4     farver_2.1.1       
## [19] compiler_4.3.3      statmod_1.5.0       munsell_0.5.1      
## [22] codetools_0.2-20    htmltools_0.5.8.1   sass_0.4.9         
## [25] yaml_2.3.8          nloptr_2.0.3        pillar_1.9.0       
## [28] crayon_1.5.2        jquerylib_0.1.4     cachem_1.0.8       
## [31] boot_1.3-30         nlme_3.1-164        tidyselect_1.2.1   
## [34] locfit_1.5-9.9      digest_0.6.35       stringi_1.8.3      
## [37] labeling_0.4.3      fastmap_1.1.1       grid_4.3.3         
## [40] colorspace_2.1-0    cli_3.6.2           magrittr_2.0.3     
## [43] utf8_1.2.4          broom_1.0.5         withr_3.0.0        
## [46] backports_1.4.1     scales_1.3.0        bit64_4.0.5        
## [49] estimability_1.5    timechange_0.3.0    rmarkdown_2.26     
## [52] bit_4.0.5           zoo_1.8-12          hms_1.1.3          
## [55] coda_0.19-4.1       evaluate_0.23       knitr_1.46         
## [58] mgcv_1.9-1          rlang_1.1.3         Rcpp_1.0.12        
## [61] xtable_1.8-4        glue_1.7.0          minqa_1.2.6        
## [64] vroom_1.6.5         jsonlite_1.8.8      R6_2.5.1
