---
title: "glmmSeq"
author: "Myles Lewis, Katriona Goldmann, Elisabetta Sciacca, Cankut Cubuk, Anna Surace"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_float: true
fig_width: 8
fig_height: 5
vignette: >
%\VignetteIndexEntry{glmmSeq}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r setup, include=FALSE}
options(width=88)
library(kableExtra)
```
[](https://lifecycle.r-lib.org/articles/stages.html#stable)
[](https://choosealicense.com/licenses/mit/)
[](https://CRAN.R-project.org/package=glmmSeq)
[](https://CRAN.R-project.org/package=glmmSeq)
`r paste0("[![", Sys.Date(),"]","(",paste0("https://img.shields.io/badge/last%20git%20commit-", gsub('-', '--', Sys.Date()),"-turquoise.svg"), ")]","(",'https://github.com/myles-lewis/glmmSeq/blob/master/NEWS.md',")")`
[](https://GitHub.com/myles-lewis/glmmSeq/issues/)
[](https://github.com/myles-lewis/glmmSeq)
# glmmSeq
The aim of this package is to model gene expression with a general linear mixed
model (GLMM) as described in the R4RA study [1]. The most widely used mainstream
differential gene expression analysis tools (e.g
[Limma](https://doi.org/10.1093/nar/gkv007),
[DESeq2](https://doi.org/10.1186/s13059-014-0550-8),
[edgeR](https://doi.org/10.1093/bioinformatics/btp616)) are all unable to fit
mixed effects linear models. This package however fits negative binomial mixed
effects models at individual gene level using the `negative.binomial` function
from `MASS` and the `glmer` function in
[`lme4`](https://CRAN.R-project.org/package=lme4) which enables random effect,
as as well as mixed effects, to be modelled.
### Installing from CRAN
```{r, eval=FALSE}
install.packages("glmmSeq")
```
### Installing from Github
```{r, eval=FALSE}
devtools::install_github("myles-lewis/glmmSeq")
```
### Installing Locally
Or you can source the functions individually:
```{r, eval=FALSE}
functions = list.files("./R", full.names = TRUE)
invisible(lapply(functions, source))
```
But you will need to load in the additional libraries:
```{r, eval=FALSE}
# Install CRAN packages
invisible(lapply(c("MASS", "car", "ggplot2", "ggpubr", "lme4","lmerTest",
"methods", "parallel", "plotly", "pbapply", "pbmcapply"),
function(p){
if(! p %in% rownames(installed.packages())) {
install.packages(p)
}
library(p, character.only=TRUE)
}))
# Install BioConductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
invisible(lapply(c("qvalue"), function(p){
if(! p %in% rownames(installed.packages())) BiocManager::install(p)
library(p, character.only=TRUE)
}))
```
# Overview
To get started, first we load in the package:
```{r, message=FALSE, warning=FALSE}
library(glmmSeq)
set.seed(1234)
```
This vignette will demonstrate the power of this package using a minimal example
from the [PEAC data set](https://peac.hpc.qmul.ac.uk/). Here we focus on the
synovial biopsy RNA-Seq data from this cohort of patients with early rheumatoid
arthritis.
```{r}
data(PEAC_minimal_load)
```
This data contains:
- metadata: which describes each sample. Including patient ID, sample time-point,
and six-month EULAR response. Where
[EULAR](https://www.das-score.nl/en/das-and-das28/das28-why/eular-response-criteria)
is a rheumatoid arthritis response metric based on composite
[DAS28 scores](https://www.das-score.nl/en/).
- tpm: the transcript per million RNA-seq count data
These are outlined in the following subsections.
## Metadata
```{r}
metadata$EULAR_binary = NA
metadata$EULAR_binary[metadata$EULAR_6m %in%
c("Good", "Moderate" )] = "responder"
metadata$EULAR_binary[metadata$EULAR_6m %in% c("Non-response")] = "non_responder"
kable(head(metadata), row.names = F) %>% kable_styling()
```
## Count data
```{r}
kable(head(tpm)) %>% kable_styling() %>%
scroll_box(width = "100%")
```
## Dispersion
Using negative binomial models requires gene dispersion estimates to be made.
This can be achieved in a number of ways. A common way to calculate this for
gene _i_ is to use the equation:
Dispersioni = (variancei - meani)/meani2
This can be calculated using:
```{r}
disp <- apply(tpm, 1, function(x){
(var(x, na.rm=TRUE)-mean(x, na.rm=TRUE))/(mean(x, na.rm=TRUE)**2)
})
head(disp)
```
Alternatively, we recommend using _edgeR_ to estimate of the common, trended and
tagwise dispersions across all tags:
```{r, eval=FALSE}
disp <- setNames(edgeR::estimateDisp(tpm)$tagwise.dispersion, rownames(tpm))
head(disp)
```
or with _DESeq2_ using the raw counts:
```{r, eval=FALSE}
dds <- DESeqDataSetFromTximport(txi = txi, colData = metadata, design = ~ 1)
dds <- DESeq(dds)
dispersions <- setNames(dispersions(dds), rownames(txi$counts))
```
## Size Factors
There is also an option to include size factors for each gene. Again this can be
estimated using:
```{r}
sizeFactors <- colSums(tpm)
sizeFactors <- sizeFactors / mean(sizeFactors) # normalise to mean = 1
head(sizeFactors)
```
Or using edgeR these can be calculated from the raw read counts:
```{r, eval=FALSE}
sizeFactors <- calcNormFactors(counts, method="TMM")
```
Similarly, with DESeq2:
```{r, eval=FALSE}
sizeFactors <- estimateSizeFactorsForMatrix(counts)
```
Note the `sizeFactors` vector needs to be centred around 1, since it used
directly as an offset of form `log(sizeFactors)` in the GLMM model.
# Fitting Models
To fit a model for one gene over time we use a formula such as:
> gene expression ~ fixed effects + random effects
In R the formula is defined by both the fixed-effects and random-effects part
of the model, with the response on the left of a ~ operator and the
terms, separated by + operators, on the right. Random-effects terms are
distinguished by vertical bars ("|") separating expressions for design matrices
from grouping factors. For more information see the `?lme4::glmer`.
In this case study we want to use time and response as fixed effects and the
patients as random effects:
gene expression ~ time + response + (1 | patient)
To fit this model for all genes we can use the `glmmSeq` function. Note that
this analysis can take some time, with 4 cores:
- 1000 genes takes about 10 seconds
- 20000 genes takes about 4 mins
```{r, warning=FALSE}
results <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
countdata = tpm,
metadata = metadata,
dispersion = disp,
progress = TRUE)
```
or alternatively using two-factor classification with EULAR_binary:
```{r, warning=FALSE}
results2 <- glmmSeq(~ Timepoint * EULAR_binary + (1 | PATID),
countdata = tpm,
metadata = metadata,
dispersion = disp)
```
## Outputs
This creates a GlmmSeq object which contains the following slots:
```{r}
names(attributes(results))
```
The variables used by the model are in the `@modeldata`:
```{r}
kable(results@modelData) %>% kable_styling()
```
The model fit statistics can be viewed in the `@stats` slot which is a list of
items including fitted model coefficients, their standard errors and the results
of statistical tests on terms within the model using Wald type 2 Chi square. To
see the most significant interactions we can order `pvals` by
`Timepoint:EULAR_6m`:
```{r}
stats <- summary(results)
kable(stats[order(stats[, 'P_Timepoint:EULAR_6m']), ]) %>%
kable_styling() %>%
scroll_box(width = "100%", height = "400px")
```
Summary statistics on a single gene can be shown specifying the argument `gene`.
```{r}
summary(results, gene = "MS4A1")
```
Estimated means based on each gene's fitted model to show fixed effects and
their 95% confidence intervals can be seen in the `@predict` slot:
```{r}
predict = data.frame(results@predict)
kable(predict) %>%
kable_styling() %>%
scroll_box(width = "100%", height = "400px")
```
## Q-values
The q-values from each of the p-value columns can be calculated using
`glmmQvals`. This will print a significance table based on the cut-off (default
p=0.05) and add a matrix named `qvals` to the `@stats` slot:
```{r}
results <- glmmQvals(results)
```
# glmmTMB models
The `glmmTMB` package is an alternative to `lme4` for fitting negative binomial
GLMM models. Note, that this package optimises the dispersion parameter for each
GLMM model. This has the advantage that a predefined list of dispersions for
each gene is not required. But it also means that model fitting is significantly
slower than a negative binomial GLMM fit by `lme4::glmer` with family function
`MASS::negative.binomial` for which the dispersion parameter must be
pre-specified. Each `glmmTMB` model takes about 0.8 seconds, so that even with 8
cores, a typical transcriptome of 20,000 genes would take about 30 minutes,
although this will vary depending on the complexity of the design formula. In
comparison, the standard `glmer` pipeline with known dispersions supplied takes
about 2 minutes on 8 cores.
Although `glmmTMB` is slower than `lme::glmer` with known dispersion, `glmmTMB`
is still faster than the equivalent `lme4` function `glmer.nb` when the
dispersion is unknown.
Setting `method = "glmmTMB"` when calling `glmmSeq()` switches from using
`lme4::glmer` (the default) to the `glmmTMB` package. The `dispersion` argument
is then no longer required. The `family` argument can be used to specify
different family functions and `glmmTMB` provides `nbinom1` and `nbinom2`
(`glmmSeq` uses the latter by default). Other family functions e.g. `poisson`
can be used.
# Gaussian mixed effects models
An alternative to fitting negative binomial GLMM is to use transformed data and
fit gaussian LMM. The `lmmSeq` function will fit many LMM across genes (or other
data) where random effects/mixed effects models are useful. It is almost 3x
faster than `glmmSeq`. For example, gaussian normalised DNA methylation data can
be analysed using this method. In the example below the RNA-Seq gene expression
data is log transformed so that it is approximately Gaussian. Alternatively
variance stabilising transformation (VST) can be applied to count data through
DESeq2 or the voom transformation in limma voom.
```{r, warning=FALSE}
logtpm <- log2(tpm + 1)
lmmres <- lmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
maindata = logtpm,
metadata = metadata,
progress = TRUE)
summary(lmmres, "MS4A1")
```
# Hypothesis testing
By default, p-values for each model term are computed using Wald type 2
Chi-squared test as per `car::Anova()`. The underlying code for this has been
optimised for speed.
For LMM via `lmmSeq()`, an alternative to the Wald type 2 Chi-squared test is an
F-test with Satterthwaite denominator degrees of freedom, which has been
implemented using the `lmerTest` package and is enabled using `test.stat = "F"`.
This is not available for GLMM.
However, if a reduced model formula is specified by setting `reduced`, then a
likelihood ratio test is performed instead using `anova`. This will double
computation time since two (G)LMM have to be fitted for each gene. For further
information on inference testing on GLMM, we recommend
[Ben Bolker's GLMM FAQ page](https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html).
```{r, warning=FALSE}
glmmLRT <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
reduced = ~ Timepoint + EULAR_6m + (1 | PATID),
countdata = tpm,
metadata = metadata,
dispersion = disp, verbose = FALSE)
summary(glmmLRT, "MS4A1")
```
# Extracting individual Genes
Similarly you can run the script for an individual gene (make sure you use
`drop = FALSE` to maintain `countdata` as a matrix).
```{r, warning=FALSE}
MS4A1glmm <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
countdata = tpm["MS4A1", , drop = FALSE],
metadata = metadata,
dispersion = disp,
verbose = FALSE)
```
The (g)lmer or glmmTMB fitted object for a single gene can be obtained using the
function `glmmRefit()`:
```{r, warning=FALSE}
fit <- glmmRefit(results, gene = "MS4A1")
fit
```
These (g)lmer objects can be passed to other packages notably `emmeans` for
visualising estimated marginal means. Note these are based on the model, not
directly on the data.
```{r, warning=FALSE}
library(emmeans)
emmeans(fit, ~ Timepoint | EULAR_6m)
emmip(fit, ~ Timepoint | EULAR_6m)
```
## Refitting a different model
`glmmRefit()` allows a different model to be fitted using the original data. The
example below shows how to refit the model without the interaction term and then
perform a likelihood ratio test using `anova`. Note for `glmmSeq()` objects the
LHS of the reduced formula must be `"count"`, while for `lmmSeq()` objects the
LHS must be `"gene"`. For glmmTMB analyses, the GLM family can also be changed.
```{r, warning=FALSE}
fit2 <- glmmRefit(results, gene = "MS4A1",
formula = count ~ Timepoint + EULAR_6m + (1 | PATID))
anova(fit, fit2)
```
# Model Plots
For variables such as time, which are matched according to an ID (the random
effect), we can examine the fitted model using plots which show estimated means
and confidence intervals based on coefficients for the fitted regression model,
overlaid upon the underlying data. In this case the samples are matched
longitudinally over time.
Plots can be viewed using either ggplot or base graphics. We can start looking
at the gene with the most significant interaction _IGHV3-23_:
```{r, fig.height=6, warning=FALSE}
plotColours <- c("skyblue", "goldenrod1", "mediumseagreen")
modColours <- c("dodgerblue3", "goldenrod3", "seagreen4")
shapes <- c(17, 19, 18)
ggmodelPlot(results,
geneName = "IGHV3-23",
x1var = "Timepoint",
x2var="EULAR_6m",
xlab="Time",
colours = plotColours,
shapes = shapes,
lineColours = plotColours,
modelColours = modColours,
modelSize = 10)
```
Alternatively plots can be generated using base graphics, here with or without
the model fit overlaid. By default p-value labels are taken from the column
names of the `pvals` object in the `stats` slot of the S4 result object. These
can be relabelled using the `plab` argument.
```{r, fig.height=6, warning=FALSE}
oldpar <- par(mfrow=c(1, 2))
modelPlot(results2,
geneName = "FGF14",
x1var = "Timepoint",
x2var="EULAR_binary",
fontSize=0.65,
colours=c("coral", "mediumseagreen"),
modelColours = c("coral", "mediumseagreen"),
modelLineColours = "black",
modelSize = 2)
modelPlot(results,
geneName = "EMILIN3",
x1var = "Timepoint",
x2var = "EULAR_6m",
colours = plotColours,
plab = c("time", "response", "time:response"),
addModel = FALSE)
par(oldpar)
```
To plot the model fits alone set `addPoints = FALSE`.
```{r, message=FALSE}
library(ggpubr)
p1 <- ggmodelPlot(results,
"ADAM12",
x1var="Timepoint",
x2var="EULAR_6m",
xlab="Time",
addPoints = FALSE,
colours = plotColours)
p2 <- ggmodelPlot(results,
"EMILIN3",
x1var="Timepoint",
x2var="EULAR_6m",
xlab="Time",
fontSize=8,
x2Offset=1,
addPoints = FALSE,
colours = plotColours)
ggarrange(p1, p2, ncol=2, common.legend = T, legend="bottom")
```
## Fold change plots
The comparative fold change (for `x1var` variables) between conditions (`x2var`
and `x2Values` variables) can be plotted using an `fcPlot` for all genes to
highlight significance. This type of plot is most suited to look for interaction
between time (`x1var`) and a two-level factor (`x2var`), looking at change
between two timepoints. In the example below from the R4RA study [1], gene
expression pre- and post-drug treatment is compared between two drugs (rituximab
& tocilizumab), using the design formula `gene_counts ~ time * drug`. By setting
`graphics = "plotly"` this can be viewed interactively.
```{r, eval = FALSE}
r4ra_glmm <- glmmSeq(~ time * drug + (1 | Patient_ID),
countdata = tpmdata, metadata,
dispersion = dispersions, cores = 8, removeSingles = T)
r4ra_glmm <- glmmQvals(r4ra_glmm)
labels = c(..) # Genes to label
fcPlot(r4ra_glmm, x1var = "time", x2var = "drug", graphics = "plotly",
pCutoff = 0.05, useAdjusted = TRUE,
labels = labels,
colours = c('grey', 'green3', 'gold3', 'blue'))
```
```{r fcplot, echo = FALSE, message=FALSE, fig.align='center', out.width='80%', out.extra='style="border: 0;"'}
knitr::include_graphics("r4ra_glmm_fcplot.png")
```
Log2 fold change between the two time points for individuals treated with
rituximab on the x axis and individuals treated with tocilizumab on the y axis
with each point representing a gene. Genes showing an interaction between time
and drug are coloured blue or gold depending on whether their fold change is
greater post-rituximab (blue) or post-tocilizumab (gold). Genes without
interaction, but changing significantly over time are coloured green and tend to
lie along the line of identity. See the Longitudinal tab in
for an interactive version of the above plot.
## MA plots
An MA plot is an application of a Bland–Altman plot. The plot visualizes the
differences between measurements taken in two samples, by transforming the data
onto M (log ratio) and A (mean average) scales, then plotting these values.
```{r, fig.height=8}
labels = c('MS4A1', 'FGF14', 'IL2RG', 'IGHV3-23', 'ADAM12', 'IL36G',
'BLK', 'SAA1', 'CILP', 'EMILIN3', 'EMILIN2', 'IGHJ6',
'CXCL9', 'CXCL13')
maPlots <- maPlot(results,
x1var="Timepoint",
x2var="EULAR_6m",
x2Values=c("Good", "Non-response"),
colours=c('grey', 'midnightblue',
'mediumseagreen', 'goldenrod'),
labels=labels,
graphics="ggplot")
maPlots$combined
```
# Troubleshooting
Mixed effects models are tricky to fit and `lme4::glmer` sometimes returns
errors. In fact, a significant amount of code in `glmmSeq` is devoted to
catching and handling errors to allow parallelisation to continue. Errors in
genes are stored in the `errors` slot.
```{r}
results@errors[1] # first gene error
```
Sometimes `glmmSeq` returns errors in all genes. This usually means a problem
with not enough samples in each timepoint or a mistake in the formula. Since
version 0.5.5 `glmmSeq` now returns a vector of the error messages for all
genes, which can be useful for debugging. In the example below, only timepoint 0
is specified.
```{r, error=TRUE}
results3 <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
countdata = tpm[, metadata$Timepoint == 0],
metadata = metadata[metadata$Timepoint == 0, ],
dispersion = disp)
```
If there are errors which are not caught by the error checking core mechanism,
this can lead to problems with grouping results after the core models have been
fit. Setting `returnList=TRUE` when calling `glmmSeq` returns the list output
direct from `mclapply` (or `parLapply` on windows). This can be helpful for
debugging unforeseeen problems in the core loop.
# Citing glmmSeq
glmmSeq was developed by the bioinformatics team at the [Experimental Medicine & Rheumatology department](https://www.qmul.ac.uk/whri/emr/) and [Centre for Translational Bioinformatics](https://www.qmul.ac.uk/c4tb/) at Queen Mary University London.
If you use this package please cite as:
```{r, warning=FALSE}
citation("glmmSeq")
```
# References
1. Felice Rivellese, Anna Surace, Katriona Goldmann, Elisabetta Sciacca, Cankut Cubuk, Giovanni Giorli, ... Michael Barnes, Myles J. Lewis, Costantino Pitzalis, R4RA collaborative group. Rituximab versus tocilizumab in rheumatoid arthritis: synovial biopsy-based biomarker analysis of the phase 4 R4RA randomized trial. Nature medicine 2022; 28(6): 1256-68. [doi:10.1038/s41591-022-01789-0](https://doi.org/10.1038/s41591-022-01789-0)
Statistical software used in this package:
2. [lme4](https://dx.doi.org/10.18637/jss.v067.i01): Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. doi: 10.18637/jss.v067.i01.
3. [car](https://socialsciences.mcmaster.ca/jfox/Books/Companion/): John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
4. [MASS](https://www.stats.ox.ac.uk/pub/MASS4/VR4stat.pdf): Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
5. [glmmTMB](https://CRAN.R-project.org/package=glmmTMB): Mollie Brooks, Ben Bolker, Kasper Kristensen, Martin Maechler, Arni Magnusson (2022). glmmTMB: Generalized Linear Mixed Models using Template Model Builder
6. [qvalue](https://github.com/StoreyLab/qvalue): John D. Storey, Andrew J. Bass, Alan Dabney and David Robinson (2020). qvalue: Q-value estimation for false discovery rate control. R package version 2.22.0. https://github.com/StoreyLab/qvalue
7. [lmerTest](https://CRAN.R-project.org/package=lmerTest): Alexandra Kuznetsova, Per Brockhoff, Rune Christensen, Sofie Jensen. lmerTest: Tests in Linear Mixed Effects Models
8. [emmeans](https://CRAN.R-project.org/package=emmeans): Russell V. Lenth, Paul Buerkner, Maxime Herve, Jonathon Love, Fernando Miguez, Hannes Riebl, Henrik Singmann. emmeans: Estimated Marginal Means, aka Least-Squares Means