--- title: "glmmSeq" author: "Myles Lewis, Katriona Goldmann, Elisabetta Sciacca, Cankut Cubuk, Anna Surace" date: "`r Sys.Date()`" output: html_document: toc: true toc_float: true fig_width: 8 fig_height: 5 vignette: > %\VignetteIndexEntry{glmmSeq} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} options(width=88) library(kableExtra) ``` [![Lifecycle: Maturing](https://img.shields.io/badge/lifecycle-stable-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable) [![License: MIT](https://img.shields.io/badge/License-MIT-mediumpurple.svg)](https://choosealicense.com/licenses/mit/) [![CRAN status](https://www.r-pkg.org/badges/version/glmmSeq)](https://CRAN.R-project.org/package=glmmSeq) [![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/glmmSeq?color=orange)](https://CRAN.R-project.org/package=glmmSeq) `r paste0("[![", Sys.Date(),"]","(",paste0("https://img.shields.io/badge/last%20git%20commit-", gsub('-', '--', Sys.Date()),"-turquoise.svg"), ")]","(",'https://github.com/myles-lewis/glmmSeq/blob/master/NEWS.md',")")` [![GitHub issues](https://img.shields.io/github/issues/myles-lewis/glmmSeq.svg)](https://GitHub.com/myles-lewis/glmmSeq/issues/) [![Travis](https://img.shields.io/travis/myles-lewis/glmmSeq.svg)](https://github.com/myles-lewis/glmmSeq) # glmmSeq

The aim of this package is to model gene expression with a general linear mixed model (GLMM) as described in the R4RA study [1]. The most widely used mainstream differential gene expression analysis tools (e.g [Limma](https://doi.org/10.1093/nar/gkv007), [DESeq2](https://doi.org/10.1186/s13059-014-0550-8), [edgeR](https://doi.org/10.1093/bioinformatics/btp616)) are all unable to fit mixed effects linear models. This package however fits negative binomial mixed effects models at individual gene level using the `negative.binomial` function from `MASS` and the `glmer` function in [`lme4`](https://CRAN.R-project.org/package=lme4) which enables random effect, as as well as mixed effects, to be modelled. ### Installing from CRAN ```{r, eval=FALSE} install.packages("glmmSeq") ``` ### Installing from Github ```{r, eval=FALSE} devtools::install_github("myles-lewis/glmmSeq") ``` ### Installing Locally Or you can source the functions individually: ```{r, eval=FALSE} functions = list.files("./R", full.names = TRUE) invisible(lapply(functions, source)) ``` But you will need to load in the additional libraries: ```{r, eval=FALSE} # Install CRAN packages invisible(lapply(c("MASS", "car", "ggplot2", "ggpubr", "lme4","lmerTest", "methods", "parallel", "plotly", "pbapply", "pbmcapply"), function(p){ if(! p %in% rownames(installed.packages())) { install.packages(p) } library(p, character.only=TRUE) })) # Install BioConductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") invisible(lapply(c("qvalue"), function(p){ if(! p %in% rownames(installed.packages())) BiocManager::install(p) library(p, character.only=TRUE) })) ``` # Overview To get started, first we load in the package: ```{r, message=FALSE, warning=FALSE} library(glmmSeq) set.seed(1234) ``` This vignette will demonstrate the power of this package using a minimal example from the [PEAC data set](https://peac.hpc.qmul.ac.uk/). Here we focus on the synovial biopsy RNA-Seq data from this cohort of patients with early rheumatoid arthritis. ```{r} data(PEAC_minimal_load) ``` This data contains: - metadata: which describes each sample. Including patient ID, sample time-point, and six-month EULAR response. Where [EULAR](https://www.das-score.nl/en/das-and-das28/das28-why/eular-response-criteria) is a rheumatoid arthritis response metric based on composite [DAS28 scores](https://www.das-score.nl/en/). - tpm: the transcript per million RNA-seq count data These are outlined in the following subsections. ## Metadata ```{r} metadata$EULAR_binary = NA metadata$EULAR_binary[metadata$EULAR_6m %in% c("Good", "Moderate" )] = "responder" metadata$EULAR_binary[metadata$EULAR_6m %in% c("Non-response")] = "non_responder" kable(head(metadata), row.names = F) %>% kable_styling() ``` ## Count data ```{r} kable(head(tpm)) %>% kable_styling() %>% scroll_box(width = "100%") ``` ## Dispersion Using negative binomial models requires gene dispersion estimates to be made. This can be achieved in a number of ways. A common way to calculate this for gene _i_ is to use the equation:

Dispersion_i = (variance_i - mean_i)/mean_i²

This can be calculated using: ```{r} disp <- apply(tpm, 1, function(x){ (var(x, na.rm=TRUE)-mean(x, na.rm=TRUE))/(mean(x, na.rm=TRUE)**2) }) head(disp) ``` Alternatively, we recommend using _edgeR_ to estimate of the common, trended and tagwise dispersions across all tags: ```{r, eval=FALSE} disp <- setNames(edgeR::estimateDisp(tpm)$tagwise.dispersion, rownames(tpm)) head(disp) ``` or with _DESeq2_ using the raw counts: ```{r, eval=FALSE} dds <- DESeqDataSetFromTximport(txi = txi, colData = metadata, design = ~ 1) dds <- DESeq(dds) dispersions <- setNames(dispersions(dds), rownames(txi$counts)) ``` ## Size Factors There is also an option to include size factors for each gene. Again this can be estimated using: ```{r} sizeFactors <- colSums(tpm) sizeFactors <- sizeFactors / mean(sizeFactors) # normalise to mean = 1 head(sizeFactors) ``` Or using edgeR these can be calculated from the raw read counts: ```{r, eval=FALSE} sizeFactors <- calcNormFactors(counts, method="TMM") ``` Similarly, with DESeq2: ```{r, eval=FALSE} sizeFactors <- estimateSizeFactorsForMatrix(counts) ``` Note the `sizeFactors` vector needs to be centred around 1, since it used directly as an offset of form `log(sizeFactors)` in the GLMM model. # Fitting Models To fit a model for one gene over time we use a formula such as: > gene expression ~ fixed effects + random effects In R the formula is defined by both the fixed-effects and random-effects part of the model, with the response on the left of a ~ operator and the terms, separated by + operators, on the right. Random-effects terms are distinguished by vertical bars ("|") separating expressions for design matrices from grouping factors. For more information see the `?lme4::glmer`. In this case study we want to use time and response as fixed effects and the patients as random effects:

gene expression ~ time + response + (1 | patient)

To fit this model for all genes we can use the `glmmSeq` function. Note that this analysis can take some time, with 4 cores: - 1000 genes takes about 10 seconds - 20000 genes takes about 4 mins ```{r, warning=FALSE} results <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), countdata = tpm, metadata = metadata, dispersion = disp, progress = TRUE) ``` or alternatively using two-factor classification with EULAR_binary: ```{r, warning=FALSE} results2 <- glmmSeq(~ Timepoint * EULAR_binary + (1 | PATID), countdata = tpm, metadata = metadata, dispersion = disp) ``` ## Outputs This creates a GlmmSeq object which contains the following slots: ```{r} names(attributes(results)) ``` The variables used by the model are in the `@modeldata`: ```{r} kable(results@modelData) %>% kable_styling() ``` The model fit statistics can be viewed in the `@stats` slot which is a list of items including fitted model coefficients, their standard errors and the results of statistical tests on terms within the model using Wald type 2 Chi square. To see the most significant interactions we can order `pvals` by `Timepoint:EULAR_6m`: ```{r} stats <- summary(results) kable(stats[order(stats[, 'P_Timepoint:EULAR_6m']), ]) %>% kable_styling() %>% scroll_box(width = "100%", height = "400px") ``` Summary statistics on a single gene can be shown specifying the argument `gene`. ```{r} summary(results, gene = "MS4A1") ``` Estimated means based on each gene's fitted model to show fixed effects and their 95% confidence intervals can be seen in the `@predict` slot: ```{r} predict = data.frame(results@predict) kable(predict) %>% kable_styling() %>% scroll_box(width = "100%", height = "400px") ``` ## Q-values The q-values from each of the p-value columns can be calculated using `glmmQvals`. This will print a significance table based on the cut-off (default p=0.05) and add a matrix named `qvals` to the `@stats` slot: ```{r} results <- glmmQvals(results) ``` # glmmTMB models The `glmmTMB` package is an alternative to `lme4` for fitting negative binomial GLMM models. Note, that this package optimises the dispersion parameter for each GLMM model. This has the advantage that a predefined list of dispersions for each gene is not required. But it also means that model fitting is significantly slower than a negative binomial GLMM fit by `lme4::glmer` with family function `MASS::negative.binomial` for which the dispersion parameter must be pre-specified. Each `glmmTMB` model takes about 0.8 seconds, so that even with 8 cores, a typical transcriptome of 20,000 genes would take about 30 minutes, although this will vary depending on the complexity of the design formula. In comparison, the standard `glmer` pipeline with known dispersions supplied takes about 2 minutes on 8 cores. Although `glmmTMB` is slower than `lme::glmer` with known dispersion, `glmmTMB` is still faster than the equivalent `lme4` function `glmer.nb` when the dispersion is unknown. Setting `method = "glmmTMB"` when calling `glmmSeq()` switches from using `lme4::glmer` (the default) to the `glmmTMB` package. The `dispersion` argument is then no longer required. The `family` argument can be used to specify different family functions and `glmmTMB` provides `nbinom1` and `nbinom2` (`glmmSeq` uses the latter by default). Other family functions e.g. `poisson` can be used. # Gaussian mixed effects models An alternative to fitting negative binomial GLMM is to use transformed data and fit gaussian LMM. The `lmmSeq` function will fit many LMM across genes (or other data) where random effects/mixed effects models are useful. It is almost 3x faster than `glmmSeq`. For example, gaussian normalised DNA methylation data can be analysed using this method. In the example below the RNA-Seq gene expression data is log transformed so that it is approximately Gaussian. Alternatively variance stabilising transformation (VST) can be applied to count data through DESeq2 or the voom transformation in limma voom. ```{r, warning=FALSE} logtpm <- log2(tpm + 1) lmmres <- lmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), maindata = logtpm, metadata = metadata, progress = TRUE) summary(lmmres, "MS4A1") ``` # Hypothesis testing By default, p-values for each model term are computed using Wald type 2 Chi-squared test as per `car::Anova()`. The underlying code for this has been optimised for speed. For LMM via `lmmSeq()`, an alternative to the Wald type 2 Chi-squared test is an F-test with Satterthwaite denominator degrees of freedom, which has been implemented using the `lmerTest` package and is enabled using `test.stat = "F"`. This is not available for GLMM. However, if a reduced model formula is specified by setting `reduced`, then a likelihood ratio test is performed instead using `anova`. This will double computation time since two (G)LMM have to be fitted for each gene. For further information on inference testing on GLMM, we recommend [Ben Bolker's GLMM FAQ page](https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html). ```{r, warning=FALSE} glmmLRT <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), reduced = ~ Timepoint + EULAR_6m + (1 | PATID), countdata = tpm, metadata = metadata, dispersion = disp, verbose = FALSE) summary(glmmLRT, "MS4A1") ``` # Extracting individual Genes Similarly you can run the script for an individual gene (make sure you use `drop = FALSE` to maintain `countdata` as a matrix). ```{r, warning=FALSE} MS4A1glmm <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), countdata = tpm["MS4A1", , drop = FALSE], metadata = metadata, dispersion = disp, verbose = FALSE) ``` The (g)lmer or glmmTMB fitted object for a single gene can be obtained using the function `glmmRefit()`: ```{r, warning=FALSE} fit <- glmmRefit(results, gene = "MS4A1") fit ``` These (g)lmer objects can be passed to other packages notably `emmeans` for visualising estimated marginal means. Note these are based on the model, not directly on the data. ```{r, warning=FALSE} library(emmeans) emmeans(fit, ~ Timepoint | EULAR_6m) emmip(fit, ~ Timepoint | EULAR_6m) ``` ## Refitting a different model `glmmRefit()` allows a different model to be fitted using the original data. The example below shows how to refit the model without the interaction term and then perform a likelihood ratio test using `anova`. Note for `glmmSeq()` objects the LHS of the reduced formula must be `"count"`, while for `lmmSeq()` objects the LHS must be `"gene"`. For glmmTMB analyses, the GLM family can also be changed. ```{r, warning=FALSE} fit2 <- glmmRefit(results, gene = "MS4A1", formula = count ~ Timepoint + EULAR_6m + (1 | PATID)) anova(fit, fit2) ``` # Model Plots For variables such as time, which are matched according to an ID (the random effect), we can examine the fitted model using plots which show estimated means and confidence intervals based on coefficients for the fitted regression model, overlaid upon the underlying data. In this case the samples are matched longitudinally over time. Plots can be viewed using either ggplot or base graphics. We can start looking at the gene with the most significant interaction _IGHV3-23_: ```{r, fig.height=6, warning=FALSE} plotColours <- c("skyblue", "goldenrod1", "mediumseagreen") modColours <- c("dodgerblue3", "goldenrod3", "seagreen4") shapes <- c(17, 19, 18) ggmodelPlot(results, geneName = "IGHV3-23", x1var = "Timepoint", x2var="EULAR_6m", xlab="Time", colours = plotColours, shapes = shapes, lineColours = plotColours, modelColours = modColours, modelSize = 10) ``` Alternatively plots can be generated using base graphics, here with or without the model fit overlaid. By default p-value labels are taken from the column names of the `pvals` object in the `stats` slot of the S4 result object. These can be relabelled using the `plab` argument. ```{r, fig.height=6, warning=FALSE} oldpar <- par(mfrow=c(1, 2)) modelPlot(results2, geneName = "FGF14", x1var = "Timepoint", x2var="EULAR_binary", fontSize=0.65, colours=c("coral", "mediumseagreen"), modelColours = c("coral", "mediumseagreen"), modelLineColours = "black", modelSize = 2) modelPlot(results, geneName = "EMILIN3", x1var = "Timepoint", x2var = "EULAR_6m", colours = plotColours, plab = c("time", "response", "time:response"), addModel = FALSE) par(oldpar) ``` To plot the model fits alone set `addPoints = FALSE`. ```{r, message=FALSE} library(ggpubr) p1 <- ggmodelPlot(results, "ADAM12", x1var="Timepoint", x2var="EULAR_6m", xlab="Time", addPoints = FALSE, colours = plotColours) p2 <- ggmodelPlot(results, "EMILIN3", x1var="Timepoint", x2var="EULAR_6m", xlab="Time", fontSize=8, x2Offset=1, addPoints = FALSE, colours = plotColours) ggarrange(p1, p2, ncol=2, common.legend = T, legend="bottom") ``` ## Fold change plots The comparative fold change (for `x1var` variables) between conditions (`x2var` and `x2Values` variables) can be plotted using an `fcPlot` for all genes to highlight significance. This type of plot is most suited to look for interaction between time (`x1var`) and a two-level factor (`x2var`), looking at change between two timepoints. In the example below from the R4RA study [1], gene expression pre- and post-drug treatment is compared between two drugs (rituximab & tocilizumab), using the design formula `gene_counts ~ time * drug`. By setting `graphics = "plotly"` this can be viewed interactively. ```{r, eval = FALSE} r4ra_glmm <- glmmSeq(~ time * drug + (1 | Patient_ID), countdata = tpmdata, metadata, dispersion = dispersions, cores = 8, removeSingles = T) r4ra_glmm <- glmmQvals(r4ra_glmm) labels = c(..) # Genes to label fcPlot(r4ra_glmm, x1var = "time", x2var = "drug", graphics = "plotly", pCutoff = 0.05, useAdjusted = TRUE, labels = labels, colours = c('grey', 'green3', 'gold3', 'blue')) ``` ```{r fcplot, echo = FALSE, message=FALSE, fig.align='center', out.width='80%', out.extra='style="border: 0;"'} knitr::include_graphics("r4ra_glmm_fcplot.png") ``` Log2 fold change between the two time points for individuals treated with rituximab on the x axis and individuals treated with tocilizumab on the y axis with each point representing a gene. Genes showing an interaction between time and drug are coloured blue or gold depending on whether their fold change is greater post-rituximab (blue) or post-tocilizumab (gold). Genes without interaction, but changing significantly over time are coloured green and tend to lie along the line of identity. See the Longitudinal tab in for an interactive version of the above plot. ## MA plots An MA plot is an application of a Bland–Altman plot. The plot visualizes the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values. ```{r, fig.height=8} labels = c('MS4A1', 'FGF14', 'IL2RG', 'IGHV3-23', 'ADAM12', 'IL36G', 'BLK', 'SAA1', 'CILP', 'EMILIN3', 'EMILIN2', 'IGHJ6', 'CXCL9', 'CXCL13') maPlots <- maPlot(results, x1var="Timepoint", x2var="EULAR_6m", x2Values=c("Good", "Non-response"), colours=c('grey', 'midnightblue', 'mediumseagreen', 'goldenrod'), labels=labels, graphics="ggplot") maPlots$combined ``` # Troubleshooting Mixed effects models are tricky to fit and `lme4::glmer` sometimes returns errors. In fact, a significant amount of code in `glmmSeq` is devoted to catching and handling errors to allow parallelisation to continue. Errors in genes are stored in the `errors` slot. ```{r} results@errors[1] # first gene error ``` Sometimes `glmmSeq` returns errors in all genes. This usually means a problem with not enough samples in each timepoint or a mistake in the formula. Since version 0.5.5 `glmmSeq` now returns a vector of the error messages for all genes, which can be useful for debugging. In the example below, only timepoint 0 is specified. ```{r, error=TRUE} results3 <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), countdata = tpm[, metadata$Timepoint == 0], metadata = metadata[metadata$Timepoint == 0, ], dispersion = disp) ``` If there are errors which are not caught by the error checking core mechanism, this can lead to problems with grouping results after the core models have been fit. Setting `returnList=TRUE` when calling `glmmSeq` returns the list output direct from `mclapply` (or `parLapply` on windows). This can be helpful for debugging unforeseeen problems in the core loop. # Citing glmmSeq glmmSeq was developed by the bioinformatics team at the [Experimental Medicine & Rheumatology department](https://www.qmul.ac.uk/whri/emr/) and [Centre for Translational Bioinformatics](https://www.qmul.ac.uk/c4tb/) at Queen Mary University London. If you use this package please cite as: ```{r, warning=FALSE} citation("glmmSeq") ``` # References 1. Felice Rivellese, Anna Surace, Katriona Goldmann, Elisabetta Sciacca, Cankut Cubuk, Giovanni Giorli, ... Michael Barnes, Myles J. Lewis, Costantino Pitzalis, R4RA collaborative group. Rituximab versus tocilizumab in rheumatoid arthritis: synovial biopsy-based biomarker analysis of the phase 4 R4RA randomized trial. Nature medicine 2022; 28(6): 1256-68. [doi:10.1038/s41591-022-01789-0](https://doi.org/10.1038/s41591-022-01789-0) Statistical software used in this package: 2. [lme4](https://dx.doi.org/10.18637/jss.v067.i01): Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. doi: 10.18637/jss.v067.i01. 3. [car](https://socialsciences.mcmaster.ca/jfox/Books/Companion/): John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/ 4. [MASS](https://www.stats.ox.ac.uk/pub/MASS4/VR4stat.pdf): Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0 5. [glmmTMB](https://CRAN.R-project.org/package=glmmTMB): Mollie Brooks, Ben Bolker, Kasper Kristensen, Martin Maechler, Arni Magnusson (2022). glmmTMB: Generalized Linear Mixed Models using Template Model Builder 6. [qvalue](https://github.com/StoreyLab/qvalue): John D. Storey, Andrew J. Bass, Alan Dabney and David Robinson (2020). qvalue: Q-value estimation for false discovery rate control. R package version 2.22.0. https://github.com/StoreyLab/qvalue 7. [lmerTest](https://CRAN.R-project.org/package=lmerTest): Alexandra Kuznetsova, Per Brockhoff, Rune Christensen, Sofie Jensen. lmerTest: Tests in Linear Mixed Effects Models 8. [emmeans](https://CRAN.R-project.org/package=emmeans): Russell V. Lenth, Paul Buerkner, Maxime Herve, Jonathon Love, Fernando Miguez, Hannes Riebl, Henrik Singmann. emmeans: Estimated Marginal Means, aka Least-Squares Means