DESeq Workshop 1 - Standard Analyses for RNA-seq Data (unsupervised)

Abstract

This workshop summarizes the main approaches to analyzing sequencing data to obtain per-gene count data, but does not go through how to do it explicitly. It demonstrates approaches to summarizing count data using various Bioconductor annotation resources, followed by unsupervised exploratory data analysis. It introduces the (Ranged)SummarizedExperiment data class, variance-stabilizing transformations, and the use of distances to visualize transcriptome shifts through methods such as Principal Components Analysis and Multidimensional Scaling.

This workshop uses quotes and materials from RNA-seq workflow: gene-level exploratory analysis and differential expression by Love, Anders, Kim, and Huber.

This notebook is forked from two notebooks originally created by Levi Waldron, with modifications to the text and code by Andrew Leith. The remaining material will be adapted into a second workshop on differential expression analysis with DESeq2 at a later date.

Outline

  1. Introduction to RNA-seq Data
  2. Initial Data Exploration
  3. Model Syntax
    • Design Matrices
  4. Filtering
  5. Data Transformations
    • log2
    • rlog
    • vst
    • Transformation Visualization
      • Boxplots
      • Scatterplots
  6. Sample Heterogeneity
    • Heatmaps
    • Component plots
      • PCA plots
      • MDS plots

Approaches to processing raw data: alignment vs. alignment-free methods

The computational analysis of an RNA-seq experiment begins from the FASTQ files that contain the nucleotide sequence of each read and a quality score at each position. These reads are either aligned to a reference genome or transcriptome, or the abundances and counts per transcript can be estimated without alignment. For a paired-end experiment, the software will require both FASTQ files. The output of this alignment step is commonly stored in a file format called SAM/BAM.

RNA-seq aligners differ from DNA aligners in their ability to recognize intron-sized gaps that are spliced out of transcripts, and in their objective of counting transcripts. Baruzzo et al. 2017 provide a useful summary of software for aligning RNA-seq reads to a reference genome, and benchmark these using synthetic data for base-level and splice junction precision and recall, sensitivity to tuning parameters, and runtime performance. Some popular alignment-based tools include STAR, TopHat2, Subread, and Novoalign. De novo assembly does not use a reference genome: while it offers the possibility of discovering novel transcripts in rearranged genomes and genomes without good references, it requires more resources and manual effort and is less accurate for known transcripts, so for most people working with model organisms it is probably not a good option.

Alignment-free tools have gained popularity for their fast performance and low memory requirements while maintaining good accuracy. Instead of directly aligning RNA-seq reads to a reference genome, they perform "pseudo-alignment" by assigning reads to transcripts that contain compatible k-mers. The most popular of these tools are Salmon and Kallisto. These generate a table of read counts directly without the need for a SAM/BAM file as a starting or intermediate step.
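If you are starting from Salmon or kallisto output, the tximport Bioconductor package can aggregate the transcript-level estimates into gene-level counts suitable for DESeq2. The sketch below is illustrative only: the directory layout, file paths, and sample names are hypothetical, and tx2gene stands for a transcript-to-gene table such as the ones built in Appendix B.

library(tximport)
samples <- c("sample1", "sample2")                  # hypothetical sample names
files <- file.path("quants", samples, "quant.sf")   # hypothetical Salmon output paths
# tx2gene: a two-column data.frame mapping transcript IDs to gene IDs (see Appendix B)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
# txi$counts can then be passed to DESeq2 via DESeqDataSetFromTximport()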

You can easily perform RNA-seq alignment in R using Rsubread and GenomicAlignments, but the rest of this workshop assumes you are starting with the output of some sequence analysis pipeline (i.e. counts).

Annotating and importing expression counts

Measures of transcript abundance

The differential expression analyses introduced here assume you are analyzing a table of per-gene (or per-transcript) read counts by sample. Note that although measures such as Transcripts Per Million (TPM) and Counts Per Million (CPM) would order the abundance of different transcripts very differently, for differential expression these differences are not as important. You should not use Fragments Per Kilobase per Million reads (FPKM) or otherwise normalize by library size / read depth before testing: the negative binomial log-linear models used by DESeq2 and edgeR are intended for raw count data, and achieve better performance than can be achieved with read depth-normalized data, which would introduce bias.

The SummarizedExperiment data class for representing RNA-seq data

The component parts of a SummarizedExperiment object.

  • assay(se) or assays(se)$counts contains the matrix of counts
  • colData(se) may contain data about the columns, e.g. patients or biological units
  • rowData(se) may contain data about the rows, e.g. genes or transcripts
  • rowRanges(se) may contain genomic ranges for the genes/transcripts
  • metadata(se) may contain information about the experiment

[Figure: the component parts of a SummarizedExperiment object]

Note: The rowRanges could be a GRanges providing ranges of genes, or a GRangesList providing ranges of exons grouped by gene. Both are accessible from TxDb databases.

RangedSummarizedExperiment is a subclass of SummarizedExperiment, where the rowRanges is not empty. You can convert a SummarizedExperiment to a RangedSummarizedExperiment just by setting its rowRanges.
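As a toy illustration (small simulated data, not the airway counts used below), here is a minimal sketch of building a SummarizedExperiment from its component parts and then, as just described, promoting it by assigning rowRanges:

library(SummarizedExperiment)
cnts <- matrix(rpois(8, lambda = 10), nrow = 2,
               dimnames = list(c("geneA", "geneB"), paste0("sample", 1:4)))
se <- SummarizedExperiment(assays = list(counts = cnts),
                           colData = DataFrame(condition = c("A", "A", "B", "B")))
# assigning rowRanges converts the object, as described above
rowRanges(se) <- GRanges(c("chr1", "chr2"), IRanges(c(100, 500), width = 200))
class(se)   # should now be RangedSummarizedExperiment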

Some exploratory analysis of a demo RangedSummarizedExperiment:

In [1]:
suppressPackageStartupMessages({
    library(airway)
    library(SummarizedExperiment)
})

For the purposes of this tutorial, we will be using a default dataset that is part of R's airway package. The following command loads the airway dataset into our R environment:

In [2]:
data(airway)
In [3]:
airway
class: RangedSummarizedExperiment 
dim: 64102 8 
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

Characterizing the Dataset

Now, we will call some basic functions on this dataset to get a better sense of its fundamental characteristics, including dimensions, the number of counts present for each sample, and its structure:

In [4]:
dim(airway)
  1. 64102
  2. 8
In [5]:
assayNames(airway)
'counts'
In [6]:
head(assay(airway), 3)
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003        679        448        873        408       1138       1047        770        572
ENSG00000000005          0          0          0          0          0          0          0          0
ENSG00000000419        467        515        621        365        587        799        417        508
In [7]:
colSums(assay(airway))
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
  20637971   18809481   25348649   15163415   24448408   30818215   19126151   21164133

The rowRanges here is a GRangesList. The first gene has 17 exons:

In [8]:
rowRanges(airway)
GRangesList object of length 64102:
$ENSG00000000003 
GRanges object with 17 ranges and 2 metadata columns:
       seqnames            ranges strand |   exon_id       exon_name
          <Rle>         <IRanges>  <Rle> | <integer>     <character>
   [1]        X 99883667-99884983      - |    667145 ENSE00001459322
   [2]        X 99885756-99885863      - |    667146 ENSE00000868868
   [3]        X 99887482-99887565      - |    667147 ENSE00000401072
   [4]        X 99887538-99887565      - |    667148 ENSE00001849132
   [5]        X 99888402-99888536      - |    667149 ENSE00003554016
   ...      ...               ...    ... .       ...             ...
  [13]        X 99890555-99890743      - |    667156 ENSE00003512331
  [14]        X 99891188-99891686      - |    667158 ENSE00001886883
  [15]        X 99891605-99891803      - |    667159 ENSE00001855382
  [16]        X 99891790-99892101      - |    667160 ENSE00001863395
  [17]        X 99894942-99894988      - |    667161 ENSE00001828996

...
<64101 more elements>
-------
seqinfo: 722 sequences (1 circular) from an unspecified genome

The rowRanges object also contains information about the gene model in the metadata slot. Here we use a helpful R function, str, to display the metadata compactly:

In [9]:
str(metadata(rowRanges(airway)))
List of 1
 $ genomeInfo:List of 20
  ..$ Db type                                 : chr "TranscriptDb"
  ..$ Supporting package                      : chr "GenomicFeatures"
  ..$ Data source                             : chr "BioMart"
  ..$ Organism                                : chr "Homo sapiens"
  ..$ Resource URL                            : chr "www.biomart.org:80"
  ..$ BioMart database                        : chr "ensembl"
  ..$ BioMart database version                : chr "ENSEMBL GENES 75 (SANGER UK)"
  ..$ BioMart dataset                         : chr "hsapiens_gene_ensembl"
  ..$ BioMart dataset description             : chr "Homo sapiens genes (GRCh37.p13)"
  ..$ BioMart dataset version                 : chr "GRCh37.p13"
  ..$ Full dataset                            : chr "yes"
  ..$ miRBase build ID                        : chr NA
  ..$ transcript_nrow                         : chr "215647"
  ..$ exon_nrow                               : chr "745593"
  ..$ cds_nrow                                : chr "537555"
  ..$ Db created by                           : chr "GenomicFeatures package from Bioconductor"
  ..$ Creation time                           : chr "2014-07-10 14:55:55 -0400 (Thu, 10 Jul 2014)"
  ..$ GenomicFeatures version at creation time: chr "1.17.9"
  ..$ RSQLite version at creation time        : chr "0.11.4"
  ..$ DBSCHEMAVERSION                         : chr "1.0"

The colData:

In [10]:
colData(airway)
DataFrame with 8 rows and 9 columns
           SampleName     cell      dex    albut        Run avgLength
             <factor> <factor> <factor> <factor>   <factor> <integer>
SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
           Experiment    Sample    BioSample
             <factor>  <factor>     <factor>
SRR1039508  SRX384345 SRS508568 SAMN02422669
SRR1039509  SRX384346 SRS508567 SAMN02422675
SRR1039512  SRX384349 SRS508571 SAMN02422678
SRR1039513  SRX384350 SRS508572 SAMN02422670
SRR1039516  SRX384353 SRS508575 SAMN02422682
SRR1039517  SRX384354 SRS508576 SAMN02422673
SRR1039520  SRX384357 SRS508579 SAMN02422683
SRR1039521  SRX384358 SRS508580 SAMN02422677

Exploratory data analysis

Standardization / transformation of count data for visual exploration

Features (e.g. genes) are typically "standardized", because:

  1. the differences in overall levels between features are often not due to biological effects but technical ones, e.g. GC bias, PCR amplification efficiency, ...
  2. otherwise a few very high-variance genes can dominate the distance, and make the distances less biologically relevant

Standardization traditionally meant converting to z-score:

$$x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}$$

But DESeq2 and edgeR provide their own RNA-seq specific standardization methods.

We will use standardization for the purposes of visualization below, but for statistical testing in the next period, we will go back to the original count data as required by the statistical procedures used.
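For illustration only, the traditional per-gene z-score can be computed with base R's scale(); because scale() standardizes columns, we transpose so that rows (genes) are standardized. We will not use these z-scores further.

mat <- head(assay(airway)[rowSums(assay(airway)) > 0, ], 3)  # skip all-zero genes
z <- t(scale(t(mat)))   # center and scale each gene (row)
round(z, 2)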

Exploring the airway dataset

Once we have our fully annotated SummarizedExperiment object, we can construct a DESeqDataSet object from it that will then form the starting point of the analysis. We add an appropriate design for the analysis:

In [11]:
suppressPackageStartupMessages(library(DESeq2))
In [12]:
dds <- DESeqDataSet(airway, design = ~ cell + dex)

Of particular note is the model formula, which is passed as the design argument. Here, ~ cell + dex means that we are modeling the counts as a function of two covariates, cell and dex; the tilde (~) is how R denotes a model formula.

Model specification

Model formulae tutorial

  • Many regression functions in R use a standard "model formula" syntax
  • DESeq2 is a special case of regression that uses a negative binomial distribution
  • The formula determines the model that will be built (and tested) by the R procedure. The basic format is:

> outcome (or response) variable ~ explanatory variables (or predictors)

  • The tilde means "is modeled by" or "is modeled as a function of" - here, the left side of the formula is implicit because DESeq2 acts on a matrix of counts; these counts constitute the outcome/response, so no explicit specification of the response is required

Essential model formula syntax

symbol   example          meaning
  +      + x              include this variable
  -      - x              delete this variable
  :      x : z            include the interaction between these variables
  *      x * z            include these variables and their interactions
  ^      (u + v + w)^3    include these variables and all interactions up to three-way*
  1      - 1              the intercept; - 1 deletes the intercept

* i.e. u + v + w + u:v + u:w + v:w + u:v:w

Note: order generally doesn't matter (u+v OR v+u evaluate to the same mathematical model)

Examples: ~treatment, ~treatment + time, ~treatment * time

The Design Matrix

Use an example of a multiple linear regression model:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i$

  • $x_{ji}$ is the value of predictor $x_j$ for observation $i$


Matrix notation for the multiple linear regression model:

$$ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} $$

or simply:

$$ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} $$

  • The design matrix is $\mathbf{X}$
    • which the computer will take as a given when solving for $\boldsymbol{\beta}$ by minimizing the sum of squares of residuals $\boldsymbol{\varepsilon}$.
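To make this concrete, here is a small self-contained sketch (toy data, unrelated to the airway analysis) showing that the least-squares estimate of $\boldsymbol{\beta}$ can be computed directly from the design matrix returned by model.matrix():

x <- c(1.1, 2.0, 3.2, 4.1, 5.3)
y <- 2 + 3 * x + rnorm(5, sd = 0.1)        # toy outcome with known coefficients
X <- model.matrix(~ x)                     # design matrix with an intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations
beta.hat                                   # same estimates as coef(lm(y ~ x))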

There are multiple possible and reasonable design matrices for a given study design

  • the model formula encodes a default model matrix, e.g.:
In [13]:
group <- factor( c(1, 1, 2, 2) )
model.matrix(~ group)
  (Intercept) group2
1           1      0
2           1      0
3           1      1
4           1      1

What if we forgot to code group as a factor, and instead left it as numeric?

In [14]:
group <- c(1, 1, 2, 2)
model.matrix(~ group)
  (Intercept) group
1           1     1
2           1     1
3           1     2
4           1     2

The model matrix can represent multiple predictors, i.e., multiple (linear/log-linear/logistic) regression:

In [15]:
diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)
  (Intercept) diet2 sexm
1           1     0    0
2           1     0    0
3           1     0    1
4           1     0    1
5           1     1    0
6           1     1    0
7           1     1    1
8           1     1    1

Why is f used as the default reference group instead of m? Because R orders factor levels alphabetically (ASCII order) by default, so "f" precedes "m" and becomes the reference level.

More information on this topic can be found in Appendix A, which will be covered if time permits.

Filtering genes not satisfying a minimum expression threshold

Our count matrix contains many rows with only zeros, and additionally many rows with only a few fragments total. In order to reduce the size of the object, and to increase the speed of our functions, we can remove the rows that have no or nearly no information about the amount of gene expression (rows very close to zero but not exactly zero are likely artefacts / noise not indicative of actual gene expression). Here we apply the most minimal and conservative filtering rule: removing rows of the DESeqDataSet that have no counts, or only a single count across all samples. Additional weighting/filtering to improve power is applied at a later step in the workflow.

In [16]:
nrow(dds)
64102
In [17]:
dds.original <- dds
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
29391

NB: only about 46% of the genes remain after this filtering step, which shows how much smaller and faster the object becomes once noninformative entries are discarded. A more stringent rule, such as removing every gene with any zero-count entry at all, would leave an even smaller percentage of genes.
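For comparison, a stricter rule might require a minimum count in a minimum number of samples; the thresholds below are arbitrary illustrations, not a recommendation, and we keep the minimal filter above for the rest of the workshop.

keep <- rowSums(counts(dds.original) >= 10) >= 3   # at least 10 counts in at least 3 samples
sum(keep)                                          # genes that would survive this stricter rule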

In [18]:
head(as.data.frame(counts(dds)))
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003        679        448        873        408       1138       1047        770        572
ENSG00000000419        467        515        621        365        587        799        417        508
ENSG00000000457        260        211        263        164        245        331        233        229
ENSG00000000460         60         55         40         35         78         63         76         60
ENSG00000000938          0          0          2          0          1          0          0          0
ENSG00000000971       3251       3679       6177       4252       6721      11027       5176       7995
In [19]:
options(repr.plot.height=4, repr.plot.width=7)
par(mfrow=c(1,2))

# One color per sample, in column order
cols <- c("Blue", "Red", "Green", "Orange", "Purple", "Black", "Yellow", "Gray")

# Per-sample count densities before filtering
raw <- as.data.frame(counts(dds.original))
plot(density(raw[[1]]), xlim = c(0, 10000), ylim = c(0, 0.2),
     main = "Raw Densities", col = cols[1])
for (i in 2:ncol(raw)) lines(density(raw[[i]]), col = cols[i])

# Per-sample count densities after filtering
filt <- as.data.frame(counts(dds))
plot(density(filt[[1]]), xlim = c(0, 2500), ylim = c(0, 0.01),
     main = "Filtered Densities", col = cols[1])
for (i in 2:ncol(filt)) lines(density(filt[[i]]), col = cols[i])

As we can see, the count densities before and after filtering are radically different, even though the raw densities are plotted over much larger axis ranges. The reason for this divergence is that the more than 30,000 zero-count genes shift the raw density so that almost all of the area under the curve sits near 0.

These zero-count genes contribute no meaningful information to our analysis (and can even bias some of our analytical techniques), so we remove them.

Note: For the differential expression analysis later, this kind of pre-filtering is allowable but not strictly necessary, because DESeq2 applies independent filtering by default and can alternatively use Independent Hypothesis Weighting (IHW). IHW is a multiple testing procedure that increases power compared to the method of Benjamini and Hochberg by assigning data-driven weights to each hypothesis.

Why is it Important to Transform Data Before Analysis?

The log, rlog, and variance stabilizing transformations

Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. When the expected amount of variance is approximately the same across different mean values, the data is said to be homoskedastic. For RNA-seq counts, however, the expected variance grows with the mean. For example, if one performs PCA directly on a matrix of counts or normalized counts (e.g. correcting for differences in sequencing depth), the resulting plot typically depends mostly on the genes with highest counts because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a pseudocount of 1 (to account for 0-count observations, as the log of 0 is undefined and the log of 1 is always 0 irrespective of base); however, depending on the choice of pseudocount, the genes with the very lowest counts will contribute a great deal of noise to the resulting plot, because taking the logarithm of small counts actually inflates their variance. We can quickly show this property of counts with simulated data (here, Poisson-distributed counts with a range of lambda from 0.1 to 100). We plot the standard deviation of each row (genes) against the mean:

In [20]:
options(repr.plot.height=3, repr.plot.width=5)
lambda <- 10^seq(from = -1, to = 2, length = 1000)
cts <- matrix(rpois(1000*100, lambda), ncol = 100)
library(vsn)
meanSdPlot(cts, ranks = FALSE)

And for logarithm-transformed counts:

In [21]:
options(repr.plot.height=3, repr.plot.width=5)
log.cts.one <- log2(cts + 1)
meanSdPlot(log.cts.one, ranks = FALSE)

Note the markedly lower scale of the axes in the log-transformed data!

The logarithm with a small pseudocount (i.e. the standard log2(x+1) transformation) amplifies differences when the values are close to 0. The low count genes with low signal-to-noise ratio will then overly contribute to sample-sample distances and PCA plots.
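A quick numeric illustration: a difference of a single read is large on the log2(x + 1) scale near zero, but negligible for large counts.

log2(1 + 1)    - log2(0 + 1)        # = 1
log2(1001 + 1) - log2(1000 + 1)     # ~ 0.0014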

As a solution, DESeq2 offers two transformations for count data that stabilize the variance across the mean: the regularized-logarithm transformation or rlog (Love, Huber, Anders, Genome Biology 2014), and the variance stabilizing transformation (VST) for negative binomial data with a dispersion-mean trend (Anders and Huber, Genome Biology 2010), implemented in the vst function.

For genes with high counts, the rlog and VST will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards the genes' averages across all samples. The rlog-transformed or VST data then becomes approximately homoskedastic, and can be used directly for computing distances between samples, making PCA plots, or as input to downstream methods which perform best with homoskedastic data.

Which transformation to choose? The rlog tends to work well on small datasets (n < 30), sometimes outperforming the VST when there is a large range of sequencing depth across samples (an order of magnitude difference). The VST is much faster to compute and is less sensitive to high count outliers than the rlog. We therefore recommend the VST for large datasets (hundreds of samples). You can perform both transformations and compare the meanSdPlot or PCA plots generated, as described below.

Note that the two transformations offered by DESeq2 are provided for applications other than differential testing. For differential testing we recommend the DESeq function applied to raw counts, as described in the next workshop on differential expression analysis; this approach also takes into account the dependence of the variance of counts on the mean value during the dispersion estimation step.

The function rlog returns an object based on the SummarizedExperiment class that contains the rlog-transformed values in its assay slot. The assay slot is simply the piece of the dds object that contains the actual counts (as opposed to other information such as metadata). When you call rlog on a dds object, R transforms the contents of the assay slot rather than the entire object.

First, we will depict the untransformed data, against which we can visually compare the results of the two aforementioned transformations. Note the extreme outliers and leverage points in the upper right of the plot, each colored dark blue to show that only one gene falls within each point - these observations would cause a great deal of bias in the result if uncorrected:

In [22]:
suppressPackageStartupMessages(library(vsn))
options(repr.plot.height=3, repr.plot.width=5)
head(assay(dds), 3)
par(mfrow=c(1,2))
meanSdPlot(assay(dds), ranks=FALSE)
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003        679        448        873        408       1138       1047        770        572
ENSG00000000419        467        515        621        365        587        799        417        508
ENSG00000000457        260        211        263        164        245        331        233        229

Now we will implement the rlog function:

In [23]:
options(repr.plot.height=3, repr.plot.width=5)
rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)
meanSdPlot(assay(rld), ranks=FALSE)
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003   9.385681   9.052599   9.516877   9.285335   9.839093   9.530313   9.663260   9.277695
ENSG00000000419   8.869611   9.138274   9.036117   9.075296   8.972125   9.131828   8.861529   9.060906
ENSG00000000457   7.961373   7.881385   7.824075   7.921493   7.751690   7.886441   7.957126   7.912125

The function vst returns a similar object. Please note that the VST transformation is meant for exploratory data analysis and visualization only and is not to be used in actual differential expression analysis.

In [24]:
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
meanSdPlot(assay(vsd), ranks=FALSE)
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003   9.742074   9.430420   9.867626   9.645844  10.183143   9.880416  10.010366   9.639782
ENSG00000000419   9.333668   9.581706   9.486145   9.523093   9.427283   9.574860   9.325998   9.509246
ENSG00000000457   8.765274   8.698448   8.651473   8.732426   8.592786   8.702673   8.761945   8.724101

In the above function calls, we specified blind = FALSE, which means that differences between cell lines and treatment (the variables in the design) will not contribute to the expected variance-mean trend of the experiment. The experimental design is not used directly in the transformation, only in estimating the global amount of variability in the counts. For a fully unsupervised transformation, one can set blind = TRUE (which is the default).
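For example, a fully blinded VST could be computed as follows (we do not use it further here):

vsd.blind <- vst(dds, blind = TRUE)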

To show the effect of the transformation, in the figure below we plot the first sample against the second, first simply using the log2 function (after adding 1, to avoid taking the log of zero, which as previously mentioned is undefined and evaluates to -Inf in R), and then using the rlog- and VST-transformed values. For the log2 approach, we need to first estimate size factors to account for sequencing depth, and then specify normalized=TRUE. Sequencing depth correction is done automatically for the rlog and the vst.

In [25]:
dds <- estimateSizeFactors(dds)
In [26]:
ddsESF <- estimateSizeFactors(dds)
df1 <- data.frame(log2(counts(ddsESF, normalized=TRUE)[, 1:2] + 1))
df1$transformation <- "log2(x + 1)"
df2 <- data.frame(assay(rld)[, 1:2])
df2$transformation <- "rld"
df3 <- data.frame(assay(vsd)[, 1:2])
df3$transformation <- "vsd"
df <- rbind(df1, df2, df3)
colnames(df)[1:2] <- c("x", "y")
head(df)
table(df$transformation)
                        x         y transformation
ENSG00000000003  9.375722  8.968399    log2(x + 1)
ENSG00000000419  8.836718  9.169098    log2(x + 1)
ENSG00000000457  7.994317  7.885375    log2(x + 1)
ENSG00000000460  5.897577  5.962838    log2(x + 1)
ENSG00000000938  0.000000  0.000000    log2(x + 1)
ENSG00000000971 11.633403 12.003610    log2(x + 1)
log2(x + 1)         rld         vsd 
      29391       29391       29391 
In [27]:
options(repr.plot.width=6, repr.plot.height=2.5)
library(ggplot2)
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid( . ~ transformation)

Scatterplot of transformed counts from two samples. Shown are scatterplots using the log2 transform of normalized counts (left), using the rlog (middle), and using the VST (right). While the rlog is on roughly the same scale as the log2 counts, the VST has an upward shift for the smaller values. It is the differences between samples (deviation from y = x in these scatterplots) which will contribute to the distance calculations and the PCA plot.

We can see how genes with low counts (bottom left-hand corner) seem to be excessively variable on the ordinary logarithmic scale, while the rlog transform and VST compress differences for the low-count genes, for which the data provide little information about differential expression. Because any differences in these low-count values are likely artefacts of error or random chance, shrinking them helps to prevent false positives.

Boxplots of transformed distributions

Boxplots of the count distributions in each sample are a good way to understand the effects these transformations have at the level of individual subjects.

In R's boxplots, points classified as outliers are shown on the plots as circles. Here, neither the log2 nor the rld transformation produces any outliers, while the vsd transformation does show outliers at the upper end of the distributions. These points are classified as outliers because the vsd method compresses the lower end of the distribution more substantially, as can be seen in the scatter plot above - the actual magnitude of these points does not meaningfully exceed the magnitude seen with the other transformations.

In [28]:
par(mfrow=c(1,3))
boxplot(log2(assay(ddsESF)+1), las=2, main="log2(x+1)")
boxplot(assay(rld), las=2, main="rld")
boxplot(assay(vsd), las=2, main="vsd")

Sample distances

The importance of distances

High-dimensional data are complex and impossible to visualize in raw form. They represent thousands of dimensions, but we can only visualize 2-3.

All clustering and classification of samples and/or genes involves combining or identifying objects that are close or similar. Distances or similarities are mathematical representations of what we mean by close or similar. The choice of distance is important and varies for different subject areas and types of data.

See: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf
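As a small illustration of how this choice enters in practice, other distances can be computed on the same transformed expression matrix; the two examples below (Manhattan distance and a correlation-based dissimilarity) use the rlog values computed earlier and are shown only for comparison with the Euclidean distance used in the next section.

sampleDistsL1 <- dist(t(assay(rld)), method = "manhattan")   # Manhattan (L1) distance
sampleDistsCor <- as.dist(1 - cor(assay(rld)))               # 1 - Pearson correlation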

Distances for exploratory RNA-seq data analysis

A useful first step in an RNA-seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment's design?

We use the R function dist to calculate the "Euclidean distance" between samples; the Euclidean distance is simply the higher-dimensional analogue of the human-measurable distance between points in 2 or 3 dimensions. To ensure we have a roughly equal contribution from all genes, we calculate it from the rlog-transformed data. We need to transpose the matrix of values using t, because the dist function expects the different samples to be rows of its argument, and different dimensions (here, genes) to be columns.

In [29]:
sampleDists <- dist(t(assay(rld)))
sampleDists
           SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
SRR1039509   45.70157                                                       
SRR1039512   39.25600   54.91229                                            
SRR1039513   62.63599   44.53085   48.72888                                 
SRR1039516   44.50962   59.06803   43.58242   63.74707                      
SRR1039517   64.49868   51.45303   59.23397   49.88394   47.48509           
SRR1039520   39.58027   57.46654   36.74746   58.49397   46.41179   63.60395
SRR1039521   63.36521   45.06068   57.87997   36.49799   65.55042   52.32104
           SRR1039520
SRR1039509           
SRR1039512           
SRR1039513           
SRR1039516           
SRR1039517           
SRR1039520           
SRR1039521   50.13745

We visualize the distances in a heatmap in a figure below, using the function pheatmap from the pheatmap package.

In [30]:
library(pheatmap)
library(RColorBrewer)

In order to plot the sample distance matrix with the rows/columns arranged by the distances in our distance matrix, we manually provide sampleDists to the clustering_distance_rows and clustering_distance_cols arguments of the pheatmap function. Otherwise the pheatmap function would assume that the matrix contains the data values themselves, and would calculate distances between the rows/columns of the distance matrix, which is not desired. We also manually specify a blue color palette using the colorRampPalette function from the RColorBrewer package (this coloring step is optional and is determined by user preference).

In [31]:
options(repr.plot.height=3, repr.plot.width=5)
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)

Heatmap of sample-to-sample distances using the rlog-transformed values.

Note that we have changed the row names of the distance matrix to contain treatment type and patient number instead of sample ID, so that we have all this information in view when looking at the heatmap.

Another option for calculating sample distances is to use the Poisson Distance, implemented in the PoiClaClu CRAN package. This measure of dissimilarity between counts also takes the inherent variance structure of counts into consideration when calculating the distances between samples. The PoissonDistance function takes the original count matrix (not normalized) with samples as rows instead of columns, so we need to transpose the counts in dds.

In [32]:
library(PoiClaClu)
poisd <- PoissonDistance(t(counts(ddsESF)))

We plot the heatmap below.

In [33]:
options(repr.plot.height=3, repr.plot.width=5)
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste( rld$dex, rld$cell, sep=" - " )
colnames(samplePoisDistMatrix) <- NULL
pheatmap(samplePoisDistMatrix,
         clustering_distance_rows = poisd$dd,
         clustering_distance_cols = poisd$dd,
         col = colors)

Heatmap of sample-to-sample distances using the Poisson Distance.

PCA plot

Another way to visualize sample-to-sample distances is a principal components analysis (PCA). In this ordination method, the data points (here, the samples) are projected onto the 2D plane such that they spread out in the two directions that explain most of the differences (figure below). The x-axis is the direction that separates the data points the most. The values of the samples in this direction are written PC1 ("principal component 1"). The y-axis is a direction (it must be orthogonal - perpendicular - to the first direction) that separates the data the second most. The values of the samples in this direction are written PC2. The percent of the total variance that is contained in the direction is printed in the axis label. Note that these percentages do not add to 100%, because there are more dimensions - often many of them - that contain the remaining variance (although each of these remaining dimensions will by definition explain less than the two that we see).
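For the curious, here is a minimal sketch of roughly what such a PCA computation looks like "by hand", assuming (as DESeq2's plotPCA does by default) that only the 500 most variable genes are used; the exact implementation inside DESeq2 may differ in details.

rv <- apply(assay(rld), 1, var)                  # per-gene variance
sel <- order(rv, decreasing = TRUE)[1:500]       # the 500 most variable genes
pca <- prcomp(t(assay(rld)[sel, ]))              # samples as rows, genes as columns
round(100 * pca$sdev^2 / sum(pca$sdev^2))[1:2]   # percent variance of PC1 and PC2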

In [34]:
colData(rld)
plotPCA(rld, intgroup = c("dex", "cell"))
DataFrame with 8 rows and 10 columns
           SampleName     cell      dex    albut        Run avgLength
             <factor> <factor> <factor> <factor>   <factor> <integer>
SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
           Experiment    Sample    BioSample        sizeFactor
             <factor>  <factor>     <factor>         <numeric>
SRR1039508  SRX384345 SRS508568 SAMN02422669  1.02364764926706
SRR1039509  SRX384346 SRS508567 SAMN02422675 0.896166721793923
SRR1039512  SRX384349 SRS508571 SAMN02422678  1.17948612081678
SRR1039513  SRX384350 SRS508572 SAMN02422670 0.670053829048202
SRR1039516  SRX384353 SRS508575 SAMN02422682  1.17767135405022
SRR1039517  SRX384354 SRS508576 SAMN02422673  1.39903646915342
SRR1039520  SRX384357 SRS508579 SAMN02422683 0.920778683328161
SRR1039521  SRX384358 SRS508580 SAMN02422677 0.944514089340919

PCA plot using the rlog-transformed values. Each unique combination of treatment and cell line is given its own color.

Here, we have used the function plotPCA() that comes with DESeq2. The two terms specified by intgroup are the interesting groups for labeling the samples; they tell the function to use them to choose colors. We can also build the PCA plot from scratch using the ggplot2 package. This is done by asking the plotPCA() function to return the data used for plotting rather than building the plot. See the ggplot2 documentation for more details on using ggplot().

In [35]:
pcaData <- plotPCA(rld, intgroup = c( "dex", "cell"), returnData = TRUE)
pcaData
percentVar <- round(100 * attr(pcaData, "percentVar"))
                  PC1        PC2         group   dex    cell       name
SRR1039508 -17.818478  -4.021260  untrt:N61311 untrt  N61311 SRR1039508
SRR1039509   8.388115  -1.491329    trt:N61311   trt  N61311 SRR1039509
SRR1039512 -10.227671  -5.004222 untrt:N052611 untrt N052611 SRR1039512
SRR1039513  17.533464  -3.910088   trt:N052611   trt N052611 SRR1039513
SRR1039516 -14.672156  15.874223 untrt:N080611 untrt N080611 SRR1039516
SRR1039517  10.988342  20.599713   trt:N080611   trt N080611 SRR1039517
SRR1039520 -12.060845 -11.986221 untrt:N061011 untrt N061011 SRR1039520
SRR1039521  17.869229 -10.060816   trt:N061011   trt N061011 SRR1039521

We can then use this structure to build up a second plot in a figure below, specifying that the color of the points should reflect dexamethasone treatment and the shape should reflect the cell line.

In [36]:
ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
  geom_point(size =3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed()

PCA plot using the rlog-transformed values with custom ggplot2 code. Here we specify cell line (plotting symbol) and dexamethasone treatment (color).

From the PCA plot, we see that the differences between cells (the different plotting shapes) are considerable, though not stronger than the differences due to treatment with dexamethasone (red vs blue color). This shows why it will be important to account for this in differential testing by using a paired design ("paired", because each dex treated sample is paired with one untreated sample from the same cell line). We are already set up for this design by assigning the formula ~ cell + dex earlier when specifying our model.

MDS plot

Another plot, very similar to the PCA plot, can be made using the multidimensional scaling (MDS) function in base R. This is useful when we don't have a matrix of data, but only a matrix of distances. Here we compute the MDS for the distances calculated from the rlog transformed counts and plot these in a figure below.

This method, unlike PCA, does not show the percent of variance explained by the individual components.
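That said, cmdscale() can optionally return its eigenvalues (eig = TRUE), and their relative magnitudes give a rough analogue of the variance explained by each MDS axis; the snippet below is a sketch of that idea, not part of the standard workflow.

mdsEig <- cmdscale(sampleDistMatrix, k = 2, eig = TRUE)
round(100 * mdsEig$eig[1:2] / sum(abs(mdsEig$eig)))   # rough % "variance" for axes 1 and 2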

MDS plot using rlog-transformed values.

In [37]:
mds <- cbind(as.data.frame(colData(rld)), cmdscale(sampleDistMatrix))
ggplot(mds, aes(x = `1`, y = `2`, color = dex, shape = cell)) +
  geom_point(size = 3) + coord_fixed()

In the figure below we show the same plot for the PoissonDistance. Note the difference in axis scale.

MDS plot using the Poisson Distance.

In [38]:
mdsPois <- cbind(as.data.frame(colData(ddsESF)),
   cmdscale(samplePoisDistMatrix))
ggplot(mdsPois, aes(x = `1`, y = `2`, color = dex, shape = cell)) +
  geom_point(size = 3) + coord_fixed()

PCA preserves the covariance of the data, while MDS preserves the distance between the data points. Depending on what type of distance you use for your MDS analysis, these two results can be the same (if Euclidean distance is used) or different. Here, we have used the Poisson distance (since it takes the structure of the variance into account), so the MDS and PCA results are not the same.
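As an optional check of this equivalence, classical MDS of the Euclidean distances recovers the PCA coordinates (up to sign) when both are computed on the same genes:

pcAll <- prcomp(t(assay(rld)))                    # PCA on all filtered genes
mdsAll <- cmdscale(dist(t(assay(rld))), k = 2)    # classical MDS of Euclidean distances
colnames(mdsAll) <- c("MDS1", "MDS2")
head(round(cbind(pcAll$x[, 1:2], mdsAll), 2))     # columns agree up to sign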

Appendices

Appendix A

Interaction terms

Here again is an example of a single predictor, but with more groups:

In [39]:
group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)
  (Intercept) group2 group3
1           1      0      0
2           1      0      0
3           1      1      0
4           1      1      0
5           1      0      1
6           1      0      1

The baseline or reference group is the one that other groups are contrasted against. You can change the baseline group:

In [40]:
group <- factor(c(1,1,2,2,3,3))
group <- relevel(x=group, ref=3)
model.matrix(~ group)
  (Intercept) group1 group2
1           1      1      0
2           1      1      0
3           1      0      1
4           1      0      1
5           1      0      0
6           1      0      0

Here, we demonstrate manually adding interaction terms:

In [41]:
model.matrix(~ diet + sex + diet:sex)
  (Intercept) diet2 sexm diet2:sexm
1           1     0    0          0
2           1     0    0          0
3           1     0    1          0
4           1     0    1          0
5           1     1    0          0
6           1     1    0          0
7           1     1    1          1
8           1     1    1          1

The result is the same as using the * operator:

In [42]:
model.matrix(~ diet * sex)
  (Intercept) diet2 sexm diet2:sexm
1           1     0    0          0
2           1     0    0          0
3           1     0    1          0
4           1     0    1          0
5           1     1    0          0
6           1     1    0          0
7           1     1    1          1
8           1     1    1          1

Appendix B

Additional options for mapping transcripts to genes

Option 1. Bioconductor TxDb.* packages, which are available for a number of model species (search for "TxDb" at https://bioconductor.org/packages/3.6/data/annotation/). These are slightly less convenient for this purpose than the Ensembl packages. For example, there is a TxDb package for Mus musculus based on the UCSC mm10 build and the knownGene track:

In [43]:
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
keytypes(txdb)
k <- keys(txdb, "TXNAME")
columns(txdb)
tx4gene <- select(txdb, keys=k, keytype="TXNAME", column="GENEID")
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
  1. 'CDSID'
  2. 'CDSNAME'
  3. 'EXONID'
  4. 'EXONNAME'
  5. 'GENEID'
  6. 'TXID'
  7. 'TXNAME'
  1. 'CDSCHROM'
  2. 'CDSEND'
  3. 'CDSID'
  4. 'CDSNAME'
  5. 'CDSSTART'
  6. 'CDSSTRAND'
  7. 'EXONCHROM'
  8. 'EXONEND'
  9. 'EXONID'
  10. 'EXONNAME'
  11. 'EXONRANK'
  12. 'EXONSTART'
  13. 'EXONSTRAND'
  14. 'GENEID'
  15. 'TXCHROM'
  16. 'TXEND'
  17. 'TXID'
  18. 'TXNAME'
  19. 'TXSTART'
  20. 'TXSTRAND'
  21. 'TXTYPE'
'select()' returned 1:1 mapping between keys and columns
In [44]:
genes(txdb)
exonsBy(txdb)
GRanges object with 24044 ranges and 1 metadata column:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
  100009600     chr9   21062393-21075496      - |   100009600
  100009609     chr7   84940169-84964009      - |   100009609
  100009614    chr10   77711446-77712009      + |   100009614
  100009664    chr11   45808083-45842878      + |   100009664
     100012     chr4 144157556-144162651      - |      100012
        ...      ...                 ...    ... .         ...
      99889     chr3   84496093-85887518      - |       99889
      99890     chr3 110246104-110250999      - |       99890
      99899     chr3 151730923-151749959      - |       99899
      99929     chr3   65528447-65555518      + |       99929
      99982     chr4 136550533-136602723      - |       99982
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome
GRangesList object of length 63759:
$1 
GRanges object with 8 ranges and 3 metadata columns:
      seqnames          ranges strand |   exon_id   exon_name exon_rank
         <Rle>       <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 4807893-4807982      + |         1        <NA>         1
  [2]     chr1 4808455-4808486      + |         2        <NA>         2
  [3]     chr1 4828584-4828649      + |         3        <NA>         3
  [4]     chr1 4830268-4830315      + |         4        <NA>         4
  [5]     chr1 4832311-4832381      + |         5        <NA>         5
  [6]     chr1 4837001-4837074      + |         6        <NA>         6
  [7]     chr1 4839387-4839488      + |         7        <NA>         7
  [8]     chr1 4840956-4842827      + |         9        <NA>         8

$2 
GRanges object with 9 ranges and 3 metadata columns:
      seqnames          ranges strand | exon_id exon_name exon_rank
  [1]     chr1 4807893-4807982      + |       1      <NA>         1
  [2]     chr1 4808455-4808486      + |       2      <NA>         2
  [3]     chr1 4828584-4828649      + |       3      <NA>         3
  [4]     chr1 4830268-4830315      + |       4      <NA>         4
  [5]     chr1 4832311-4832381      + |       5      <NA>         5
  [6]     chr1 4837001-4837074      + |       6      <NA>         6
  [7]     chr1 4839387-4839488      + |       7      <NA>         7
  [8]     chr1 4840956-4841132      + |       8      <NA>         8
  [9]     chr1 4844963-4846735      + |      10      <NA>         9

...
<63757 more elements>
-------
seqinfo: 66 sequences (1 circular) from mm10 genome

Option 2. Find a TxDb from AnnotationHub

This opens up more annotation files from different sources. First find and choose a TxDb database:

In [45]:
suppressPackageStartupMessages(library(AnnotationHub))
ah <- AnnotationHub()
query(ah, "TxDb")
txdb <- ah[["AH52245"]] #TxDb.Athaliana.BioMart.plantsmart22.sqlite
updating metadata: retrieving 1 resource
snapshotDate(): 2018-04-30
AnnotationHub with 59 records
# snapshotDate(): 2018-04-30 
# $dataprovider: UCSC
# $species: Rattus norvegicus, Gallus gallus, Macaca mulatta, Caenorhabditis...
# $rdataclass: TxDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH52245"]]' 

            title                                        
  AH52245 | TxDb.Athaliana.BioMart.plantsmart22.sqlite   
  AH52246 | TxDb.Athaliana.BioMart.plantsmart25.sqlite   
  AH52247 | TxDb.Athaliana.BioMart.plantsmart28.sqlite   
  AH52248 | TxDb.Btaurus.UCSC.bosTau8.refGene.sqlite     
  AH52249 | TxDb.Celegans.UCSC.ce11.refGene.sqlite       
  ...       ...                                          
  AH61796 | TxDb.Mmulatta.UCSC.rheMac8.refGene.sqlite    
  AH61797 | TxDb.Ptroglodytes.UCSC.panTro4.refGene.sqlite
  AH61798 | TxDb.Rnorvegicus.UCSC.rn5.refGene.sqlite     
  AH61799 | TxDb.Rnorvegicus.UCSC.rn6.refGene.sqlite     
  AH61800 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite     
downloading 0 resources
loading from cache 
    ‘/home/ubuntu//.AnnotationHub/58983’

Then use it to create the transcript-gene map:

In [46]:
keytypes(txdb)
  1. 'CDSID'
  2. 'CDSNAME'
  3. 'EXONID'
  4. 'EXONNAME'
  5. 'GENEID'
  6. 'TXID'
  7. 'TXNAME'
In [47]:
columns(txdb)
  1. 'CDSCHROM'
  2. 'CDSEND'
  3. 'CDSID'
  4. 'CDSNAME'
  5. 'CDSSTART'
  6. 'CDSSTRAND'
  7. 'EXONCHROM'
  8. 'EXONEND'
  9. 'EXONID'
  10. 'EXONNAME'
  11. 'EXONRANK'
  12. 'EXONSTART'
  13. 'EXONSTRAND'
  14. 'GENEID'
  15. 'TXCHROM'
  16. 'TXEND'
  17. 'TXID'
  18. 'TXNAME'
  19. 'TXSTART'
  20. 'TXSTRAND'
In [48]:
k <- keys(txdb, "TXNAME")
tx5gene <- select(txdb, keys=k, keytype="TXNAME", column="GENEID")
head(tx5gene)
'select()' returned 1:1 mapping between keys and columns
       TXNAME    GENEID
  AT1G01010.1 AT1G01010
  AT1G01040.1 AT1G01040
  AT1G01040.2 AT1G01040
  AT1G01046.1 AT1G01046
  AT1G01073.1 AT1G01073
  AT1G01110.2 AT1G01110

Option 3. Import a GFF annotation file from somewhere else.

To convert a GFF file to a TxDb database, use the makeTxDbFromGFF() function from the GenomicFeatures package:

In [49]:
library(GenomicFeatures)
gffFile <- system.file("extdata", "GFF3_files", "a.gff3", 
                       package="GenomicFeatures")
txdb <- makeTxDbFromGFF(file=gffFile,
            dataSource="partial gtf file for Tomatoes for testing",
            organism="Solanum lycopersicum")
keytypes(txdb)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Warning message in .get_cds_IDX(type, phase):
“The "phase" metadata column contains non-NA values for features of type
  exon. This information was ignored.”OK
  1. 'CDSID'
  2. 'CDSNAME'
  3. 'EXONID'
  4. 'EXONNAME'
  5. 'GENEID'
  6. 'TXID'
  7. 'TXNAME'
In [50]:
columns(txdb)
  1. 'CDSCHROM'
  2. 'CDSEND'
  3. 'CDSID'
  4. 'CDSNAME'
  5. 'CDSPHASE'
  6. 'CDSSTART'
  7. 'CDSSTRAND'
  8. 'EXONCHROM'
  9. 'EXONEND'
  10. 'EXONID'
  11. 'EXONNAME'
  12. 'EXONRANK'
  13. 'EXONSTART'
  14. 'EXONSTRAND'
  15. 'GENEID'
  16. 'TXCHROM'
  17. 'TXEND'
  18. 'TXID'
  19. 'TXNAME'
  20. 'TXSTART'
  21. 'TXSTRAND'
  22. 'TXTYPE'
In [51]:
k <- keys(txdb, "TXNAME")
tx6gene <- select(txdb, keys=k, keytype="TXNAME", column="GENEID")
head(tx6gene)
'select()' returned 1:1 mapping between keys and columns

Note, the GenomicFeatures package enables you to make a TxDb from a variety of sources:

In [52]:
grep("makeTxDb", ls("package:GenomicFeatures"), val=TRUE)
  1. 'makeTxDb'
  2. 'makeTxDbFromBiomart'
  3. 'makeTxDbFromEnsembl'
  4. 'makeTxDbFromGFF'
  5. 'makeTxDbFromGRanges'
  6. 'makeTxDbFromUCSC'
  7. 'makeTxDbPackage'
  8. 'makeTxDbPackageFromBiomart'
  9. 'makeTxDbPackageFromUCSC'