We will run through a quick hands-on demo of the workflow for a single cell RNAseq dataset using the 10X genomics Chromium platform.
For this tutorial, we will be using three R packages Seurat
, Scater
and Scran
. As we have discussed, methods and package development in this ares is highly dynamic and therefore we expect things to change very soon. However, there are a couple of standards that have been implemented:
The development of the SingleCellExperiment
data structure in Bioconductor, which will eventually be the de-facto standard for future Bioconductor pacakges
The general workflow for a given set of data
Let us now load the required packages
suppressPackageStartupMessages(library(Seurat))
library(IRdisplay)
suppressPackageStartupMessages(library(scran))
suppressPackageStartupMessages(library(scater))
This workshop is based on the Seurat-Guided Clustering Tutorial which uses aa dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. We have already downloaded this datase in prepartion.
First, we read in the data using a Seurat
function
pbmc.data <- Read10X(data.dir = "~/workshop_materials/scrna_workshop_data/filtered_gene_bc_matrices/hg19/")
Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 0, min.genes = 0, project = "10X_PBMC")
str(pbmc)
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)
We create a gene map table from the 10X data as Seurat by default only uses the gene names instead of the Ensembl ID and this can lead to difficulty in assigning cell cycle phase later on.
gene_map <- read.table("~/workshop_materials/scrna_workshop_data/filtered_gene_bc_matrices/hg19/genes.tsv",
header=F)
names(gene_map)<-c("Ens_ID","name")
head(gene_map, n=2)
Lets take a peek at the Raw Data
as.matrix(tail(pbmc@raw.data, n=2))
SingleCellExperiment
object from the seurat
object so we can use the SCATER
and SCRAN
packages
sce <- SingleCellExperiment(list(counts=as.matrix(pbmc@raw.data)))
str(sce)
Then we can add an simple CPM transformation to the original matrix count matrix and store it
exprs(sce) <- log2(calculateCPM(sce, use.size.factors = FALSE) + 1) #SCATER
dim(exprs(sce))
head(rowSums(exprs(sce)), n=2)
head(colnames(sce), n=2)
These packages also let us use spike-in data if thats available, which is not the case for droplet based methods, but works well in other technbologies. However, we can use the functionality to mark other sets of genes, in this case the gene expression annotated to be of mitochondrial origin.
isSpike(sce, "MT") <- grepl("^MT-", rownames(sce)) #SCATER
SCATER
package is that there is an automated procedure for calculating a large set of QC metics automatically.
In the case of Seurat you would have to manually calculate each metric as show below for the calculation of the mitochondirial percent UMI's (Not run):# The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat.
# For non-UMI data, nUMI represents the sum of the non-normalized values within a cell
# We calculate the percentage of mitochondrial genes here and store it in percent.mito using AddMetaData.
# We use object@raw.data since this represents non-transformed and non-log-normalized counts
# The % of UMI mapping to MT-genes is a common scRNA-seq QC metric.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ]) / Matrix::colSums(pbmc@raw.data)
# AddMetaData adds columns to object@meta.data, and is a great place to stash QC stats
pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")
sce <- calculateQCMetrics(sce,feature_controls = list(MT = isSpike(sce, "MT") )) #SCATER
Lets see what metrics have been derived for:
cbind(names(colData(sce)))
cbind(names(rowData(sce)))
Finally, let us add some cell cycle information from the Scran
package. This package has default training datasets for assingment of cell cycle states to data for Human and Mouse.
But, first we use our gene_map
dataset to map the gene names to Ensembl Gene IDs as the training sets are based on Ensembl Gene IDs
ens_pos <- match(rownames(counts(sce)), gene_map$name)
rownames(sce@assays[['counts']]) <- gene_map$Ens_ID[ens_pos]
cbind(rownames(counts(sce))[1:6],as.character(gene_map$name[1:6]))
Read in the pre-computed cell-cycle data from the scran
package
hg.pairs <- readRDS(system.file("exdata", "human_cycle_markers.rds", package="scran"))
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))
assigned <- cyclone(sce@assays[['counts']], pairs=hg.pairs)
save(assigned,file="~/scrna_workshop/scrna_workshop_data/pbmc_3k_cyclone_assigned.rda")
load(file="~/workshop_materials/scrna_workshop_data/pbmc_3k_cyclone_assigned.rda")
Let us explore the assigned data set
head(assigned$phases)
head(assigned$scores)
table(assigned$phases)
We will now add the scores and the phases to the colData component of our SingleCellExpressionSet and save the final raw data object we have created
newcoldat <- colData(sce)
newcoldat$cycle_phases <- assigned$phases
newcoldat<- cbind(newcoldat,assigned$scores)
colData(sce) <- newcoldat
save(sce,file="~/scrna_workshop/scrna_workshop_data/pbmc_3k_sce_raw.rda")
load(file="~/workshop_materials/scrna_workshop_data/pbmc_3k_sce_raw.rda")
We are now ready to work with our dataset to do some filtering and QC.
As a first step, let us first remove genes with no expression
keep_feature <- rowSums(counts(sce) > 0 ) > 0
table(keep_feature)
As we see above approximately 50% of the genes have zero counts across all cells. This could be either due to the sensitivity of the experimental method, sequencing depth or that these genes are not expressed.
We will go ahead and proceed to drop these from our dataset
sce.filt <- sce[keep_feature,]
dim(sce.filt)
One measure of quality is the proportion of reads mapped to genes in the mitochondrial genome. High proportions are indicative of poor-quality cells (Ilicic et al., 2016; Islam et al., 2014), possibly because of increased apoptosis and/or loss of cytoplasmic RNA from lysed cells. Similar reasoning applies to the proportion of reads mapped to spike-in transcripts. The quantity of spike-in RNA added to each cell should be constant, which means that the proportion should increase upon loss of endogenous RNA in low-quality cells. The distributions of mitochondrial proportions across all cells are shown below
g1 = ggplot(as.data.frame(colData(sce.filt)))
g1 = g1 + geom_histogram(aes(x=pct_counts_MT),bins=200,fill = "grey")
#g1 = g1 + geom_density(aes(x=pct_counts_MT))
g1
The ideal threshold for these proportions depends on the cell type and the experimental protocol.
If we assume that most cells in the dataset are of high quality, then the threshold can be set to remove any large outliers from the distribution of proportions. We can use the MAD-based definition of outliers to remove putative low-quality cells from the dataset.
This basically filters the variable based on the Median Absolute Deviation(MAD), which is defineas as the absolute value of the deviation of each value from the median. The median is a more robust measure relative to the mean which can be influenced by outliers
mito.keep <- !(isOutlier(sce.filt$pct_counts_MT, nmads=3, type="higher")) #SCATER
table(mito.keep)
filter_by_MT <- colData(sce.filt)$pct_counts_MT < 10
table(filter_by_MT)
Two other common measures of cell quality are the library size and the number of expressed features in each library. The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with relatively small library sizes are considered to be of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation. The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
threshold = median(log(colData(sce.filt)$total_counts)) - 3*mad(log(colData(sce.filt)$total_counts))
threshold
Scater
provides a function isOutlier
that calculates these values and createa filter based on the threshold and the threshol type Upper, lower or both
libsize.keep <- !isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
table(libsize.keep)
Lets look at the threshold on the raw scale
g1 = ggplot(as.data.frame(colData(sce)))
g1 = g1 + geom_histogram(aes(x=total_counts, y= ..density..),bins=200,fill = "grey")
g1 = g1 + geom_density(aes(x=total_counts))
g1 + geom_vline(xintercept = exp(threshold), color="red")
And the filtering threshold on the log scale
g1 = ggplot(as.data.frame(colData(sce)))
g1 = g1 + geom_histogram(aes(x=log(total_counts), y= ..density..),bins=200,fill = "grey")
g1 = g1 + geom_density(aes(x=log(total_counts)))
g1 + geom_vline(xintercept = threshold, color="red")
threshold = median(log(colData(sce)$total_features)) - 3*mad(log(colData(sce)$total_features))
threshold
feature.keep <- !isOutlier(sce$total_features, nmads=3, type="lower", log=TRUE)
table(feature.keep)
g1 = ggplot(as.data.frame(colData(sce)))
g1 = g1 + geom_histogram(aes(x=log(total_features), y= ..density..),bins=200,fill = "grey")
g1 = g1 + geom_density(aes(x=log(total_features)))
g1 + geom_vline(xintercept = threshold, color="red")
Scater
provides a way to store the filters in a vector so that we can use it at the very end to filter based on all the possible filters we generate
sce.filt$use <- (
# sufficient features (genes)
feature.keep &
# sufficient molecules counted
libsize.keep &
# remove cells with unusual number of reads in MT genes
mito.keep
)
table(sce.filt$use)
Lets also filter to keep all genes with expression values in at least 2 cells
filter_genes <- apply(counts(sce.filt[ ,sce.filt$use]), 1,
function(x)
{
length(x[x > 1]) >= 2
}
)
rowData(sce.filt)$use <- filter_genes
table(rowData(sce.filt)$use)
table(colData(sce.filt)$use)
The Scater
package provides several plot functions that are very useful and it is highly recommneded that you explore all the possibilities. Furthermore, once you have the
Use to generate multiple kinds of expression QC plots:
plotQC(sce.filt, type = "highest-expression", exprs_values = "counts")
plotExprsFreqVsMean(sce.filt)
We can generate MDS plots and annotate by cell atributes to see if there are any biases
plotMDS(sce.filt,exprs_value="logcounts", colour_by = "cycle_phases")
We can also generate expression plots and explore different genes
plotExpression(sce.filt, features=1:20)
We can also plot scatter plots to see there any correlations with cell attributee. Lets first look at the atttributes we have avaialbale
colnames(colData(sce.filt))
plotColData(sce.filt, aes(x = log10_total_counts, y = pct_counts_MT, color=cycle_phases))
plotDiffusionMap(sce.filt)
#, color_by="cycle_phases")
#, aes(x = log10_total_counts, y = total_features, colour = log10_mean_counts))
We use the information above to filter out cells. Here we choose those that have percent mitochondrial genes max of 10% and unique UMI counts under 20,000 or greater than 500, Note that low.thresholds and high.thresholds are used to define a 'gate' -Inf and Inf should be used if you don't want a lower or upper threshold.
Lets save the sce.filt
object for the next set of analysis
#save(sce.filt, file="~/workshop_materials/scrna_workshop_data/pbmc_3k_sce_filt.rda")