An Introduction to Data Analysis of Single Cell RNASeq: An illustration with 10X Chromium Data¶

Getting Started¶

We will run through a quick hands-on demo of the workflow for a single cell RNAseq dataset using the 10X genomics Chromium platform.

For this tutorial, we will be using three R packages Seurat, Scater and Scran. As we have discussed, methods and package development in this ares is highly dynamic and therefore we expect things to change very soon. However, there are a couple of standards that have been implemented:

The development of the SingleCellExperiment data structure in Bioconductor, which will eventually be the de-facto standard for future Bioconductor pacakges
The general workflow for a given set of data

Let us now load the required packages

suppressPackageStartupMessages(library(Seurat))

library(IRdisplay)

suppressPackageStartupMessages(library(scran))

suppressPackageStartupMessages(library(scater))

This workshop is based on the Seurat-Guided Clustering Tutorial which uses aa dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. We have already downloaded this datase in prepartion.

First, we read in the data using a Seurat function

pbmc.data <- Read10X(data.dir = "~/workshop_materials/scrna_workshop_data/filtered_gene_bc_matrices/hg19/")

user system elapsed
8.404 0.100 8.504

Initialize the Seurat object with the raw (non-normalized data).

pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 0, min.genes = 0, project = "10X_PBMC")

user system elapsed
3.040 0.016 3.057

str(pbmc)

Formal class 'seurat' [package "Seurat"] with 20 slots
  ..@ raw.data    :Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:2286884] 70 166 178 326 363 410 412 492 494 495 ...
  .. .. ..@ j       : int [1:2286884] 0 0 0 0 0 0 0 0 0 0 ...
  .. .. ..@ Dim     : int [1:2] 32738 2700
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:32738] "MIR1302-10" "FAM138A" "OR4F5" "RP11-34P13.7" ...
  .. .. .. ..$ : chr [1:2700] "AAACATACAACCAC" "AAACATTGAGCTAC" "AAACATTGATCAGC" "AAACCGTGCTTCCG" ...
  .. .. ..@ x       : num [1:2286884] 1 1 2 1 1 1 1 41 1 1 ...
  .. .. ..@ factors : list()
  ..@ data        :Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:2286884] 70 166 178 326 363 410 412 492 494 495 ...
  .. .. ..@ j       : int [1:2286884] 0 0 0 0 0 0 0 0 0 0 ...
  .. .. ..@ Dim     : int [1:2] 32738 2700
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:32738] "MIR1302-10" "FAM138A" "OR4F5" "RP11-34P13.7" ...
  .. .. .. ..$ : chr [1:2700] "AAACATACAACCAC" "AAACATTGAGCTAC" "AAACATTGATCAGC" "AAACCGTGCTTCCG" ...
  .. .. ..@ x       : num [1:2286884] 1 1 2 1 1 1 1 41 1 1 ...
  .. .. ..@ factors : list()
  ..@ scale.data  : NULL
  ..@ var.genes   : logi(0) 
  ..@ is.expr     : num 0
  ..@ ident       : Factor w/ 1 level "10X_PBMC": 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:2700] "AAACATACAACCAC" "AAACATTGAGCTAC" "AAACATTGATCAGC" "AAACCGTGCTTCCG" ...
  ..@ meta.data   :'data.frame':	2700 obs. of  3 variables:
  .. ..$ nGene     : int [1:2700] 781 1352 1131 960 522 782 783 790 533 550 ...
  .. ..$ nUMI      : num [1:2700] 2421 4903 3149 2639 981 ...
  .. ..$ orig.ident: Factor w/ 1 level "10X_PBMC": 1 1 1 1 1 1 1 1 1 1 ...
  ..@ project.name: chr "10X_PBMC"
  ..@ dr          : list()
  ..@ assay       : list()
  ..@ hvg.info    :'data.frame':	0 obs. of  0 variables
Formal class 'data.frame' [package "methods"] with 4 slots
  .. .. ..@ .Data    : list()
  .. .. ..@ names    : chr(0) 
  .. .. ..@ row.names: int(0) 
  .. .. ..@ .S3Class : chr "data.frame"
  ..@ imputed     :'data.frame':	0 obs. of  0 variables
Formal class 'data.frame' [package "methods"] with 4 slots
  .. .. ..@ .Data    : list()
  .. .. ..@ names    : chr(0) 
  .. .. ..@ row.names: int(0) 
  .. .. ..@ .S3Class : chr "data.frame"
  ..@ cell.names  : chr [1:2700] "AAACATACAACCAC" "AAACATTGAGCTAC" "AAACATTGATCAGC" "AAACCGTGCTTCCG" ...
  ..@ cluster.tree: list()
  ..@ snn         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int(0) 
  .. .. ..@ p       : int 0
  .. .. ..@ Dim     : int [1:2] 0 0
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ x       : num(0) 
  .. .. ..@ factors : list()
  ..@ calc.params :List of 1
  .. ..$ CreateSeuratObject:List of 13
  .. .. ..$ project             : chr "10X_PBMC"
  .. .. ..$ min.cells           : num 0
  .. .. ..$ min.genes           : num 0
  .. .. ..$ is.expr             : num 0
  .. .. ..$ normalization.method: NULL
  .. .. ..$ scale.factor        : num 10000
  .. .. ..$ do.scale            : logi FALSE
  .. .. ..$ do.center           : logi FALSE
  .. .. ..$ names.field         : num 1
  .. .. ..$ names.delim         : chr "_"
  .. .. ..$ display.progress    : logi TRUE
  .. .. ..$ ...                 : symbol 
  .. .. ..$ time                : POSIXct[1:1], format: "2018-04-30 19:05:50"
  ..@ kmeans      : NULL
  ..@ spatial     :Formal class 'spatial.info' [package "Seurat"] with 4 slots
  .. .. ..@ mix.probs    :'data.frame':	2700 obs. of  1 variable:
  .. .. .. ..$ nGene: int [1:2700] 781 1352 1131 960 522 782 783 790 533 550 ...
  .. .. ..@ mix.param    :'data.frame':	0 obs. of  0 variables
Formal class 'data.frame' [package "methods"] with 4 slots
  .. .. .. .. ..@ .Data    : list()
  .. .. .. .. ..@ names    : chr(0) 
  .. .. .. .. ..@ row.names: int(0) 
  .. .. .. .. ..@ .S3Class : chr "data.frame"
  .. .. ..@ final.prob   :'data.frame':	0 obs. of  0 variables
Formal class 'data.frame' [package "methods"] with 4 slots
  .. .. .. .. ..@ .Data    : list()
  .. .. .. .. ..@ names    : chr(0) 
  .. .. .. .. ..@ row.names: int(0) 
  .. .. .. .. ..@ .S3Class : chr "data.frame"
  .. .. ..@ insitu.matrix:'data.frame':	0 obs. of  0 variables
Formal class 'data.frame' [package "methods"] with 4 slots
  .. .. .. .. ..@ .Data    : list()
  .. .. .. .. ..@ names    : chr(0) 
  .. .. .. .. ..@ row.names: int(0) 
  .. .. .. .. ..@ .S3Class : chr "data.frame"
  ..@ misc        : NULL
  ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
  .. ..$ : int [1:3] 2 3 0

From Seurat tutorial Keep all genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at least 200 detected genes. The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat. For non-UMI data, nUMI represents the sum of the non-normalized values within a cell We calculate the percentage of mitochondrial genes here and store it in percent.mito using AddMetaData. We use object@raw.data since this represents non-transformed and non-log-normalized counts. The % of UMI mapping to MT-genes is a common scRNA-seq QC metric.

mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)

We create a gene map table from the 10X data as Seurat by default only uses the gene names instead of the Ensembl ID and this can lead to difficulty in assigning cell cycle phase later on.

gene_map <- read.table("~/workshop_materials/scrna_workshop_data/filtered_gene_bc_matrices/hg19/genes.tsv",
                       header=F)
names(gene_map)<-c("Ens_ID","name")
head(gene_map, n=2)

Lets take a peek at the Raw Data

as.matrix(tail(pbmc@raw.data, n=2))

Lets create an SingleCellExperiment object from the seurat object so we can use the SCATER and SCRAN packages

sce <- SingleCellExperiment(list(counts=as.matrix(pbmc@raw.data)))

str(sce)

Formal class 'SingleCellExperiment' [package "SingleCellExperiment"] with 10 slots
  ..@ int_elementMetadata:Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 32738
  .. .. ..@ listData       : Named list()
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ int_colData        :Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 2700
  .. .. ..@ listData       : Named list()
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ int_metadata       :List of 1
  .. ..$ version:Classes 'package_version', 'numeric_version'  hidden list of 1
  .. .. ..$ : int [1:3] 1 0 0
  ..@ reducedDims        :Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
  .. .. ..@ listData       : list()
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ rowRanges          :Formal class 'GRangesList' [package "GenomicRanges"] with 5 slots
  .. .. ..@ unlistData     :Formal class 'GRanges' [package "GenomicRanges"] with 6 slots
  .. .. .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. ..@ values         : Factor w/ 0 levels: 
  .. .. .. .. .. .. ..@ lengths        : int(0) 
  .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. .. .. ..@ start          : int(0) 
  .. .. .. .. .. .. ..@ width          : int(0) 
  .. .. .. .. .. .. ..@ NAMES          : NULL
  .. .. .. .. .. .. ..@ elementType    : chr "integer"
  .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 
  .. .. .. .. .. .. ..@ lengths        : int(0) 
  .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. ..@ elementMetadata:Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
  .. .. .. .. .. .. ..@ rownames       : NULL
  .. .. .. .. .. .. ..@ nrows          : int 0
  .. .. .. .. .. .. ..@ listData       : Named list()
  .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. .. .. .. .. ..@ seqnames   : chr(0) 
  .. .. .. .. .. .. ..@ seqlengths : int(0) 
  .. .. .. .. .. .. ..@ is_circular: logi(0) 
  .. .. .. .. .. .. ..@ genome     : chr(0) 
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ elementMetadata:Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
  .. .. .. .. ..@ rownames       : NULL
  .. .. .. .. ..@ nrows          : int 32738
  .. .. .. .. ..@ listData       : Named list()
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ partitioning   :Formal class 'PartitioningByEnd' [package "IRanges"] with 5 slots
  .. .. .. .. ..@ end            : int [1:32738] 0 0 0 0 0 0 0 0 0 0 ...
  .. .. .. .. ..@ NAMES          : chr [1:32738] "MIR1302-10" "FAM138A" "OR4F5" "RP11-34P13.7" ...
  .. .. .. .. ..@ elementType    : chr "integer"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ elementType    : chr "GRanges"
  .. .. ..@ metadata       : list()
  ..@ colData            :Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : chr [1:2700] "AAACATACAACCAC" "AAACATTGAGCTAC" "AAACATTGATCAGC" "AAACCGTGCTTCCG" ...
  .. .. ..@ nrows          : int 2700
  .. .. ..@ listData       : Named list()
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ assays             :Reference class 'ShallowSimpleListAssays' [package "SummarizedExperiment"] with 1 field
  .. ..$ data: NULL
  .. ..and 14 methods.
  ..@ NAMES              : NULL
  ..@ elementMetadata    :Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 32738
  .. .. ..@ listData       : Named list()
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ metadata           : list()

Then we can add an simple CPM transformation to the original matrix count matrix and store it

exprs(sce) <- log2(calculateCPM(sce, use.size.factors = FALSE) + 1)  #SCATER

dim(exprs(sce))

head(rowSums(exprs(sce)), n=2)

head(colnames(sce), n=2)

These packages also let us use spike-in data if thats available, which is not the case for droplet based methods, but works well in other technbologies. However, we can use the functionality to mark other sets of genes, in this case the gene expression annotated to be of mitochondrial origin.

isSpike(sce, "MT") <- grepl("^MT-", rownames(sce)) #SCATER

One advantage of the SCATER package is that there is an automated procedure for calculating a large set of QC metics automatically. In the case of Seurat you would have to manually calculate each metric as show below for the calculation of the mitochondirial percent UMI's (Not run):

# The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat.
# For non-UMI data, nUMI represents the sum of the non-normalized values within a cell
# We calculate the percentage of mitochondrial genes here and store it in percent.mito using AddMetaData.
# We use object@raw.data since this represents non-transformed and non-log-normalized counts
# The % of UMI mapping to MT-genes is a common scRNA-seq QC metric.


mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ]) / Matrix::colSums(pbmc@raw.data)



# AddMetaData adds columns to object@meta.data, and is a great place to stash QC stats
pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")

sce <- calculateQCMetrics(sce,feature_controls = list(MT = isSpike(sce, "MT") )) #SCATER

user system elapsed
9.988 0.268 10.257

Lets see what metrics have been derived for:

The cells

cbind(names(colData(sce)))

The genes

cbind(names(rowData(sce)))

Finally, let us add some cell cycle information from the Scran package. This package has default training datasets for assingment of cell cycle states to data for Human and Mouse.

But, first we use our gene_map dataset to map the gene names to Ensembl Gene IDs as the training sets are based on Ensembl Gene IDs

ens_pos <- match(rownames(counts(sce)), gene_map$name)
rownames(sce@assays[['counts']]) <- gene_map$Ens_ID[ens_pos]

INFO!! Double check that the assignments are correct

cbind(rownames(counts(sce))[1:6],as.character(gene_map$name[1:6]))

Read in the pre-computed cell-cycle data from the scran package

hg.pairs <- readRDS(system.file("exdata", "human_cycle_markers.rds", package="scran"))

INFO: For using the precomputed mouse sets we do:

mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))

NOTE: We will be using the code below to assign the cell-cycle phases for each cell using the cyclone algorithm presented in [Scialdone, A. et al](http://linkinghub.elsevier.com/retrieve/pii/S1046202315300098). We do not run this as it takes > 1 hour to run, instead these values are precomputed and stored as an R object that we will load.

assigned <- cyclone(sce@assays[['counts']], pairs=hg.pairs)
save(assigned,file="~/scrna_workshop/scrna_workshop_data/pbmc_3k_cyclone_assigned.rda")

user system elapsed
4734.160 0.000 4734.586

load(file="~/workshop_materials/scrna_workshop_data/pbmc_3k_cyclone_assigned.rda")

Let us explore the assigned data set

head(assigned$phases)

head(assigned$scores)

table(assigned$phases)

  G1  G2M    S 
2541   63   96

We will now add the scores and the phases to the colData component of our SingleCellExpressionSet and save the final raw data object we have created

newcoldat <- colData(sce)

newcoldat$cycle_phases <- assigned$phases

newcoldat<- cbind(newcoldat,assigned$scores)

colData(sce) <- newcoldat

We will save the raw data object in case we want to work with it using a different filtering strategy

save(sce,file="~/scrna_workshop/scrna_workshop_data/pbmc_3k_sce_raw.rda")

load(file="~/workshop_materials/scrna_workshop_data/pbmc_3k_sce_raw.rda")

QC and Filtering¶

We are now ready to work with our dataset to do some filtering and QC.

As a first step, let us first remove genes with no expression

keep_feature <- rowSums(counts(sce) > 0 ) > 0
table(keep_feature)

keep_feature
FALSE  TRUE 
16104 16634

As we see above approximately 50% of the genes have zero counts across all cells. This could be either due to the sensitivity of the experimental method, sequencing depth or that these genes are not expressed.

We will go ahead and proceed to drop these from our dataset

sce.filt <- sce[keep_feature,]

dim(sce.filt)

QC on Cells¶

One measure of quality is the proportion of reads mapped to genes in the mitochondrial genome. High proportions are indicative of poor-quality cells (Ilicic et al., 2016; Islam et al., 2014), possibly because of increased apoptosis and/or loss of cytoplasmic RNA from lysed cells. Similar reasoning applies to the proportion of reads mapped to spike-in transcripts. The quantity of spike-in RNA added to each cell should be constant, which means that the proportion should increase upon loss of endogenous RNA in low-quality cells. The distributions of mitochondrial proportions across all cells are shown below

g1 = ggplot(as.data.frame(colData(sce.filt))) 
g1 = g1 + geom_histogram(aes(x=pct_counts_MT),bins=200,fill = "grey")
#g1 = g1 + geom_density(aes(x=pct_counts_MT)) 
g1

The ideal threshold for these proportions depends on the cell type and the experimental protocol.

Cells with more mitochondria or more mitochondrial activity may naturally have larger mitochondrial proportions.
Similarly, cells with more endogenous RNA or that are assayed with protocols using less spike-in RNA will have lower spike-in proportions.

If we assume that most cells in the dataset are of high quality, then the threshold can be set to remove any large outliers from the distribution of proportions. We can use the MAD-based definition of outliers to remove putative low-quality cells from the dataset.

This basically filters the variable based on the Median Absolute Deviation(MAD), which is defineas as the absolute value of the deviation of each value from the median. The median is a more robust measure relative to the mean which can be influenced by outliers

mito.keep <- !(isOutlier(sce.filt$pct_counts_MT, nmads=3, type="higher")) #SCATER
table(mito.keep)

mito.keep
FALSE  TRUE 
   93  2607

NOTE: The Seurat tutorial just uses an threshold based approach which filters by a fixed number and they recommend filtering cells with mitochondrial expression > 10%

filter_by_MT <- colData(sce.filt)$pct_counts_MT < 10 
table(filter_by_MT)

filter_by_MT
FALSE  TRUE 
    6  2694

Library size¶

Two other common measures of cell quality are the library size and the number of expressed features in each library. The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with relatively small library sizes are considered to be of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation. The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.

See below for how the threshold is calculated using the Median Absolute Deviation (MAD) from the data

threshold = median(log(colData(sce.filt)$total_counts)) - 3*mad(log(colData(sce.filt)$total_counts))
threshold

Scater provides a function isOutlier that calculates these values and createa filter based on the threshold and the threshol type Upper, lower or both

libsize.keep <- !isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
table(libsize.keep)

libsize.keep
FALSE  TRUE 
   79  2621

Lets look at the threshold on the raw scale

g1 = ggplot(as.data.frame(colData(sce))) 
g1 = g1 + geom_histogram(aes(x=total_counts, y= ..density..),bins=200,fill = "grey")
g1 = g1 + geom_density(aes(x=total_counts)) 
g1 + geom_vline(xintercept = exp(threshold), color="red")

And the filtering threshold on the log scale

g1 = ggplot(as.data.frame(colData(sce))) 
g1 = g1 + geom_histogram(aes(x=log(total_counts), y= ..density..),bins=200,fill = "grey")
g1 = g1 + geom_density(aes(x=log(total_counts)))
g1 + geom_vline(xintercept = threshold, color="red")

Lets run filtering on the total number of genes detected per cell

threshold = median(log(colData(sce)$total_features)) - 3*mad(log(colData(sce)$total_features))
threshold
                   
feature.keep <- !isOutlier(sce$total_features, nmads=3, type="lower", log=TRUE)
table(feature.keep)

feature.keep
FALSE  TRUE 
   97  2603

g1 = ggplot(as.data.frame(colData(sce))) 
g1 = g1 + geom_histogram(aes(x=log(total_features), y= ..density..),bins=200,fill = "grey")
g1 = g1 + geom_density(aes(x=log(total_features)))
g1 + geom_vline(xintercept = threshold, color="red")

Scater provides a way to store the filters in a vector so that we can use it at the very end to filter based on all the possible filters we generate

sce.filt$use <- (
    # sufficient features (genes)
    feature.keep &
    # sufficient molecules counted
    libsize.keep &
    # remove cells with unusual number of reads in MT genes
    mito.keep
)

table(sce.filt$use)

FALSE  TRUE 
  174  2526

Lets also filter to keep all genes with expression values in at least 2 cells

filter_genes <- apply(counts(sce.filt[ ,sce.filt$use]), 1, 
    function(x)
        {
        length(x[x > 1]) >= 2
        }
    )

rowData(sce.filt)$use <- filter_genes
table(rowData(sce.filt)$use)

FALSE  TRUE 
 9550  7084

table(colData(sce.filt)$use)

FALSE  TRUE 
  174  2526

The Scater package provides several plot functions that are very useful and it is highly recommneded that you explore all the possibilities. Furthermore, once you have the Use to generate multiple kinds of expression QC plots:

plotQC(sce.filt, type = "highest-expression", exprs_values = "counts")

plotExprsFreqVsMean(sce.filt)

`geom_smooth()` using method = 'loess'

We can generate MDS plots and annotate by cell atributes to see if there are any biases

plotMDS(sce.filt,exprs_value="logcounts", colour_by = "cycle_phases")

We can also generate expression plots and explore different genes

plotExpression(sce.filt, features=1:20)

We can also plot scatter plots to see there any correlations with cell attributee. Lets first look at the atttributes we have avaialbale

colnames(colData(sce.filt))

plotColData(sce.filt, aes(x = log10_total_counts, y = pct_counts_MT, color=cycle_phases))

Finally we can also use plot diffusion maps a non-linear dimension reduction approach that

plotDiffusionMap(sce.filt)
#, color_by="cycle_phases")
#, aes(x = log10_total_counts, y = total_features, colour = log10_mean_counts))

We use the information above to filter out cells. Here we choose those that have percent mitochondrial genes max of 10% and unique UMI counts under 20,000 or greater than 500, Note that low.thresholds and high.thresholds are used to define a 'gate' -Inf and Inf should be used if you don't want a lower or upper threshold.

Lets save the sce.filt object for the next set of analysis

#save(sce.filt, file="~/workshop_materials/scrna_workshop_data/pbmc_3k_sce_filt.rda")

	AAACATACAACCAC	AAACATTGAGCTAC	AAACATTGATCAGC	AAACCGTGCTTCCG	AAACCGTGTATGCG	AAACGCACTGGTAC	AAACGCTGACCAGT	AAACGCTGGTTCTT	AAACGCTGTAGCCA	AAACGCTGTTTCTG	⋯	TTTCAGTGTCACGA	TTTCAGTGTCTATC	TTTCAGTGTGCAGT	TTTCCAGAGGTGAG	TTTCGAACACCTGA	TTTCGAACTCTCAT	TTTCTACTGAGGCA	TTTCTACTTCCTCG	TTTGCATGAGAGGC	TTTGCATGCCTCAC
AC002321.2	0	0	0	0	0	0	0	0	0	0	⋯	0	0	0	0	0	0	0	0	0	0
AC002321.1	0	0	0	0	0	0	0	0	0	0	⋯	0	0	0	0	0	0	0	0	0	0

G1	S	G2M
0.989	0.263	0.002
0.541	0.659	0.045
0.998	0.423	0.000
0.947	0.088	0.079
0.428	0.292	0.202
0.965	0.092	0.049

MIR1302-10	MIR1302-10
FAM138A	FAM138A
OR4F5	OR4F5
RP11-34P13.7	RP11-34P13.7
RP11-34P13.8	RP11-34P13.8
AL627309.1	AL627309.1