Road map
General overview of Bioconductor annotation
Levels: reference sequence, regions of interest, pathways
OrgDb: simple interface to annotation databases
Discovering reference sequence
A build of the human genome
Gene/Transcript/Exon catalogs from UCSC and Ensembl
Finding and managing gene sets
Specific annotation concerns
Ontology concepts and tools
Importing and exporting regions and scores
AnnotationHub: brokering thousands of annotation resources
Bioconductor includes many different types of genomic annotation. We can think of these annotation resources in a hierarchical structure.
suppressPackageStartupMessages({
library(BSgenome)
library(DT)
library(Homo.sapiens)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(ensembldb)
library(EnsDb.Hsapiens.v75)
library(AnnotationHub)
library(GO.db)
library(KEGGREST)
library(grid)
library(DT)
# use biocLite("genomicsclass/ph525x")
library(GSEABase)
library(ph525x)
library(annotate)
library(ontoProc)
library(ontologyPlot)
library(ERBS) # from genomicsclass github
library(rtracklayer)
})
Gene-level annotation is distributed over various resources -- in part this is a consequence of resource limitations at the time the project was started. The modular layout that has endured is not too hard to master. In the next few cells we will use org.Hs.eg.db to do basic identifier mapping, GO.db to obtain details about Gene Ontology mappings, and KEGGREST to obtain details about pathways.
Simple translations between identifier types, and mappings from gene identifiers to pathway or gene set concepts, are available using NCBI's Entrez resources, packaged in org.Hs.eg.db.
The main operation is select
, which is defined in various contexts. You may,
particularly if dplyr is on the search path, have to disambiguate your usage with
AnnotationDbi::select
.
We begin by listing the types of keys that can be used for lookups.
library(org.Hs.eg.db) # following just compacts the display a little; just call keytypes()
data.frame(split(keytypes(org.Hs.eg.db), rep(c(1,2),c(13,13))))
A typical application is to acquire the Entrez ID or Ensembl ID for a gene whose symbol is known.
select(org.Hs.eg.db, keys=c("BRCA2", "ORMDL3"), keytype="SYMBOL", columns=c("ENTREZID", "ENSEMBL"))
This can be elaborated to obtain additional information. For example, to determine Gene Ontology and KEGG pathway annotations for this gene:
# columns(org.Hs.eg.db) # learn available field names for tabulation
brcatab = select(org.Hs.eg.db, keys="BRCA2", keytype="SYMBOL", columns=c("ENTREZID", "GO", "PATH"))
dim(brcatab)
There are numerous annotations for BRCA2.
head(brcatab)
When we ask for GO annotations, we also receive an evidence code. 'TAS' stands for 'traceable author statement', and we will confine attention to these annotations.
brcatab[brcatab$EVIDENCE=="TAS",]
The values for 'PATH' are 'organism-independent' pathway tags in KEGG. To focus on a given organism, a three-letter abbreviation must be prepended: hsaNNNNN for human, mmuNNNNN for mouse, and so on.
select(GO.db, keys=c("GO:0000731", "GO:0000732", "GO:0005654"), keytype="GOID", columns="TERM")
library(KEGGREST)
lkOne = keggGet("hsa03440")[[1]] # one list element per query element
# str(lkOne)
lkOne[1:2]
krg2df = function(x){
ans = data.frame(t(matrix(x, nrow=2)))
names(ans) = c("ENTREZ", "ANNO")
ans
}
datatable(krg2df(lkOne$GENE))
It is easy to retrieve a PNG image of the pathway of interest, as curated by KEGG.
img = keggGet("hsa05200", "image")
library(grid)
grid.raster(img)
A compatible approach is available using Ensembl resources.
From the Ensembl home page: "Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics". This project is lodged at the European Molecular Biology Lab, which has been supportive of general interoperation of annotation resources with Bioconductor.
The ensembldb package includes a vignette with the following commentary:
The ensembldb package provides functions to create and use transcript centric annotation databases/packages. The annotation for the databases are directly fetched from Ensembl 1 using their Perl API. The functionality and data is similar to that of the TxDb packages from the GenomicFeatures package, but, in addition to retrieve all gene/transcript models and annotations from the database, the ensembldb package provides also a filter framework allowing to retrieve annotations for specific entries like genes encoded on a chromosome region or transcript models of lincRNA genes. From version 1.7 on, EnsDb databases created by the ensembldb package contain also protein annotation data (see Section 11 for the database layout and an overview of available attributes/columns). For more information on the use of the protein annotations refer to the proteins vignette.
library(ensembldb)
library(EnsDb.Hsapiens.v75)
#data.frame(tabs=names(listTables(EnsDb.Hsapiens.v75)))
As an illustration, we'll obtain location and protein sequence for ORMDL3.
edb = EnsDb.Hsapiens.v75 # abbreviate
#columns(edb) # learn the names of the columns
select(edb, keytype="SYMBOL", keys="ORMDL3",
columns=c("GENEID", "GENENAME", "SEQNAME", "GENESEQSTART", "GENESEQEND", "TXID", "PROTEINSEQUENCE"))
Bioconductor's collection of annotation packages brings
all elements of this hierarchy into a programmable environment.
Reference genomic sequences are managed using the infrastructure
of the Biostrings and BSgenome packages, and the available.genomes
function lists the reference genome build for humans and
various model organisms now available.
library(BSgenome)
library(DT)
ag = available.genomes()
datatable(data.frame(packs=ag))
The reference build for an organism is created de novo and then refined as algorithms and sequenced data improve. For humans, the Genome Research Consortium signed off on build 37 in 2009, and on build 38 in 2013.
Once a reference build is completed, it becomes easy to perform informative genomic sequence analysis on individuals, because one can focus on regions that are known to harbor allelic diversity.
Note that the genome sequence packages have long names that include build versions. It is very important to avoid mixing coordinates from different reference builds. In the liftOver video we show how to convert genomic coordinates of features between different reference builds, using the UCSC "liftOver" utility interfaced to R in the rtracklayer package.
To help users avoid mixing up data collected on incompatible genomic coordinate systems from different reference builds, we include a "genome" tag that can be filled out for most objects that hold sequence information. We'll see some examples of this shortly. Software for sequence comparison can check for compatible tags on the sequences being compared, and thereby help to ensure meaningful results.
The reference sequence for Homo sapiens is acquired by installing
and attaching
a single package. This is in contrast to downloading and parsing
FASTA files. The package defines an object Hsapiens
that is the source of chromosomal sequence, but when
evaluated on its own
provides a report of the origins of the sequence data that
it contains.
library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens
We acquire a chromosome's sequence using the $
operator.
Hsapiens$chr17
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene # abbreviate
txdb
We can use genes()
to get the addresses of genes using
Entrez Gene IDs.
ghs = genes(txdb)
ghs
Filtering is supported, with suitable identifiers. Here we select all exons identified for two different genes, identified by their Entrez Gene ids:
eForTwo = exons(txdb, columns=c("EXONID", "TXNAME", "GENEID"),
filter=list(gene_id=c(100, 101)))
eForTwo
split(eForTwo, unlist(eForTwo$GENEID)) #notice that GENEID is a CharacterList
We've had a brief look at how to use org.Hs.eg.db, GO.db and KEGGREST to identify gene sets defined by ontology categories or pathways.
org.Hs.eg.db (and org.Mm.eg.db, org.Sc.sgd.db, etc.) provide GO Annotation mappings as determined by the curating organizations (NCBI, SGD, etc.). In this context gene sets are essentially vectors of identifiers.
KEGGREST delivers more information about its pathways. Pathways include genes and references to higher-level modules. References to literature and disease association are also provided.
str(lkOne)
A major repository of curated gene sets is the Broad Institute MSigDb. We use the GSEABase package to examine a snapshot of gene sets annotated to glioblastoma.
library(GSEABase)
glioG = getGmt(system.file("gmt/glioSets.gmt", package="ph525x"))
glioG
There are 3671 genes grouped into 47 sets. The names of the sets and their sizes can be reckoned through
datatable(data.frame(setnms=names(glioG), sizes=
vapply(glioG, function(x) length(geneIds(x)), numeric(1))))
To determine the overlap among gene sets, use code such as the following:
intersect(geneIds(glioG[[8]]), geneIds(glioG[[47]]))
How many genes are shared between the TCGA "copy number up" and Classical gene set identified by Verhaak?
The full structure of the GSEABase GeneSet class allows for inclusion of considerable metadata.
details(glioG[[44]])
Unfortunately, very little metadata are recorded. We can add some, recognizing that the Verhaak paper has PMID 20129251, and that the geneIdType is "symbol".
vhNeur = glioG[[44]]
pubMedIds(vhNeur) = "20129251"
details(vhNeur)
One utility of recording the PMID is that it simplifies retrieving the abstract of the associated paper:
library(annotate)
vhM = pmid2MIAME("20129251")
abstract(vhM)
Another nice feature of GSEABase is that it simplifies translation between elements of gene sets. To make use of this, we have to set the identifier type.
geneIdType(vhNeur) = SymbolIdentifier()
geneIdType(vhNeur) = AnnotationIdentifier("org.Hs.eg.db")
vhNeur
Notice that the translation of symbols to Entrez identifiers just demonstrated reduced the size of the gene set from 129 to 122. Why?
We've seen that GO.db provides gene ontology categories and tags, and org.Hs.eg.db provides mappings between categories and genes. Many ontologies have been created in the biosciences, and the ontologyIndex and rols packages are helpful for surveying the field.
Here we'll have a look at the Cell Ontology using the ontoProc helper package.
library(ontoProc)
ot = getCellOnto() # fixed date snapshot
ot
This shows that the Cell Ontology is composed of selections from a number of bioontologies.
We can visualize relationships among terms using ontologyPlot. We've selected five different terms related to neuron cells, and the plotting function expands the number of terms displayed to illustrate context.
library(ontologyPlot)
onto_plot(ot, c("CL:0000095", "CL:0000679", "CL:2000028", "CL:1001509", "CL:2000031"))
We have some textual data in bedGraph format, produced by ENCODE. It gives regions and scores for ESRRA binding in the GM12878 B-cell line.
f1 = dir(system.file("extdata",package="ERBS"), full=TRUE)[1]
f1
readLines(f1, 4) # look at a few lines
rtracklayer can be used to import this text to a GRanges instance.
library(rtracklayer)
bg1 = import.bedGraph(f1)
bg1
Note that the fields (with the exception of 'score') are unnamed, and the reference genome build is unspecified. The narrowPeak format document provides some interpretive help.
Clearly, if you had generated these results, you would want to clarify the interpretations of the fields and specify the reference genome build to which the addresses refer. Distributing your results as a GRanges would allow this binding of metadata to the experimental outcome.
In the event you need to transform data that you have as a GRanges to
a textual format, the export
method of rtracklayer can perform this.
td = tempdir()
tf = paste0(td, "/demoex.bed")
export(bg1, tf) # implicit format choice
cat(readLines(tf, n=5), sep="\n")
We have carried out a “round trip” of importing, modeling, and exporting experimental data that can be integrated with other data to advance biological understanding.
What we have to watch out for is the idea that annotation is somehow permanently correct, isolated from the cacophony of research progress at the boundaries of knowledge. We have seen that even the reference sequences of human chromosomes are subject to revision.
Bioconductor has taken pains to acknowledge many facets of this situation. We maintain archives of prior versions of software and annotation so that past work can be checked or revised. We update central annotation resources twice a year so that there is stability for ongoing work as well as access to new knowledge. And we have made it simple to import and to create representations of experimental and annotation data.
From the AnnotationHub vignette:
The AnnotationHub server provides easy R / Bioconductor access to large collections of publicly available whole genome resources, e.g,. ENSEMBL genome fasta or gtf files, UCSC chain resources, ENCODE data tracks at UCSC, etc.
We will get a general overview and then carry out a detailed query. We start by loading the package and obtaining a hub object.
library(AnnotationHub)
ah = AnnotationHub()
ah
Note that there is a specific snapshot date. The mcols
method produces metadata about
the various resources. The $
shortcut also works.
dim(mcols(ah))
The rdataclass
field of the metadata tells us what kinds of representations are available.
table(ah$rdataclass)
We can use the query
method to look for records
concerning resources available in the hub. An algorithm
for labeling chromatin regions according to epigenetic state is ChromHmm
and we search for records concerning
its application.
q1 = query(ah, "ChromHmm")
q1
Metadata related to this query can be displayed for deeper searching.
datatable(as.data.frame(mcols(q1)))
Recall the rdataclass query above.
table(ah$rdataclass)
What do the VCF files represent?
datatable(as.data.frame(mcols(ah)[which(ah$rdataclass=="VcfFile"),]))
Let's obtain the first ClinVar file. We use [[ with the tag. If we have not done this previously, the resource is added to our AnnotationHub cache.
cv1 = ah[["AH57956"]]
cv1
We'll look at the header for some hints.
cv1h = scanVcfHeader(cv1)
cv1h
The META component is usually informative.
meta(cv1h)$META
We'll now use readVcf
to extract content on a region of interest.
rv1 = readVcf(cv1, genome="GRCh37.p13", param=ScanVcfParam(which=GRanges("1", IRanges(949000, 996000))))
rv1
We now have metadata on 42 variants in the region of interest we specified. The addresses are:
rowRanges(rv1)
The info
metadata component can be picked apart into sets of columns. The first 10 columns are basic annotation.
info(rv1)[,1:10]
The last nine columns are for clinical genetics interpretation.
info(rv1)[,50:58]
To get more information like the link for the variation property bitfield you can use the info accessor on the header.
info(header(rv1))
In summary, AnnotationHub can harbor detailed information on many aspects of modern systems biology. Once you have identified something useful, it will be downloaded to your local system and cached, so that the retrieval need not recur. If the hub version is updated, the cache will be checked and refreshed as needed.
This has been a wide-ranging review of genomic annotation resources in Bioconductor. I'll annotate the road map with key functions and packages.
General overview of Bioconductor annotation
Levels: reference sequence, regions of interest, pathways
-- BSgenome.Hsapiens.UCSC.hg19, genes(Homo.sapiens)
OrgDb: simple interface to annotation databases
-- org.Hs.eg.db
Discovering reference sequence
-- available.genomes
A build of the human genome
-- Hsapiens$chr17
Gene/Transcript/Exon catalogs from UCSC and Ensembl
-- TxDb.Hsapiens.UCSC.hg19.knownGene, EnsDb.Hsapiens.v75
Finding and managing gene sets
-- GSEABase, KEGGREST
Specific annotation concerns
Ontology concepts and tools
-- ontoProc, ontologyPlot, rols
Importing and exporting regions and scores
-- rtracklayer
AnnotationHub: brokering thousands of annotation resources
-- AnnotationHub, query(), ah[[...]]