Brown Univ. Introduction to Bioconductor 2018, Period 3

Genomic annotation with Bioconductor

Road map

General overview of Bioconductor annotation
  Levels: reference sequence, regions of interest, pathways
  OrgDb: simple interface to annotation databases
  Discovering reference sequence
  A build of the human genome
  Gene/Transcript/Exon catalogs from UCSC and Ensembl
  Finding and managing gene sets

Specific annotation concerns
  Ontology concepts and tools
  Importing and exporting regions and scores
  AnnotationHub: brokering thousands of annotation resources

A hierarchy of annotation concepts

Bioconductor includes many different types of genomic annotation. We can think of these annotation resources in a hierarchical structure.

  • At the base is the reference genomic sequence for an organism. This is always arranged into chromosomes, consisting of linear sequences of nucleotides.
  • Above this is the organization of chromosomal sequence into regions of interest. The most prominent regions of interest are genes, but other structures like SNPs or CpG sites are annotated as well. Genes have internal structure, with parts that are transcribed and parts that are not, and "gene models" define the ways in which these structures are labeled and laid out in genomic coordinates.
  • Within this concept of regions of interest we also identify platform-oriented annotation. This type of annotation is typically provided first by the manufacturer of an assay, but then refined as research identifies ambiguities or updates to initially declared roles for assay probe elements. The brainarray project at University of Michigan illustrates this process for affymetrix array annotation. We address this topic of platform-oriented annotation at the very end of this chapter.
  • Above this is the organization of regions (most often genes or gene products) into groups with shared structural or functional properties. Examples include pathways, groups of genes found together in cells, or identified as cooperating in biological processes.
In [1]:
suppressPackageStartupMessages({
    library(BSgenome)
    library(DT)
    library(Homo.sapiens)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(org.Hs.eg.db)
    library(ensembldb)
    library(EnsDb.Hsapiens.v75)
    library(AnnotationHub)
    library(GO.db)
    library(KEGGREST)
    library(grid)
    library(DT)
    # use biocLite("genomicsclass/ph525x")
    library(GSEABase)
    library(ph525x)
    library(annotate)
    library(ontoProc)
    library(ontologyPlot)
    library(ERBS) # from genomicsclass github
    library(rtracklayer)
})

An easy way in to annotation for H. sapiens

Gene-level annotation is distributed over various resources -- in part this is a consequence of resource limitations at the time the project was started. The modular layout that has endured is not too hard to master. In the next few cells we will use org.Hs.eg.db to do basic identifier mapping, GO.db to obtain details about Gene Ontology mappings, and KEGGREST to obtain details about pathways.

Simple translations between identifier types, and mappings from gene identifiers to pathway or gene set concepts, are available using NCBI's Entrez resources, packaged in org.Hs.eg.db.

The main operation is select, which is defined in various contexts. You may, particularly if dplyr is on the search path, have to disambiguate your usage with AnnotationDbi::select.

Working with org.Hs.eg.db

We begin by listing the types of keys that can be used for lookups.

In [2]:
library(org.Hs.eg.db) # following just compacts the display a little; just call keytypes()
data.frame(split(keytypes(org.Hs.eg.db), rep(c(1,2),c(13,13))))
X1X2
ACCNUM MAP
ALIAS OMIM
ENSEMBL ONTOLOGY
ENSEMBLPROT ONTOLOGYALL
ENSEMBLTRANSPATH
ENTREZID PFAM
ENZYME PMID
EVIDENCE PROSITE
EVIDENCEALL REFSEQ
GENENAME SYMBOL
GO UCSCKG
GOALL UNIGENE
IPI UNIPROT

A typical application is to acquire the Entrez ID or Ensembl ID for a gene whose symbol is known.

In [3]:
select(org.Hs.eg.db, keys=c("BRCA2", "ORMDL3"), keytype="SYMBOL", columns=c("ENTREZID", "ENSEMBL"))
'select()' returned 1:1 mapping between keys and columns
SYMBOLENTREZIDENSEMBL
BRCA2 675 ENSG00000139618
ORMDL3 94103 ENSG00000172057

This can be elaborated to obtain additional information. For example, to determine Gene Ontology and KEGG pathway annotations for this gene:

In [4]:
# columns(org.Hs.eg.db) # learn available field names for tabulation
brcatab = select(org.Hs.eg.db, keys="BRCA2", keytype="SYMBOL", columns=c("ENTREZID", "GO", "PATH"))
dim(brcatab)
'select()' returned 1:many mapping between keys and columns
  1. 147
  2. 6

There are numerous annotations for BRCA2.

In [5]:
head(brcatab)
SYMBOLENTREZIDGOEVIDENCEONTOLOGYPATH
BRCA2 675 GO:0000722IEA BP 03440
BRCA2 675 GO:0000722IEA BP 05200
BRCA2 675 GO:0000722IEA BP 05212
BRCA2 675 GO:0000724IDA BP 03440
BRCA2 675 GO:0000724IDA BP 05200
BRCA2 675 GO:0000724IDA BP 05212

When we ask for GO annotations, we also receive an evidence code. 'TAS' stands for 'traceable author statement', and we will confine attention to these annotations.

In [6]:
brcatab[brcatab$EVIDENCE=="TAS",]
SYMBOLENTREZIDGOEVIDENCEONTOLOGYPATH
10BRCA2 675 GO:0000731TAS BP 03440
11BRCA2 675 GO:0000731TAS BP 05200
12BRCA2 675 GO:0000731TAS BP 05212
13BRCA2 675 GO:0000732TAS BP 03440
14BRCA2 675 GO:0000732TAS BP 05200
15BRCA2 675 GO:0000732TAS BP 05212
46BRCA2 675 GO:0005654TAS CC 03440
47BRCA2 675 GO:0005654TAS CC 05200
48BRCA2 675 GO:0005654TAS CC 05212

The values for 'PATH' are 'organism-independent' pathway tags in KEGG. To focus on a given organism, a three-letter abbreviation must be prepended: hsaNNNNN for human, mmuNNNNN for mouse, and so on.

Using GO.db

To learn the interpretation of the GO annotation tags, we use GO.db:

In [7]:
select(GO.db, keys=c("GO:0000731", "GO:0000732", "GO:0005654"), keytype="GOID", columns="TERM")
'select()' returned 1:1 mapping between keys and columns
GOIDTERM
GO:0000731 DNA synthesis involved in DNA repair
GO:0000732 strand displacement
GO:0005654 nucleoplasm

Using KEGG through the RESTful interface

We can decode KEGG pathway codes with an internet connection, using KEGGREST

In [8]:
library(KEGGREST)
lkOne = keggGet("hsa03440")[[1]] # one list element per query element
# str(lkOne)
lkOne[1:2]
krg2df = function(x){
    ans = data.frame(t(matrix(x, nrow=2)))
    names(ans) = c("ENTREZ", "ANNO")
    ans
    }
datatable(krg2df(lkOne$GENE))
$ENTRY
Pathway: 'hsa03440'
$NAME
'Homologous recombination - Homo sapiens (human)'
datatables

It is easy to retrieve a PNG image of the pathway of interest, as curated by KEGG.

In [9]:
img = keggGet("hsa05200", "image")
library(grid)
grid.raster(img)

The Ensembl resources for H. sapiens

A compatible approach is available using Ensembl resources.

From the Ensembl home page: "Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics". This project is lodged at the European Molecular Biology Lab, which has been supportive of general interoperation of annotation resources with Bioconductor.

The ensembldb package includes a vignette with the following commentary:

The ensembldb package provides functions to create and use transcript centric annotation databases/packages. The annotation for the databases are directly fetched from Ensembl 1 using their Perl API. The functionality and data is similar to that of the TxDb packages from the GenomicFeatures package, but, in addition to retrieve all gene/transcript models and annotations from the database, the ensembldb package provides also a filter framework allowing to retrieve annotations for specific entries like genes encoded on a chromosome region or transcript models of lincRNA genes. From version 1.7 on, EnsDb databases created by the ensembldb package contain also protein annotation data (see Section 11 for the database layout and an overview of available attributes/columns). For more information on the use of the protein annotations refer to the proteins vignette.

In [10]:
library(ensembldb)
library(EnsDb.Hsapiens.v75)
#data.frame(tabs=names(listTables(EnsDb.Hsapiens.v75)))

As an illustration, we'll obtain location and protein sequence for ORMDL3.

In [11]:
edb = EnsDb.Hsapiens.v75  # abbreviate
#columns(edb)  # learn the names of the columns
select(edb, keytype="SYMBOL", keys="ORMDL3", 
        columns=c("GENEID", "GENENAME", "SEQNAME", "GENESEQSTART", "GENESEQEND", "TXID", "PROTEINSEQUENCE"))
SYMBOLGENEIDGENENAMESEQNAMEGENESEQSTARTGENESEQENDTXIDPROTEINSEQUENCE
ORMDL3 ENSG00000172057 ORMDL3 17 38077294 38083854 ENST00000304046 MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
ORMDL3 ENSG00000172057 ORMDL3 17 38077294 38083854 ENST00000579695 MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
ORMDL3 ENSG00000172057 ORMDL3 17 38077294 38083854 ENST00000579287 NA
ORMDL3 ENSG00000172057 ORMDL3 17 38077294 38083854 ENST00000581284 XNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
ORMDL3 ENSG00000172057 ORMDL3 17 38077294 38083854 ENST00000394169 MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
ORMDL3 ENSG00000172057 ORMDL3 17 38077294 38083854 ENST00000584220 MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
ORMDL3 ENSG00000172057 ORMDL3 17 38077294 38083854 ENST00000584000 MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTV
ORMDL3 ENSG00000172057 ORMDL3 17 38077294 38083854 ENST00000582052 NA

Genomic sequence and gene models for model organisms

Discovering available reference genomes

Bioconductor's collection of annotation packages brings all elements of this hierarchy into a programmable environment. Reference genomic sequences are managed using the infrastructure of the Biostrings and BSgenome packages, and the available.genomes function lists the reference genome build for humans and various model organisms now available.

In [12]:
library(BSgenome)
library(DT)
ag = available.genomes()
datatable(data.frame(packs=ag))
datatables

Reference build versions are important

The reference build for an organism is created de novo and then refined as algorithms and sequenced data improve. For humans, the Genome Research Consortium signed off on build 37 in 2009, and on build 38 in 2013.

Once a reference build is completed, it becomes easy to perform informative genomic sequence analysis on individuals, because one can focus on regions that are known to harbor allelic diversity.

Note that the genome sequence packages have long names that include build versions. It is very important to avoid mixing coordinates from different reference builds. In the liftOver video we show how to convert genomic coordinates of features between different reference builds, using the UCSC "liftOver" utility interfaced to R in the rtracklayer package.

To help users avoid mixing up data collected on incompatible genomic coordinate systems from different reference builds, we include a "genome" tag that can be filled out for most objects that hold sequence information. We'll see some examples of this shortly. Software for sequence comparison can check for compatible tags on the sequences being compared, and thereby help to ensure meaningful results.

A reference genomic sequence for H. sapiens

The reference sequence for Homo sapiens is acquired by installing and attaching a single package. This is in contrast to downloading and parsing FASTA files. The package defines an object Hsapiens that is the source of chromosomal sequence, but when evaluated on its own provides a report of the origins of the sequence data that it contains.

In [13]:
library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg19
# release date: Feb. 2009
# release name: Genome Reference Consortium GRCh37
# 93 sequences:
#   chr1                  chr2                  chr3                 
#   chr4                  chr5                  chr6                 
#   chr7                  chr8                  chr9                 
#   chr10                 chr11                 chr12                
#   chr13                 chr14                 chr15                
#   ...                   ...                   ...                  
#   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
#   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
#   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
#   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
#   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)

We acquire a chromosome's sequence using the $ operator.

In [14]:
Hsapiens$chr17
  81195210-letter "DNAString" instance
seq: AAGCTTCTCACCCTGTTCCTGCATAGATAATTGCAT...GTGGGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT

The transcripts and genes for a reference sequence

UCSC annotation

The TxDb family of packages and data objects manages information on transcripts and gene models. We consider those derived from annotation tables prepared for the UCSC genome browser.

In [15]:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene # abbreviate
txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg19
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: GRCh37
# transcript_nrow: 82960
# exon_nrow: 289969
# cds_nrow: 237533
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
# GenomicFeatures version at creation time: 1.21.30
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

We can use genes() to get the addresses of genes using Entrez Gene IDs.

In [16]:
ghs = genes(txdb)
ghs
GRanges object with 23056 ranges and 1 metadata column:
        seqnames                 ranges strand |     gene_id
           <Rle>              <IRanges>  <Rle> | <character>
      1    chr19 [ 58858172,  58874214]      - |           1
     10     chr8 [ 18248755,  18258723]      + |          10
    100    chr20 [ 43248163,  43280376]      - |         100
   1000    chr18 [ 25530930,  25757445]      - |        1000
  10000     chr1 [243651535, 244006886]      - |       10000
    ...      ...                    ...    ... .         ...
   9991     chr9 [114979995, 115095944]      - |        9991
   9992    chr21 [ 35736323,  35743440]      + |        9992
   9993    chr22 [ 19023795,  19109967]      - |        9993
   9994     chr6 [ 90539619,  90584155]      + |        9994
   9997    chr22 [ 50961997,  50964905]      - |        9997
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Filtering is supported, with suitable identifiers. Here we select all exons identified for two different genes, identified by their Entrez Gene ids:

In [17]:
eForTwo = exons(txdb, columns=c("EXONID", "TXNAME", "GENEID"),
                  filter=list(gene_id=c(100, 101)))
eForTwo
GRanges object with 39 ranges and 3 metadata columns:
       seqnames                 ranges strand |    EXONID
          <Rle>              <IRanges>  <Rle> | <integer>
   [1]    chr10 [135075920, 135076737]      - |    144421
   [2]    chr10 [135077192, 135077269]      - |    144422
   [3]    chr10 [135080856, 135080921]      - |    144423
   [4]    chr10 [135081433, 135081570]      - |    144424
   [5]    chr10 [135081433, 135081622]      - |    144425
   ...      ...                    ...    ... .       ...
  [35]    chr20   [43254210, 43254325]      - |    256371
  [36]    chr20   [43255097, 43255240]      - |    256372
  [37]    chr20   [43257688, 43257810]      - |    256373
  [38]    chr20   [43264868, 43264929]      - |    256374
  [39]    chr20   [43280216, 43280376]      - |    256375
                                 TXNAME          GENEID
                        <CharacterList> <CharacterList>
   [1] uc009ybi.3,uc010qva.2,uc021qbe.1             101
   [2]            uc009ybi.3,uc021qbe.1             101
   [3] uc009ybi.3,uc010qva.2,uc021qbe.1             101
   [4]                       uc009ybi.3             101
   [5]            uc010qva.2,uc021qbe.1             101
   ...                              ...             ...
  [35]                       uc002xmj.3             100
  [36]                       uc002xmj.3             100
  [37]                       uc002xmj.3             100
  [38]                       uc002xmj.3             100
  [39]                       uc002xmj.3             100
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
In [18]:
split(eForTwo, unlist(eForTwo$GENEID)) #notice that GENEID is a CharacterList
GRangesList object of length 2:
$100 
GRanges object with 12 ranges and 3 metadata columns:
       seqnames               ranges strand |    EXONID          TXNAME
          <Rle>            <IRanges>  <Rle> | <integer> <CharacterList>
   [1]    chr20 [43248163, 43248488]      - |    256364      uc002xmj.3
   [2]    chr20 [43248940, 43249042]      - |    256365      uc002xmj.3
   [3]    chr20 [43249659, 43249788]      - |    256366      uc002xmj.3
   [4]    chr20 [43251229, 43251293]      - |    256367      uc002xmj.3
   [5]    chr20 [43251470, 43251571]      - |    256368      uc002xmj.3
   ...      ...                  ...    ... .       ...             ...
   [8]    chr20 [43254210, 43254325]      - |    256371      uc002xmj.3
   [9]    chr20 [43255097, 43255240]      - |    256372      uc002xmj.3
  [10]    chr20 [43257688, 43257810]      - |    256373      uc002xmj.3
  [11]    chr20 [43264868, 43264929]      - |    256374      uc002xmj.3
  [12]    chr20 [43280216, 43280376]      - |    256375      uc002xmj.3
                GENEID
       <CharacterList>
   [1]             100
   [2]             100
   [3]             100
   [4]             100
   [5]             100
   ...             ...
   [8]             100
   [9]             100
  [10]             100
  [11]             100
  [12]             100

...
<1 more element>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

Finding and managing gene sets

We've had a brief look at how to use org.Hs.eg.db, GO.db and KEGGREST to identify gene sets defined by ontology categories or pathways.

org.Hs.eg.db (and org.Mm.eg.db, org.Sc.sgd.db, etc.) provide GO Annotation mappings as determined by the curating organizations (NCBI, SGD, etc.). In this context gene sets are essentially vectors of identifiers.

KEGGREST delivers more information about its pathways. Pathways include genes and references to higher-level modules. References to literature and disease association are also provided.

In [19]:
str(lkOne)
List of 12
 $ ENTRY      : Named chr "hsa03440"
  ..- attr(*, "names")= chr "Pathway"
 $ NAME       : chr "Homologous recombination - Homo sapiens (human)"
 $ DESCRIPTION: chr "Homologous recombination (HR) is essential for the accurate repair of DNA double-strand breaks (DSBs), potentia"| __truncated__
 $ CLASS      : chr "Genetic Information Processing; Replication and repair"
 $ PATHWAY_MAP: Named chr "Homologous recombination"
  ..- attr(*, "names")= chr "hsa03440"
 $ MODULE     : Named chr [1:3] "RPA complex [PATH:hsa03440]" "MRN complex [PATH:hsa03440]" "BRCA1-associated genome surveillance complex (BASC) [PATH:hsa03440]"
  ..- attr(*, "names")= chr [1:3] "hsa_M00288" "hsa_M00291" "hsa_M00295"
 $ DISEASE    : Named chr [1:7] "Ataxia telangiectasia (AT)" "Immunodeficiency associated with DNA repair defects" "Defects in RecQ helicases" "Congenital mirror movements (CMM)" ...
  ..- attr(*, "names")= chr [1:7] "H00064" "H00094" "H00296" "H01287" ...
 $ DBLINKS    : chr [1:2] "BSID: 83046" "GO: 0000724"
 $ ORGANISM   : Named chr "NA  Homo sapiens (human) [GN:hsa]"
  ..- attr(*, "names")= chr "Homo sapiens (human) [GN:hsa]"
 $ GENE       : chr [1:82] "6742" "SSBP1; single stranded DNA binding protein 1 [KO:K03111]" "10111" "RAD50; RAD50 double strand break repair protein [KO:K10866] [EC:3.6.-.-]" ...
 $ KO_PATHWAY : chr "ko03440"
 $ REFERENCE  :List of 7
  ..$ :List of 4
  .. ..$ REFERENCE: chr "PMID:18086882"
  .. ..$ AUTHORS  : chr "Maloisel L, Fabre F, Gangloff S."
  .. ..$ TITLE    : chr "DNA polymerase delta is preferentially recruited during homologous recombination to promote heteroduplex DNA extension."
  .. ..$ JOURNAL  : chr [1:2] "Mol Cell Biol 28:1373-82 (2008)" "DOI:10.1128/MCB.01651-07"
  ..$ :List of 4
  .. ..$ REFERENCE: chr "PMID:17571595"
  .. ..$ AUTHORS  : chr "Haruta-Takahashi N, Iwasaki H."
  .. ..$ TITLE    : chr "[Recombination mediators] Japanese"
  .. ..$ JOURNAL  : chr "Seikagaku 79:449-53 (2007)"
  ..$ :List of 4
  .. ..$ REFERENCE: chr "PMID:18430459"
  .. ..$ AUTHORS  : chr "Ouyang KJ, Woo LL, Ellis NA."
  .. ..$ TITLE    : chr "Homologous recombination and maintenance of genome integrity: Cancer and aging through the prism of human RecQ helicases."
  .. ..$ JOURNAL  : chr [1:2] "Mech Ageing Dev 129:425-40 (2008)" "DOI:10.1016/j.mad.2008.03.003"
  ..$ :List of 4
  .. ..$ REFERENCE: chr "PMID:15902993"
  .. ..$ AUTHORS  : chr "Kawabata M, Kawabata T, Nishibori M."
  .. ..$ TITLE    : chr "Role of recA/RAD51 family proteins in mammals."
  .. ..$ JOURNAL  : chr "Acta Med Okayama 59:1-9 (2005)"
  ..$ :List of 4
  .. ..$ REFERENCE: chr "PMID:12589470"
  .. ..$ AUTHORS  : chr "Prado F, Cortes-Ledesma F, Huertas P, Aguilera A."
  .. ..$ TITLE    : chr "Mitotic recombination in Saccharomyces cerevisiae."
  .. ..$ JOURNAL  : chr [1:2] "Curr Genet 42:185-98 (2003)" "DOI:10.1007/s00294-002-0346-3"
  ..$ :List of 4
  .. ..$ REFERENCE: chr "PMID:16935872"
  .. ..$ AUTHORS  : chr "Heyer WD, Li X, Rolfsmeier M, Zhang XP."
  .. ..$ TITLE    : chr "Rad54: the Swiss Army knife of homologous recombination?"
  .. ..$ JOURNAL  : chr [1:2] "Nucleic Acids Res 34:4115-25 (2006)" "DOI:10.1093/nar/gkl481"
  ..$ :List of 4
  .. ..$ REFERENCE: chr "PMID:17395553"
  .. ..$ AUTHORS  : chr "Michel B, Boubakri H, Baharoglu Z, LeMasson M, Lestini R."
  .. ..$ TITLE    : chr "Recombination proteins and rescue of arrested replication forks."
  .. ..$ JOURNAL  : chr [1:2] "DNA Repair (Amst) 6:967-80 (2007)" "DOI:10.1016/j.dnarep.2007.02.016"

A major repository of curated gene sets is the Broad Institute MSigDb. We use the GSEABase package to examine a snapshot of gene sets annotated to glioblastoma.

In [20]:
library(GSEABase)
glioG = getGmt(system.file("gmt/glioSets.gmt", package="ph525x"))
glioG
Warning message in readLines(con, ...):
"incomplete final line found on '/home/ubuntu/R-3.4.3/library/ph525x/gmt/glioSets.gmt'"
GeneSetCollection
  names: BALDWIN_PRKCI_TARGETS_UP, BEIER_GLIOMA_STEM_CELL_DN, ..., ZHENG_GLIOBLASTOMA_PLASTICITY_UP (47 total)
  unique identifiers: ADA, AQP9, ..., ZFP28 (3671 total)
  types in collection:
    geneIdType: NullIdentifier (1 total)
    collectionType: NullCollection (1 total)

There are 3671 genes grouped into 47 sets. The names of the sets and their sizes can be reckoned through

In [21]:
datatable(data.frame(setnms=names(glioG), sizes=
                        vapply(glioG, function(x) length(geneIds(x)), numeric(1))))
datatables

To determine the overlap among gene sets, use code such as the following:

In [22]:
intersect(geneIds(glioG[[8]]), geneIds(glioG[[47]]))
  1. 'GPD2'
  2. 'TOP2A'
  3. 'TRIM2'
  4. 'ZNF367'

Exercise.

How many genes are shared between the TCGA "copy number up" and Classical gene set identified by Verhaak?

The full structure of the GSEABase GeneSet class allows for inclusion of considerable metadata.

In [23]:
details(glioG[[44]])
setName: VERHAAK_GLIOBLASTOMA_NEURAL 
geneIds: ACSL4, ACYP2, ..., ZNF323 (total: 129)
geneIdType: Null
collectionType: Null 
setIdentifier: ip-172-31-1-243:2218:Thu Feb  8 22:27:58 2018:45
description: > Genes correlated with neural type of glioblastoma multiforme tumors.
organism: 
pubMedIds: 
urls: 
contributor: 
setVersion: 0.0.1
creationDate: 

Unfortunately, very little metadata are recorded. We can add some, recognizing that the Verhaak paper has PMID 20129251, and that the geneIdType is "symbol".

In [24]:
vhNeur = glioG[[44]]
pubMedIds(vhNeur) = "20129251"
details(vhNeur)
setName: VERHAAK_GLIOBLASTOMA_NEURAL 
geneIds: ACSL4, ACYP2, ..., ZNF323 (total: 129)
geneIdType: Null
collectionType: Null 
setIdentifier: ip-172-31-1-243:2218:Thu Feb  8 22:28:00 2018:49
description: > Genes correlated with neural type of glioblastoma multiforme tumors.
organism: 
pubMedIds: 20129251
urls: 
contributor: 
setVersion: 0.0.1
creationDate: 

One utility of recording the PMID is that it simplifies retrieving the abstract of the associated paper:

In [25]:
library(annotate)
vhM = pmid2MIAME("20129251")
abstract(vhM)
'The Cancer Genome Atlas Network recently cataloged recurrent genomic abnormalities in glioblastoma multiforme (GBM). We describe a robust gene expression-based molecular classification of GBM into Proneural, Neural, Classical, and Mesenchymal subtypes and integrate multidimensional genomic data to establish patterns of somatic mutations and DNA copy number. Aberrations and gene expression of EGFR, NF1, and PDGFRA/IDH1 each define the Classical, Mesenchymal, and Proneural subtypes, respectively. Gene signatures of normal brain cell types show a strong relationship between subtypes and different neural lineages. Additionally, response to aggressive therapy differs by subtype, with the greatest benefit in the Classical subtype and no benefit in the Proneural subtype. We provide a framework that unifies transcriptomic and genomic dimensions for GBM molecular stratification with important implications for future studies.Copyright (c) 2010 Elsevier Inc. All rights reserved.'

Another nice feature of GSEABase is that it simplifies translation between elements of gene sets. To make use of this, we have to set the identifier type.

In [26]:
geneIdType(vhNeur) = SymbolIdentifier()
geneIdType(vhNeur) = AnnotationIdentifier("org.Hs.eg.db")
vhNeur
setName: VERHAAK_GLIOBLASTOMA_NEURAL 
geneIds: 2182, 98, ..., 51646 (total: 122)
geneIdType: Annotation (org.Hs.eg.db)
collectionType: Null 
details: use 'details(object)'

Exercise

Notice that the translation of symbols to Entrez identifiers just demonstrated reduced the size of the gene set from 129 to 122. Why?

Ontology concepts and tools with Bioconductor

We've seen that GO.db provides gene ontology categories and tags, and org.Hs.eg.db provides mappings between categories and genes. Many ontologies have been created in the biosciences, and the ontologyIndex and rols packages are helpful for surveying the field.

Here we'll have a look at the Cell Ontology using the ontoProc helper package.

In [27]:
library(ontoProc)
ot = getCellOnto()  # fixed date snapshot
ot
Ontology with 6546 terms



Properties:
	id: character
	name: character
	parents: list
	children: list
	ancestors: list
	obsolete: logical
Roots:
	GO:0008150 - biological_process
	GO:0005575 - cellular_component
	UBERON:0001062 - anatomical entity
	GO:0003674 - molecular_function
	BFO:0000002 - continuant
	PATO:0000001 - quality
	BFO:0000003 - occurrent
	NCBITaxon:1 - root
	PR:000018263 - amino acid chain
	PR:000021082 - bone marrow proteoglycan proteolytic cleavage product
 ... 113 more

This shows that the Cell Ontology is composed of selections from a number of bioontologies.

We can visualize relationships among terms using ontologyPlot. We've selected five different terms related to neuron cells, and the plotting function expands the number of terms displayed to illustrate context.

In [28]:
library(ontologyPlot)
onto_plot(ot, c("CL:0000095", "CL:0000679", "CL:2000028", "CL:1001509", "CL:2000031"))

Round trips: your results as annotation objects

We have some textual data in bedGraph format, produced by ENCODE. It gives regions and scores for ESRRA binding in the GM12878 B-cell line.

In [29]:
f1 = dir(system.file("extdata",package="ERBS"), full=TRUE)[1]
f1
readLines(f1, 4) # look at a few lines
'/home/ubuntu/R-3.4.3/library/ERBS/extdata/ENCFF001VEH.narrowPeak'
  1. 'chrX\t1509354\t1512462\t5\t0\t.\t157.92\t310\t32.000000\t1991'
  2. 'chrX\t26801421\t26802448\t6\t0\t.\t147.38\t310\t32.000000\t387'
  3. 'chr19\t11694101\t11695359\t1\t0\t.\t99.71\t311.66\t32.000000\t861'
  4. 'chr19\t4076892\t4079276\t4\t0\t.\t84.74\t310\t32.000000\t1508'

rtracklayer can be used to import this text to a GRanges instance.

In [30]:
library(rtracklayer)
bg1 = import.bedGraph(f1)
bg1
GRanges object with 1873 ranges and 7 metadata columns:
         seqnames               ranges strand |     score       NA.      NA.1
            <Rle>            <IRanges>  <Rle> | <numeric> <integer> <logical>
     [1]     chrX [ 1509355,  1512462]      * |         5         0      <NA>
     [2]     chrX [26801422, 26802448]      * |         6         0      <NA>
     [3]    chr19 [11694102, 11695359]      * |         1         0      <NA>
     [4]    chr19 [ 4076893,  4079276]      * |         4         0      <NA>
     [5]     chr3 [53288568, 53290767]      * |         9         0      <NA>
     ...      ...                  ...    ... .       ...       ...       ...
  [1869]    chr19 [11201120, 11203985]      * |      8701         0      <NA>
  [1870]    chr19 [ 2234920,  2237370]      * |       990         0      <NA>
  [1871]     chr1 [94311336, 94313543]      * |      4035         0      <NA>
  [1872]    chr19 [45690614, 45691210]      * |     10688         0      <NA>
  [1873]    chr19 [ 6110100,  6111252]      * |      2274         0      <NA>
              NA.2      NA.3      NA.4      NA.5
         <numeric> <numeric> <numeric> <integer>
     [1]    157.92       310        32      1991
     [2]    147.38       310        32       387
     [3]     99.71    311.66        32       861
     [4]     84.74       310        32      1508
     [5]      78.2   299.505        32      1772
     ...       ...       ...       ...       ...
  [1869]      8.65     7.281   0.26576      2496
  [1870]      8.65    26.258  1.995679      1478
  [1871]      8.65    12.511   1.47237      1848
  [1872]      8.65     6.205         0       298
  [1873]      8.65    17.356  2.013228       496
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

Note that the fields (with the exception of 'score') are unnamed, and the reference genome build is unspecified. The narrowPeak format document provides some interpretive help.

Clearly, if you had generated these results, you would want to clarify the interpretations of the fields and specify the reference genome build to which the addresses refer. Distributing your results as a GRanges would allow this binding of metadata to the experimental outcome.

In the event you need to transform data that you have as a GRanges to a textual format, the export method of rtracklayer can perform this.

In [31]:
td = tempdir()
tf = paste0(td, "/demoex.bed")
export(bg1, tf)  # implicit format choice
cat(readLines(tf, n=5), sep="\n")
chrX	1509354	1512462	.	5	.
chrX	26801421	26802448	.	6	.
chr19	11694101	11695359	.	1	.
chr19	4076892	4079276	.	4	.
chr3	53288567	53290767	.	9	.

We have carried out a “round trip” of importing, modeling, and exporting experimental data that can be integrated with other data to advance biological understanding.

What we have to watch out for is the idea that annotation is somehow permanently correct, isolated from the cacophony of research progress at the boundaries of knowledge. We have seen that even the reference sequences of human chromosomes are subject to revision.

Bioconductor has taken pains to acknowledge many facets of this situation. We maintain archives of prior versions of software and annotation so that past work can be checked or revised. We update central annotation resources twice a year so that there is stability for ongoing work as well as access to new knowledge. And we have made it simple to import and to create representations of experimental and annotation data.

AnnotationHub -- curated access to reference annotation

From the AnnotationHub vignette:

The AnnotationHub server provides easy R / Bioconductor access to large collections of publicly available whole genome resources, e.g,. ENSEMBL genome fasta or gtf files, UCSC chain resources, ENCODE data tracks at UCSC, etc.

We will get a general overview and then carry out a detailed query. We start by loading the package and obtaining a hub object.

In [32]:
library(AnnotationHub)
ah = AnnotationHub()
ah
snapshotDate(): 2017-10-27
AnnotationHub with 43025 records
# snapshotDate(): 2017-10-27 
# $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus,...
# $rdataclass: GRanges, BigWigFile, FaFile, TwoBitFile, Rle, ChainFile, OrgD...
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH2"]]' 

            title                                                 
  AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa     
  AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa  
  AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa  
  AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa            
  AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa          
  ...       ...                                                   
  AH60735 | Xenopus_tropicalis.JGI_4.2.ncrna.2bit                 
  AH60736 | Xiphophorus_maculatus.Xipmac4.4.2.cdna.all.2bit       
  AH60737 | Xiphophorus_maculatus.Xipmac4.4.2.dna_rm.toplevel.2bit
  AH60738 | Xiphophorus_maculatus.Xipmac4.4.2.dna_sm.toplevel.2bit
  AH60739 | Xiphophorus_maculatus.Xipmac4.4.2.ncrna.2bit          

Note that there is a specific snapshot date. The mcols method produces metadata about the various resources. The $ shortcut also works.

In [33]:
dim(mcols(ah))
  1. 43025
  2. 15

The rdataclass field of the metadata tells us what kinds of representations are available.

In [34]:
table(ah$rdataclass)
     AAStringSet       BigWigFile           biopax        ChainFile 
               1            10247                9             1113 
      data.frame            EnsDb           FaFile          GRanges 
              24              282             5122            19268 
   Inparanoid8Db             list           MSnSet         mzRident 
             268                3                1                1 
         mzRpwiz            OrgDb              Rle SQLiteConnection 
               1             1018             1586                1 
      TwoBitFile             TxDb          VcfFile 
            4027               45                8 

The query method

We can use the query method to look for records concerning resources available in the hub. An algorithm for labeling chromatin regions according to epigenetic state is ChromHmm and we search for records concerning its application.

In [35]:
q1 = query(ah, "ChromHmm")
q1
AnnotationHub with 128 records
# snapshotDate(): 2017-10-27 
# $dataprovider: BroadInstitute, UCSC
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH5202"]]' 

            title                             
  AH5202  | Broad ChromHMM                    
  AH46856 | E001_15_coreMarks_mnemonics.bed.gz
  AH46857 | E002_15_coreMarks_mnemonics.bed.gz
  AH46858 | E003_15_coreMarks_mnemonics.bed.gz
  AH46859 | E004_15_coreMarks_mnemonics.bed.gz
  ...       ...                               
  AH46978 | E125_15_coreMarks_mnemonics.bed.gz
  AH46979 | E126_15_coreMarks_mnemonics.bed.gz
  AH46980 | E127_15_coreMarks_mnemonics.bed.gz
  AH46981 | E128_15_coreMarks_mnemonics.bed.gz
  AH46982 | E129_15_coreMarks_mnemonics.bed.gz

Metadata related to this query can be displayed for deeper searching.

In [36]:
datatable(as.data.frame(mcols(q1)))
datatables

Retrieval

Recall the rdataclass query above.

In [37]:
table(ah$rdataclass)
     AAStringSet       BigWigFile           biopax        ChainFile 
               1            10247                9             1113 
      data.frame            EnsDb           FaFile          GRanges 
              24              282             5122            19268 
   Inparanoid8Db             list           MSnSet         mzRident 
             268                3                1                1 
         mzRpwiz            OrgDb              Rle SQLiteConnection 
               1             1018             1586                1 
      TwoBitFile             TxDb          VcfFile 
            4027               45                8 

What do the VCF files represent?

In [38]:
datatable(as.data.frame(mcols(ah)[which(ah$rdataclass=="VcfFile"),]))
datatables

Let's obtain the first ClinVar file. We use [[ with the tag. If we have not done this previously, the resource is added to our AnnotationHub cache.

In [39]:
cv1 = ah[["AH57956"]]
cv1
require("VariantAnnotation")
loading from cache '/home/ubuntu//.AnnotationHub/64694'
    '/home/ubuntu//.AnnotationHub/64695'
class: VcfFile 
path: /home/ubuntu//.AnnotationHub/64694
index: /home/ubuntu//.AnnotationHub/64695
isOpen: FALSE 
yieldSize: NA 

We'll look at the header for some hints.

In [40]:
cv1h = scanVcfHeader(cv1)
cv1h
class: VCFHeader 
samples(0):
meta(1): META
fixed(1): FILTER
info(58): RS RSPOS ... CLNREVSTAT CLNACC
geno(0):

The META component is usually informative.

In [41]:
meta(cv1h)$META
DataFrame with 6 rows and 1 column
                                                                                             Value
                                                                                       <character>
fileformat                                                                                 VCFv4.0
fileDate                                                                                  20160203
source                                                                           ClinVar and dbSNP
reference                                                                               GRCh37.p13
phasing                                                                                    partial
variationPropertyDocumentationUrl ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf\t

We'll now use readVcf to extract content on a region of interest.

In [42]:
rv1 = readVcf(cv1, genome="GRCh37.p13", param=ScanVcfParam(which=GRanges("1", IRanges(949000, 996000))))
rv1
class: CollapsedVCF 
dim: 42 0 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 58 columns: RS, RSPOS, RV, VP, GENEINFO, dbSNPBuildID, SAO,...
info(header(vcf)):
                Number Type    Description                                     
   RS           1      Integer dbSNP ID (i.e. rs number)                       
   RSPOS        1      Integer Chr position reported in dbSNP                  
   RV           0      Flag    RS orientation is reversed                      
   VP           1      String  Variation Property.  Documentation is at ftp:...
   GENEINFO     1      String  Pairs each of gene symbol:gene id.  The gene ...
   dbSNPBuildID 1      Integer First dbSNP Build for RS                        
   SAO          1      Integer Variant Allele Origin: 0 - unspecified, 1 - G...
   SSR          1      Integer Variant Suspect Reason Codes (may be more tha...
   WGT          1      Integer Weight, 00 - unmapped, 1 - weight 1, 2 - weig...
   VC           1      String  Variation Class                                 
   PM           0      Flag    Variant is Precious(Clinical,Pubmed Cited)      
   TPA          0      Flag    Provisional Third Party Annotation(TPA) (curr...
   PMC          0      Flag    Links exist to PubMed Central article           
   S3D          0      Flag    Has 3D structure - SNP3D table                  
   SLO          0      Flag    Has SubmitterLinkOut - From SNP->SubSNP->Batc...
   NSF          0      Flag    Has non-synonymous frameshift A coding region...
   NSM          0      Flag    Has non-synonymous missense A coding region v...
   NSN          0      Flag    Has non-synonymous nonsense A coding region v...
   REF          0      Flag    Has reference A coding region variation where...
   SYN          0      Flag    Has synonymous A coding region variation wher...
   U3           0      Flag    In 3' UTR Location is in an untranslated regi...
   U5           0      Flag    In 5' UTR Location is in an untranslated regi...
   ASS          0      Flag    In acceptor splice site FxnCode = 73            
   DSS          0      Flag    In donor splice-site FxnCode = 75               
   INT          0      Flag    In Intron FxnCode = 6                           
   R3           0      Flag    In 3' gene region FxnCode = 13                  
   R5           0      Flag    In 5' gene region FxnCode = 15                  
   OTH          0      Flag    Has other variant with exactly the same set o...
   CFL          0      Flag    Has Assembly conflict. This is for weight 1 a...
   ASP          0      Flag    Is Assembly specific. This is set if the vari...
   MUT          0      Flag    Is mutation (journal citation, explicit fact)...
   VLD          0      Flag    Is Validated.  This bit is set if the variant...
   G5A          0      Flag    >5% minor allele frequency in each and all po...
   G5           0      Flag    >5% minor allele frequency in 1+ populations    
   HD           0      Flag    Marker is on high density genotyping kit (50K...
   GNO          0      Flag    Genotypes available. The variant has individu...
   KGPhase1     0      Flag    1000 Genome phase 1 (incl. June Interim phase 1)
   KGPhase3     0      Flag    1000 Genome phase 3                             
   CDA          0      Flag    Variation is interrogated in a clinical diagn...
   LSD          0      Flag    Submitted from a locus-specific database        
   MTP          0      Flag    Microattribution/third-party annotation(TPA:G...
   OM           0      Flag    Has OMIM/OMIA                                   
   NOC          0      Flag    Contig allele not present in variant allele l...
   WTD          0      Flag    Is Withdrawn by submitter If one member ss is...
   NOV          0      Flag    Rs cluster has non-overlapping allele sets. T...
   CAF          .      String  An ordered, comma delimited list of allele fr...
   COMMON       1      Integer RS is a common SNP.  A common SNP is one that...
   CLNHGVS      .      String  Variant names from HGVS.    The order of thes...
   CLNALLE      .      Integer Variant alleles from REF or ALT columns.  0 i...
   CLNSRC       .      String  Variant Clinical Chanels                        
   CLNORIGIN    .      String  Allele Origin. One or more of the following v...
   CLNSRCID     .      String  Variant Clinical Channel IDs                    
   CLNSIG       .      String  Variant Clinical Significance, 0 - Uncertain ...
   CLNDSDB      .      String  Variant disease database name                   
   CLNDSDBID    .      String  Variant disease database ID                     
   CLNDBN       .      String  Variant disease name                            
   CLNREVSTAT   .      String  no_assertion - No assertion provided, no_crit...
   CLNACC       .      String  Variant Accession and Versions                  
geno(vcf):
  SimpleList of length 0: 

We now have metadata on 42 variants in the region of interest we specified. The addresses are:

In [43]:
rowRanges(rv1)
GRanges object with 42 ranges and 5 metadata columns:
              seqnames           ranges strand | paramRangeID            REF
                 <Rle>        <IRanges>  <Rle> |     <factor> <DNAStringSet>
  rs786201005        1 [949523, 949523]      * |         <NA>              C
  rs672601345        1 [949696, 949696]      * |         <NA>              C
  rs672601312        1 [949739, 949739]      * |         <NA>              G
  rs115173026        1 [955597, 955597]      * |         <NA>              G
  rs201073369        1 [955619, 955619]      * |         <NA>              G
          ...      ...              ...    ... .          ...            ...
   rs17160781        1 [986737, 986737]      * |         <NA>              T
   rs17778478        1 [987142, 987142]      * |         <NA>              C
    rs9803031        1 [987200, 987200]      * |         <NA>              C
   rs74685771        1 [989207, 989207]      * |         <NA>              G
    rs4275402        1 [990280, 990280]      * |         <NA>              C
                             ALT      QUAL      FILTER
              <DNAStringSetList> <numeric> <character>
  rs786201005                  T      <NA>           .
  rs672601345                 CG      <NA>           .
  rs672601312                  T      <NA>           .
  rs115173026                  T      <NA>           .
  rs201073369                  C      <NA>           .
          ...                ...       ...         ...
   rs17160781                  C      <NA>           .
   rs17778478                  T      <NA>           .
    rs9803031                  T      <NA>           .
   rs74685771              A,C,T      <NA>           .
    rs4275402                  T      <NA>           .
  -------
  seqinfo: 1 sequence from GRCh37.p13 genome; no seqlengths

The info metadata component can be picked apart into sets of columns. The first 10 columns are basic annotation.

In [44]:
info(rv1)[,1:10]
DataFrame with 42 rows and 10 columns
                   RS     RSPOS        RV                         VP
            <integer> <integer> <logical>                <character>
rs786201005 786201005    949523     FALSE 0x050060000605000002110100
rs672601345 672601345    949699     FALSE 0x050060001205000002110200
rs672601312 672601312    949739     FALSE 0x050060000605000002110100
rs115173026 115173026    955597     FALSE 0x050060000305170036100100
rs201073369 201073369    955619     FALSE 0x050000000a05040026000100
...               ...       ...       ...                        ...
rs17160781   17160781    986737     FALSE 0x050360000305150536100100
rs17778478   17778478    987142     FALSE 0x050260000305040536100100
rs9803031     9803031    987200     FALSE 0x05016008000515053f100100
rs74685771   74685771    989207     FALSE 0x050260000a05050436100104
rs4275402     4275402    990280     FALSE 0x05036000030517053f100100
               GENEINFO dbSNPBuildID       SAO       SSR       WGT          VC
            <character>    <integer> <integer> <integer> <integer> <character>
rs786201005  ISG15:9636          144         1         0         1         SNV
rs672601345  ISG15:9636          142         1         0         1         DIV
rs672601312  ISG15:9636          142         1         0         1         SNV
rs115173026 AGRN:375790          132         1         0         1         SNV
rs201073369 AGRN:375790          137         0         0         1         SNV
...                 ...          ...       ...       ...       ...         ...
rs17160781  AGRN:375790          123         1         0         1         SNV
rs17778478  AGRN:375790          123         1         0         1         SNV
rs9803031   AGRN:375790          119         1         0         1         SNV
rs74685771  AGRN:375790          132         1         0         1         SNV
rs4275402   AGRN:375790          111         1         0         1         SNV

The last nine columns are for clinical genetics interpretation.

In [45]:
info(rv1)[,50:58]
DataFrame with 42 rows and 9 columns
                          CLNSRC       CLNORIGIN        CLNSRCID
                 <CharacterList> <CharacterList> <CharacterList>
rs786201005 OMIM_Allelic_Variant               1     147571.0003
rs672601345 OMIM_Allelic_Variant               1     147571.0002
rs672601312 OMIM_Allelic_Variant               1     147571.0001
rs115173026                    .               1               .
rs201073369                    .               1               .
...                          ...             ...             ...
rs17160781                     .               1               .
rs17778478                     .               1               .
rs9803031                      .               1               .
rs74685771                     .               1               .
rs4275402                      .               1               .
                     CLNSIG              CLNDSDB                   CLNDSDBID
            <CharacterList>      <CharacterList>             <CharacterList>
rs786201005               5 MedGen:OMIM:Orphanet CN221808:616126:ORPHA319563
rs672601345               5 MedGen:OMIM:Orphanet CN221808:616126:ORPHA319563
rs672601312               5 MedGen:OMIM:Orphanet CN221808:616126:ORPHA319563
rs115173026               2               MedGen                    CN169374
rs201073369               0               MedGen                    CN169374
...                     ...                  ...                         ...
rs17160781                3               MedGen                    CN169374
rs17778478                3               MedGen                    CN169374
rs9803031                 2               MedGen                    CN169374
rs74685771                3               MedGen                    CN169374
rs4275402                 2               MedGen                    CN169374
                         CLNDBN      CLNREVSTAT          CLNACC
                <CharacterList> <CharacterList> <CharacterList>
rs786201005 Immunodeficiency_38     no_criteria  RCV000162196.3
rs672601345 Immunodeficiency_38     no_criteria  RCV000148989.5
rs672601312 Immunodeficiency_38     no_criteria  RCV000148988.5
rs115173026       not_specified          single  RCV000116272.3
rs201073369       not_specified          single  RCV000193277.1
...                         ...             ...             ...
rs17160781        not_specified     no_criteria  RCV000116277.2
rs17778478        not_specified     no_criteria  RCV000116278.2
rs9803031         not_specified          single  RCV000116279.1
rs74685771        not_specified     no_criteria  RCV000116280.2
rs4275402         not_specified          single  RCV000116281.1

To get more information like the link for the variation property bitfield you can use the info accessor on the header.

In [46]:
info(header(rv1))
DataFrame with 58 rows and 3 columns
                Number        Type
           <character> <character>
RS                   1     Integer
RSPOS                1     Integer
RV                   0        Flag
VP                   1      String
GENEINFO             1      String
...                ...         ...
CLNDSDB              .      String
CLNDSDBID            .      String
CLNDBN               .      String
CLNREVSTAT           .      String
CLNACC               .      String
                                                                                                                                                                                                                                                                                                                Description
                                                                                                                                                                                                                                                                                                                <character>
RS                                                                                                                                                                                                                                                                                                dbSNP ID (i.e. rs number)
RSPOS                                                                                                                                                                                                                                                                                        Chr position reported in dbSNP
RV                                                                                                                                                                                                                                                                                               RS orientation is reversed
VP                                                                                                                                                                                                                  Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
GENEINFO                                                                                                                                                                           Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)
...                                                                                                                                                                                                                                                                                                                     ...
CLNDSDB                                                                                                                                                                                                                                                                                       Variant disease database name
CLNDSDBID                                                                                                                                                                                                                                                                                       Variant disease database ID
CLNDBN                                                                                                                                                                                                                                                                                                 Variant disease name
CLNREVSTAT no_assertion - No assertion provided, no_criteria - No assertion criteria provided, single - Criteria provided single submitter, mult - Criteria provided multiple submitters no conflicts, conf - Criteria provided conflicting interpretations, exp - Reviewed by expert panel, guideline - Practice guideline
CLNACC                                                                                                                                                                                                                                                                                       Variant Accession and Versions

AnnotationHub roundup

In summary, AnnotationHub can harbor detailed information on many aspects of modern systems biology. Once you have identified something useful, it will be downloaded to your local system and cached, so that the retrieval need not recur. If the hub version is updated, the cache will be checked and refreshed as needed.

Annotation roundup

This has been a wide-ranging review of genomic annotation resources in Bioconductor. I'll annotate the road map with key functions and packages.

General overview of Bioconductor annotation
  Levels: reference sequence, regions of interest, pathways
     -- BSgenome.Hsapiens.UCSC.hg19, genes(Homo.sapiens)
  OrgDb: simple interface to annotation databases
     -- org.Hs.eg.db
  Discovering reference sequence
     -- available.genomes
  A build of the human genome
     -- Hsapiens$chr17
  Gene/Transcript/Exon catalogs from UCSC and Ensembl
     -- TxDb.Hsapiens.UCSC.hg19.knownGene, EnsDb.Hsapiens.v75
  Finding and managing gene sets
     -- GSEABase, KEGGREST

Specific annotation concerns
  Ontology concepts and tools
     -- ontoProc, ontologyPlot, rols
  Importing and exporting regions and scores
     -- rtracklayer
  AnnotationHub: brokering thousands of annotation resources
     -- AnnotationHub, query(), ah[[...]]