Brown Univ. Introduction to Bioconductor 2018, Period 2

Genomic ranges for organizing and interrogating genome-scale data

Period 2 has the following basic outline. We want to understand the basic IRanges and GRanges infrastructure components and then use them to organize and interrogate genomic experiments.

Period II. Working with general genomic features using GenomicRanges
  IRanges introduced
  Intra-range operations
  Inter-range operations
  GRanges
  Calculating overlaps
Range-oriented solutions for current experimental paradigms
  GenomicFiles for families of BAM or BED
  SummarizedExperiment: for RNA-seq
In [1]:
suppressPackageStartupMessages({  # basic setup tasks
    library(IRanges)
    library(Homo.sapiens)
    library(GenomicRanges)
    library(gwascat)
    library(BiocStyle)
    library(RNAseqData.HNRNPC.bam.chr14)
    library(GenomicFiles)
    library(erma)
    library(rtracklayer)
    })

Introducing IRanges

The IRanges package is fundamental infrastructure for Bioconductor. It lies at the heart of much genomic annotation and data representation.

The schematic diagram below should be read from the bottom up. The horizontal scale can be regarded as genomic base positions.

Intra-range operations

We are working with the positions in the interval [5, 10], shaded pink above. We will learn how to interpret the methods shift, narrow, flank, resize, and various arithmetic operations.

We create our basic IRanges instance:

In [2]:
ir = IRanges(5, 10)
ir
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         5        10         6

Now function calls for selected 'intra-range' operations.

In [3]:
shift(ir, -2)
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3         8         6
In [4]:
resize(ir, 1)
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         5         5         1

Multi-range objects

We can create a family of ranges using vector inputs to the IRanges method.

In [5]:
ir <- IRanges(c(3, 8, 14, 15, 19, 34, 40),
  width = c(12, 6, 6, 15, 6, 2, 7))
ir
IRanges object with 7 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        14        12
  [2]         8        13         6
  [3]        14        19         6
  [4]        15        29        15
  [5]        19        24         6
  [6]        34        35         2
  [7]        40        46         7

This range set is displayed in the figure below. The intra-range operations will be applied elementwise.

In [6]:
resize(ir,1)  # leftmost width-1 position
IRanges object with 7 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3         3         1
  [2]         8         8         1
  [3]        14        14         1
  [4]        15        15         1
  [5]        19        19         1
  [6]        34        34         1
  [7]        40        40         1

Inter-range operations

Information about inter-range operations can be obtained using ?"inter-range-methods". For example, for a multi-range instance ir, reduce(ir) produces a new IRanges instance representing the merging of all locations occupied by any range.

In [7]:
reduce(ir)
IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        29        27
  [2]        34        35         2
  [3]        40        46         7

Metadata and indexing for ranges

We can give names to ranges, associate multiple fields of metadata to each range (using mcols), and use bracket-style indexing.

In [8]:
names(ir) = letters[1:7]
ir[c("a", "d")]
IRanges object with 2 ranges and 0 metadata columns:
        start       end     width
    <integer> <integer> <integer>
  a         3        14        12
  d        15        29        15

The association of a collection of attributes with each range is very useful for genomic annotation. Any conformant data.frame instance (one row per range) can be used with the mcols assignment operation to annotate a set of ranges. We use some attributes of the mtcars dataset for illustration.

In [9]:
mcols(ir) = mtcars[1:7,1:3]
ir
IRanges object with 7 ranges and 3 metadata columns:
        start       end     width |       mpg       cyl      disp
    <integer> <integer> <integer> | <numeric> <numeric> <numeric>
  a         3        14        12 |        21         6       160
  b         8        13         6 |        21         6       160
  c        14        19         6 |      22.8         4       108
  d        15        29        15 |      21.4         6       258
  e        19        24         6 |      18.7         8       360
  f        34        35         2 |      18.1         6       225
  g        40        46         7 |      14.3         8       360
In [10]:
resize(ir,1) # metadata are propagated for intra-range operations
IRanges object with 7 ranges and 3 metadata columns:
        start       end     width |       mpg       cyl      disp
    <integer> <integer> <integer> | <numeric> <numeric> <numeric>
  a         3         3         1 |        21         6       160
  b         8         8         1 |        21         6       160
  c        14        14         1 |      22.8         4       108
  d        15        15         1 |      21.4         6       258
  e        19        19         1 |      18.7         8       360
  f        34        34         1 |      18.1         6       225
  g        40        40         1 |      14.3         8       360
In [11]:
gaps(ir) # not for inter-range operations
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        30        33         4
  [2]        36        39         4

IRanges is the name of a formal class, and we can enumerate all known methods on this class:

In [12]:
length(methods(class="IRanges"))
303

Clearly there is substantial infrastructure defined for this concept. The roles of some of these methods in genome-scale analysis becomes clearer in the next section.

Exercises

The ir object that combines ranges with some information on car models is completely contrived but the following questions should be manageable.

Write the code to determine the maximum width of ranges corresponding to 6 cylinder cars.

In [13]:
max(width(ir[mcols(ir)$cyl==6]))
15

Show that the average value of miles per gallon for cars corresponding to ranges with start values greater than 14 is 18.125.

Finding overlapping regions

The problem of finding overlaps among collections of intervals is commonly encountered. There are various nuances covered in the documentation. We'll use our collection of intervals ir to illustrate the basic idea. We'll break the set into two groups.

In [14]:
ir1 = ir[c(3,7)]
ir2 = ir[-c(3,7)]
ir1
IRanges object with 2 ranges and 3 metadata columns:
        start       end     width |       mpg       cyl      disp
    <integer> <integer> <integer> | <numeric> <numeric> <numeric>
  c        14        19         6 |      22.8         4       108
  g        40        46         7 |      14.3         8       360
In [15]:
ir2
IRanges object with 5 ranges and 3 metadata columns:
        start       end     width |       mpg       cyl      disp
    <integer> <integer> <integer> | <numeric> <numeric> <numeric>
  a         3        14        12 |        21         6       160
  b         8        13         6 |        21         6       160
  d        15        29        15 |      21.4         6       258
  e        19        24         6 |      18.7         8       360
  f        34        35         2 |      18.1         6       225

We then use findOverlaps, which has two obligatory arguments: query, and subject.

In [16]:
fo = findOverlaps(ir1, ir2) # ir1 is query, ir2 is subject
fo
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           3
  [3]         1           4
  -------
  queryLength: 2 / subjectLength: 5

The elements of ir2 that were overlapped by ranges in ir1 are listed as follows:

In [17]:
ir2[subjectHits(fo)]
IRanges object with 3 ranges and 3 metadata columns:
        start       end     width |       mpg       cyl      disp
    <integer> <integer> <integer> | <numeric> <numeric> <numeric>
  a         3        14        12 |        21         6       160
  d        15        29        15 |      21.4         6       258
  e        19        24         6 |      18.7         8       360

A logical vector indicating which elements of ir1 overlapped elements of ir2 can be computed as

In [18]:
ir1 %over% ir2
  1. TRUE
  2. FALSE

GRanges to handle the context of genomic coordinates

Base positions and intervals on genomic sequences can be modeled using IRanges, but it is essential to add metadata that establish a number of contextual details. It is typical to maintain information about chromosome identity and chromosome length, along with labels for genome build and origin. We saw one example early on: apply the genes method to Homo.sapiens.

In [19]:
library(Homo.sapiens)
hg = genes(Homo.sapiens)
hg
GRanges object with 23056 ranges and 1 metadata column:
        seqnames                 ranges strand |       GENEID
           <Rle>              <IRanges>  <Rle> | <FactorList>
      1    chr19 [ 58858172,  58874214]      - |            1
     10     chr8 [ 18248755,  18258723]      + |           10
    100    chr20 [ 43248163,  43280376]      - |          100
   1000    chr18 [ 25530930,  25757445]      - |         1000
  10000     chr1 [243651535, 244006886]      - |        10000
    ...      ...                    ...    ... .          ...
   9991     chr9 [114979995, 115095944]      - |         9991
   9992    chr21 [ 35736323,  35743440]      + |         9992
   9993    chr22 [ 19023795,  19109967]      - |         9993
   9994     chr6 [ 90539619,  90584155]      + |         9994
   9997    chr22 [ 50961997,  50964905]      - |         9997
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

There is an obligatory metadata construct called seqnames that gives the chromosome occupied by the gene whose start and end positions are modeled by the associated IRanges. Strand is also recorded.

Plus strand features have the biological direction from left to right on the number line, and minus strand features have the biological direction from right to left. In terms of the IRanges, plus strand features go from start to end, and minus strand features go from end to start. This is required because width is defined as end - start + 1, and negative width ranges are not allowed. Because DNA has two strands, which have an opposite directionality, strand is necessary for uniquely referring to DNA.

Strand may have values +, -, or * for unspecified. seqinfo collects information on the chromosome names, lengths, circularity, and reference build.

Mildly advanced exercise

You might find the ordering of ranges shown above unpleasant. We can order them by chromosome:

In [20]:
hg[order(seqnames(hg))]
GRanges object with 23056 ranges and 1 metadata column:
                  seqnames                 ranges strand |       GENEID
                     <Rle>              <IRanges>  <Rle> | <FactorList>
      10000           chr1 [243651535, 244006886]      - |        10000
  100126331           chr1 [117637265, 117637350]      + |    100126331
  100126346           chr1 [154166141, 154166219]      - |    100126346
  100126348           chr1 [ 94312388,  94312467]      + |    100126348
  100126349           chr1 [166123980, 166124035]      - |    100126349
        ...            ...                    ...    ... .          ...
     283788 chrUn_gl000219       [ 53809,  99642]      - |       283788
  100507412 chrUn_gl000220       [ 97129, 126696]      + |    100507412
  100288687 chrUn_gl000228       [112605, 124171]      + |    100288687
  100653046 chrUn_gl000228       [ 86060,  90745]      + |    100653046
     728410 chrUn_gl000228       [ 79448, 113887]      + |       728410
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Add code that orders the ranges by starting position within chromosome.

Vector operations

GRanges can be treated as any standard vector.

In [21]:
hg[1:4] # first four in the lexical ordering of `names(hg)`
GRanges object with 4 ranges and 1 metadata column:
       seqnames               ranges strand |       GENEID
          <Rle>            <IRanges>  <Rle> | <FactorList>
     1    chr19 [58858172, 58874214]      - |            1
    10     chr8 [18248755, 18258723]      + |           10
   100    chr20 [43248163, 43280376]      - |          100
  1000    chr18 [25530930, 25757445]      - |         1000
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
In [22]:
sort(hg)[1:4]  # physical ordering on plus strand
GRanges object with 4 ranges and 1 metadata column:
            seqnames           ranges strand |       GENEID
               <Rle>        <IRanges>  <Rle> | <FactorList>
  100287102     chr1 [ 11874,  14409]      + |    100287102
      79501     chr1 [ 69091,  70008]      + |        79501
     643837     chr1 [762971, 794826]      + |       643837
     148398     chr1 [860530, 879961]      + |       148398
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
In [23]:
savestrand = strand(hg)
strand(hg) = "*"
sort(hg)[1:4] # different!
GRanges object with 4 ranges and 1 metadata column:
            seqnames           ranges strand |       GENEID
               <Rle>        <IRanges>  <Rle> | <FactorList>
  100287102     chr1 [ 11874,  14409]      * |    100287102
     653635     chr1 [ 14362,  29961]      * |       653635
      79501     chr1 [ 69091,  70008]      * |        79501
     729737     chr1 [134773, 140566]      * |       729737
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
In [24]:
strand(hg) = savestrand  # restore

Multichromosome context

seqinfo is an important method for/component of well-annotated GenomicRanges instances.

In [25]:
seqinfo(hg)
Seqinfo object with 93 sequences (1 circular) from hg19 genome:
  seqnames       seqlengths isCircular genome
  chr1            249250621       <NA>   hg19
  chr2            243199373       <NA>   hg19
  chr3            198022430       <NA>   hg19
  chr4            191154276       <NA>   hg19
  chr5            180915260       <NA>   hg19
  ...                   ...        ...    ...
  chrUn_gl000245      36651       <NA>   hg19
  chrUn_gl000246      38154       <NA>   hg19
  chrUn_gl000247      36422       <NA>   hg19
  chrUn_gl000248      39786       <NA>   hg19
  chrUn_gl000249      38502       <NA>   hg19
In [26]:
sum(isCircular(hg), na.rm=TRUE) # how many circular chromosomes?
1
In [27]:
seqinfo(hg)["chrM"]
Seqinfo object with 1 sequence (1 circular) from hg19 genome:
  seqnames seqlengths isCircular genome
  chrM          16571       TRUE   hg19
In [28]:
# table(seqnames(hg)) # counts of genes per chromosome (or random/unmapped contig)
In [29]:
hg[ which(seqnames(hg)=="chr22") ]
GRanges object with 535 ranges and 1 metadata column:
            seqnames               ranges strand |       GENEID
               <Rle>            <IRanges>  <Rle> | <FactorList>
  100037417    chr22 [24309026, 24314748]      + |    100037417
  100126318    chr22 [22007270, 22007347]      + |    100126318
  100128531    chr22 [25498384, 25508659]      - |    100128531
  100128946    chr22 [49262582, 49294198]      + |    100128946
  100130418    chr22 [17517460, 17539682]      + |    100130418
        ...      ...                  ...    ... .          ...
       9889    chr22 [50247497, 50283726]      + |         9889
       9929    chr22 [39081548, 39096459]      - |         9929
       9978    chr22 [41347351, 41369019]      + |         9978
       9993    chr22 [19023795, 19109967]      - |         9993
       9997    chr22 [50961997, 50964905]      - |         9997
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
In [30]:
hgs = keepStandardChromosomes(hg, pruning.mode="coarse") # eliminate random/unmapped
hgs
GRanges object with 23033 ranges and 1 metadata column:
        seqnames                 ranges strand |       GENEID
           <Rle>              <IRanges>  <Rle> | <FactorList>
      1    chr19 [ 58858172,  58874214]      - |            1
     10     chr8 [ 18248755,  18258723]      + |           10
    100    chr20 [ 43248163,  43280376]      - |          100
   1000    chr18 [ 25530930,  25757445]      - |         1000
  10000     chr1 [243651535, 244006886]      - |        10000
    ...      ...                    ...    ... .          ...
   9991     chr9 [114979995, 115095944]      - |         9991
   9992    chr21 [ 35736323,  35743440]      + |         9992
   9993    chr22 [ 19023795,  19109967]      - |         9993
   9994     chr6 [ 90539619,  90584155]      + |         9994
   9997    chr22 [ 50961997,  50964905]      - |         9997
  -------
  seqinfo: 25 sequences (1 circular) from hg19 genome
In [31]:
table(seqnames(hgs))
 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
 2326  1464  1271   873  1022  1000  1108   818   945   903  1439  1173   449 
chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY  chrM 
  779   791   938  1337   341  1607   647   296   535   918    53     0 

GRangesList for grouped genomic elements

Exons are elements of gene models. The exons method gives a flat sequence of GRanges recording exon positions. exonsBy organizes the exons into genes, yielding a special structure called GRangesList.

In [32]:
ebg = exonsBy(Homo.sapiens, by = "gene")
ebg
GRangesList object of length 23459:
$1 
GRanges object with 15 ranges and 2 metadata columns:
       seqnames               ranges strand |   exon_id   exon_name
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]    chr19 [58858172, 58858395]      - |    250809        <NA>
   [2]    chr19 [58858719, 58859006]      - |    250810        <NA>
   [3]    chr19 [58859832, 58860494]      - |    250811        <NA>
   [4]    chr19 [58860934, 58862017]      - |    250812        <NA>
   [5]    chr19 [58861736, 58862017]      - |    250813        <NA>
   ...      ...                  ...    ... .       ...         ...
  [11]    chr19 [58868951, 58869015]      - |    250821        <NA>
  [12]    chr19 [58869318, 58869652]      - |    250822        <NA>
  [13]    chr19 [58869855, 58869951]      - |    250823        <NA>
  [14]    chr19 [58870563, 58870689]      - |    250824        <NA>
  [15]    chr19 [58874043, 58874214]      - |    250825        <NA>

$10 
GRanges object with 2 ranges and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]     chr8 [18248755, 18248855]      + |  113603      <NA>
  [2]     chr8 [18257508, 18258723]      + |  113604      <NA>

...
<23457 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

Exercise

Note that the unlist(ebg) is efficiently computed and retains the gene entrez id as the name of each exon range. Give the entrez id of the gene with the longest exon.

In [33]:
# keepStandardChromosomes(ebg, pruning.mode="coarse")

Strand-aware operations

Here we present code that helps visualize strand-awareness of GRanges operations. We will assign strands for the ranges used above and then plot the results of operations that could represent isolation of transcription-start sites (resize(...,1)), identifying upstream promoter regions (flank(...,[len])) or downstream promoter regions (flank, ..., len, start=FALSE).

In [34]:
plotGRanges = function (x, xlim = x, col = "black", sep = 0.5, xlimits = c(0, 
    60), ...) 
{
    main = deparse(substitute(x))
    ch = as.character(seqnames(x)[1])
    x = ranges(x)
    height <- 1
    if (is(xlim, "Ranges")) 
        xlim <- c(min(start(xlim)), max(end(xlim)))
    bins <- disjointBins(IRanges(start(x), end(x) + 1)) 
    plot.new()
    plot.window(xlim = xlimits, c(0, max(bins) * (height + sep)))
    ybottom <- bins * (sep + height) - height
    rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, 
        col = col, ...)
    title(main, xlab = ch) 
    axis(1)
}

Now we can set up the GRanges and strand information to visualize the various elements.

In [35]:
par(mfrow=c(4,1), mar=c(4,2,2,2))
library(GenomicRanges)
gir = GRanges(seqnames="chr1", ir, strand=c(rep("+", 4), rep("-",3)))
plotGRanges(gir, xlim=c(0,60))
plotGRanges(resize(gir,1), xlim=c(0,60),col="green")
plotGRanges(flank(gir,3), xlim=c(0,60), col="purple")
plotGRanges(flank(gir,2,start=FALSE), xlim=c(0,60), col="brown")

An application of findOverlaps with the GWAS catalog

Genome-wide association studies (GWAS) are systematically catalogued in a resource managed at EMBL/EBI. We can retrieve a version of the catalog using the gwascat package.

In [36]:
library(gwascat)
data(ebicat37)
ebicat37
gwasloc instance with 22688 records and 36 attributes per record.
Extracted:   
Genome:  GRCh37 
Excerpt:
GRanges object with 5 ranges and 3 metadata columns:
      seqnames                 ranges strand |                  DISEASE/TRAIT
         <Rle>              <IRanges>  <Rle> |                    <character>
  [1]    chr11 [ 41820450,  41820450]      * | Post-traumatic stress disorder
  [2]    chr15 [ 35060463,  35060463]      * | Post-traumatic stress disorder
  [3]     chr8 [ 97512977,  97512977]      * | Post-traumatic stress disorder
  [4]     chr9 [100983826, 100983826]      * | Post-traumatic stress disorder
  [5]    chr15 [ 54715642,  54715642]      * | Post-traumatic stress disorder
             SNPS   P-VALUE
      <character> <numeric>
  [1]  rs10768747     5e-06
  [2]  rs12232346     2e-06
  [3]   rs2437772     6e-06
  [4]   rs7866350     1e-06
  [5]  rs73419609     6e-06
  -------
  seqinfo: 23 sequences from GRCh37 genome

There is one range per GWAS finding. Each range corresponds to a SNP for which an association with a given phenotype is statistically significant, and is replicated. The mcols record many types of data describing the study and the SNP.

Notice that the genome is labelled "GRCh37". This is very similar to hg19. We'll simply alter the tag. If we did not, we would encounter an 'incompatible genomes error'.

In [37]:
genome(ebicat37) = "hg19"

Now we will use findOverlaps to determine which genes have annotated intervals (we'll unpack this concept later) that overlap GWAS hits.

In [38]:
goh = findOverlaps(hg, ebicat37)
goh
Hits object with 12934 hits and 0 metadata columns:
          queryHits subjectHits
          <integer>   <integer>
      [1]         2         140
      [2]         4       10865
      [3]         4       16003
      [4]         5        6746
      [5]         5        7962
      ...       ...         ...
  [12930]     23036       11148
  [12931]     23036       11149
  [12932]     23036       22003
  [12933]     23048       19309
  [12934]     23050       21478
  -------
  queryLength: 23056 / subjectLength: 22688

We obtain an instance of the Hits class, which records the indices of query ranges and subject ranges that satisfy the default overlap condition.

In [39]:
length(unique(queryHits(goh)))
4720

We have counted the number of genes that overlap one or more GWAS hits. To determine the proportion of SNPs with GWAS associations that lie within gene regions:

In [40]:
mean(reduce(ebicat37) %over% hg)
0.511465040099401

The same pattern can be used to estimate the proportion of SNP that are GWAS hits lying in exons:

In [41]:
mean(reduce(ebicat37) %over% exons(Homo.sapiens))
0.07059753755789

Using our crude concept of promoter region (uniform in size for each gene, see ?promoters after attaching GenomicRanges), we can estimate the proportion of SNP that are GWAS hits lying in promoters.

In [42]:
pr = suppressWarnings(promoters(hg, upstream=20000)) # will slip over edge of some chromosomes
pr = keepStandardChromosomes(pr, pruning.mode="coarse")
mean(reduce(ebicat37) %over% pr)
0.173669942392409

You can use other Bioconductor annotation resources to obtain a more refined definition of 'promoter'.

Working with GRanges and assay collections

A collection of bed files

The erma package was created to demonstrate the use of the GenomicFiles discipline for interactive computing with collections of BED files. (erma abbreviates Epigenomic RoadMap Adventures.) The data have been selected from the roadmap portal, specifically this folder.

These BED files have tabix-indexed. This allows accelerated retrieval of data lying in specified genomic intervals.

We begin by attaching the package and creating an ErmaSet instance.

In [43]:
library(erma)
erset = makeErmaSet()
erset
NOTE: input data had non-ASCII characters replaced by ' '.
ErmaSet object with 0 ranges and 31 files: 
files: E002_25_imputed12marks_mnemonics.bed.gz, E003_25_imputed12marks_mnemonics.bed.gz, ..., E088_25_imputed12marks_mnemonics.bed.gz, E096_25_imputed12marks_mnemonics.bed.gz 
detail: use files(), rowRanges(), colData(), ... 
cellTypes() for type names; data(short_celltype) for abbr.

The colData method extracts information concerning the different files. In this case, each file corresponds to the classification of genomic intervals in a given cell type according to the chromatin state labeling algorithm 'ChromHmm'.

In the following cell, we will use a nice method of rendering large tables when using HTML. The DT package includes bindings to javascript functions that render searchable tables.

In [44]:
library(DT)
datatable(as.data.frame(colData(erset)))
datatables

colData is analogous to the pData that we saw in dealing with the ExpressionSet class.

The $ shortcut is available.

In [45]:
table(erset$ANATOMY)
   BLOOD    BRAIN      ESC      FAT     IPSC    LIVER     LUNG     SKIN 
      15        7        2        1        1        1        2        1 
VASCULAR 
       1 

The purpose of this collection is to illustrate how we might work with cell-type-specific information on chromatin state to interpret other genome-scale findings. A key Bioconductor package for working with BED files is rtacklayer. We'll use the import method to extract information from one interval in one file.

In [46]:
library(rtracklayer)
import(files(erset)[1], which=GRanges("chr1", IRanges(1e6,1.1e6)))
GRanges object with 68 ranges and 1 metadata column:
       seqnames             ranges strand |        name
          <Rle>          <IRanges>  <Rle> | <character>
   [1]     chr1 [ 999001, 1000000]      * |  23_PromBiv
   [2]     chr1 [1000001, 1003600]      * |   24_ReprPC
   [3]     chr1 [1003601, 1004200]      * |  23_PromBiv
   [4]     chr1 [1004201, 1004600]      * |    22_PromP
   [5]     chr1 [1004601, 1004800]      * |      1_TssA
   ...      ...                ...    ... .         ...
  [64]     chr1 [1098401, 1098600]      * |    16_EnhW1
  [65]     chr1 [1098601, 1099000]      * |    19_DNase
  [66]     chr1 [1099001, 1099200]      * |    16_EnhW1
  [67]     chr1 [1099201, 1099800]      * |    17_EnhW2
  [68]     chr1 [1099801, 1100000]      * |    16_EnhW1
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

The next file in the collection gives a different collection of chromatin state labels for a different cell type.

In [47]:
import(files(erset)[2], which=GRanges("chr1", IRanges(1e6,1.1e6)))
GRanges object with 72 ranges and 1 metadata column:
       seqnames             ranges strand |        name
          <Rle>          <IRanges>  <Rle> | <character>
   [1]     chr1 [ 999201, 1000000]      * |  23_PromBiv
   [2]     chr1 [1000001, 1003400]      * |   24_ReprPC
   [3]     chr1 [1003401, 1004000]      * |  23_PromBiv
   [4]     chr1 [1004001, 1004200]      * |     2_PromU
   [5]     chr1 [1004201, 1005000]      * |      1_TssA
   ...      ...                ...    ... .         ...
  [68]     chr1 [1098401, 1098600]      * |     2_PromU
  [69]     chr1 [1098601, 1099200]      * |    16_EnhW1
  [70]     chr1 [1099201, 1099400]      * |    17_EnhW2
  [71]     chr1 [1099401, 1099800]      * |    14_EnhA2
  [72]     chr1 [1099801, 1100000]      * |     2_PromU
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

The stateProfile function uses extractions of this sort to illustrate the layout of chromatin states in intervals upstream of gene coding regions.

In [48]:
stateProfile(erset[,26:31], symbol="IL33", upstream=1000, shortCellType=FALSE)
'select()' returned 1:many mapping between keys and columns
Warning message:
"executing %dopar% sequentially: no parallel backend registered"Scale for 'y' is already present. Adding another scale for 'y', which will
replace the existing scale.

Detailed information about the different states is provided in the erma vignette.

In summary

  • BED files are commonly encountered representations of outputs of genome-scale experiments
  • Collections of BED files can be managed with GenomicFiles objects
    • file-level data are bound in as colData
    • intervals of interest are bound in using rowRanges
  • We use rtracklayer::import to retrieve detailed information from BED files
  • An approach to high-level visualization of chromatin states across diverse cell types is provided in erma::stateProfile

Exercise

There are 15 samples with anatomy label 'BLOOD'. List their cell types, using the Standardized.Epigenome.name component of the colData(erset).

In [49]:
#data.frame(celltype=colData(erset[...

A collection of BAM files

The package RNAseqData.HNRNPC.bam.chr14 includes BAM representation of reads derived from an experiment studying the effect of knocking down HNRNPC (the gene coding for heterogeneous nuclear ribonucleoproteins C1/C2) in HeLa cells. The purpose of the experiment is to test the hypothesis that these ribonucleoproteins have a role in preventing erroneous inclusion of Alu elements in transcripts. HNRNPC is located on chr14, and the package has filtered reads from 8 runs to those aligning on chr14.

Setting up the GenomicFiles container

We will use a class and a method from the GenomicFiles package to manage and query the BAM files. The locations of the files are present in the vector RNAseqData.HNRNPC.bam.chr14_BAMFILES.

In [50]:
library(RNAseqData.HNRNPC.bam.chr14)
library(GenomicFiles)
gf = GenomicFiles(files=RNAseqData.HNRNPC.bam.chr14_BAMFILES)
gf
GenomicFiles object with 0 ranges and 8 files: 
files: ERR127306_chr14.bam, ERR127307_chr14.bam, ..., ERR127304_chr14.bam, ERR127305_chr14.bam 
detail: use files(), rowRanges(), colData(), ... 

Defining a region of interest

Note that the object gf is self-describing and has 0 ranges. We bind a GRanges to this object to define focus for upcoming operations. The GRanges that we use corresponds to the location of HNRNPC.

In [51]:
rowRanges(gf) = GRanges("chr14", IRanges(21677295,21737638))
gf
GenomicFiles object with 1 ranges and 8 files: 
files: ERR127306_chr14.bam, ERR127307_chr14.bam, ..., ERR127304_chr14.bam, ERR127305_chr14.bam 
detail: use files(), rowRanges(), colData(), ... 

Adding sample-level data

We can also assign colData to define aspects of the samples.

In [52]:
colData(gf) = DataFrame(trt=rep(c("WT", "KO"), c(4,4)))
In [53]:
gf$trt
  1. 'WT'
  2. 'WT'
  3. 'WT'
  4. 'WT'
  5. 'KO'
  6. 'KO'
  7. 'KO'
  8. 'KO'

A simple method of counting reads aligning to an interval

In the next section we will apply a black-box function that we will study more carefully latter. We are using summarizeOverlaps in GenomicAlignments. The purpose is to tabulate the counts of reads aligned to HNRNPC.

In [54]:
library(BiocParallel)
register(SerialParam())  # can be altered for multicore machines or clusters
hc = suppressWarnings(summarizeOverlaps(rowRanges(gf), files(gf), singleEnd=FALSE)) # some ambiguous pairing
hc
class: RangedSummarizedExperiment 
dim: 1 8 
metadata(0):
assays(1): counts
rownames: NULL
rowData names(0):
colnames(8): ERR127306_chr14.bam ERR127307_chr14.bam ...
  ERR127304_chr14.bam ERR127305_chr14.bam
colData names(0):

summarizeOverlaps returns a new type of object that we will examine more closely below. For the moment, we are concerned with retrieving the read counts from this object, and this uses the assay method.

In [55]:
assay(hc)
ERR127306_chr14.bamERR127307_chr14.bamERR127308_chr14.bamERR127309_chr14.bamERR127302_chr14.bamERR127303_chr14.bamERR127304_chr14.bamERR127305_chr14.bam
271131602946277986 98 158 141

These are raw counts. We will talk about normalization and statistical inference later in the course.

SummarizedExperiment for mature assay collections

The SummarizedExperiment class is defined in the SummarizedExperiment package. We implicitly created one with summarizeOverlaps just above. That was peculiar because it involved only one gene. The general situation is schematized here:

There are "tables" devoted to the features (rowData or rowRanges), assay outputs (assays), and samples (colData).

We'll get acquainted with a typical RNA-seq experiment using the airway package.

In [56]:
library(airway)
data(airway)
airway
class: RangedSummarizedExperiment 
dim: 64102 8 
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

Subsets of the SummarizedExperiment are obtained using the X[G, S] idiom. Here we confine attention to 6 features on five samples and extract the counts.

In [57]:
assay(airway[1:6,1:5])
SRR1039508SRR1039509SRR1039512SRR1039513SRR1039516
ENSG00000000003679 448 873 408 1138
ENSG00000000005 0 0 0 0 0
ENSG00000000419467 515 621 365 587
ENSG00000000457260 211 263 164 245
ENSG00000000460 60 55 40 35 78
ENSG00000000938 0 0 2 0 1

Metadata about the features is recorded at the exon level, in a GRangesList instance. Full details about how the counts were derived is in the vignette for the airway package, and additional approaches are described in Mike Love's notes.

In [58]:
rowRanges(airway)
GRangesList object of length 64102:
$ENSG00000000003 
GRanges object with 17 ranges and 2 metadata columns:
       seqnames               ranges strand |   exon_id       exon_name
          <Rle>            <IRanges>  <Rle> | <integer>     <character>
   [1]        X [99883667, 99884983]      - |    667145 ENSE00001459322
   [2]        X [99885756, 99885863]      - |    667146 ENSE00000868868
   [3]        X [99887482, 99887565]      - |    667147 ENSE00000401072
   [4]        X [99887538, 99887565]      - |    667148 ENSE00001849132
   [5]        X [99888402, 99888536]      - |    667149 ENSE00003554016
   ...      ...                  ...    ... .       ...             ...
  [13]        X [99890555, 99890743]      - |    667156 ENSE00003512331
  [14]        X [99891188, 99891686]      - |    667158 ENSE00001886883
  [15]        X [99891605, 99891803]      - |    667159 ENSE00001855382
  [16]        X [99891790, 99892101]      - |    667160 ENSE00001863395
  [17]        X [99894942, 99894988]      - |    667161 ENSE00001828996

...
<64101 more elements>
-------
seqinfo: 722 sequences (1 circular) from an unspecified genome

The sample level data is obtained using colData:

In [59]:
colData(airway)
DataFrame with 8 rows and 9 columns
           SampleName     cell      dex    albut        Run avgLength
             <factor> <factor> <factor> <factor>   <factor> <integer>
SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
           Experiment    Sample    BioSample
             <factor>  <factor>     <factor>
SRR1039508  SRX384345 SRS508568 SAMN02422669
SRR1039509  SRX384346 SRS508567 SAMN02422675
SRR1039512  SRX384349 SRS508571 SAMN02422678
SRR1039513  SRX384350 SRS508572 SAMN02422670
SRR1039516  SRX384353 SRS508575 SAMN02422682
SRR1039517  SRX384354 SRS508576 SAMN02422673
SRR1039520  SRX384357 SRS508579 SAMN02422683
SRR1039521  SRX384358 SRS508580 SAMN02422677

Exercise

Here is a variation on the twoSamplePlot function that works for SummarizedExperiments. We skip the handling of gene symbols. Given that ENSG00000103196 is the identifier for CRISPLD2, create the visualization comparing expression of this gene in dexamethasone treated samples to controls.

In [60]:
twoSamplePlotSE = function(se, stratvar, rowname) {
    boxplot(split(assay(se[rowname,]), 
                  se[[stratvar]]),xlab=stratvar, ylab=rowname)
}

Comment

The two plotting functions differ because of the different methods required to extract assay results. One way to unify them is to branch within the data preparation step.

In [61]:
makeframe = function(x, rowind, colvar) {
     if (is(x, "ExpressionSet")) xv = exprs
         else if (is(x, "SummarizedExperiment")) xv = assay
     ans = data.frame(t(xv(x[rowind,])), p=x[[colvar]], check.names=FALSE)
     names(ans)[2] = colvar
     ans
     }
makeframe(airway, "ENSG00000103196", "dex")
ENSG00000103196dex
SRR1039508 780 untrt
SRR10395095764 trt
SRR10395121167 untrt
SRR10395134211 trt
SRR1039516 887 untrt
SRR10395173763 trt
SRR1039520 720 untrt
SRR10395215633 trt

There is still the matter of our ability to take advantage of the gene-symbol resolution that we had for the ExpressionSet with fData component. Achieving semantic uniformity across different data representations is a considerable way off. Sometimes you will have to do lookups with reference data objects, sometimes you can look 'within' the data object to get the desired notation mapping.

Wrapping up

  • IRanges represent intervals
  • GRanges use IRanges to represent genomic regions
    • seqnames typically identifies chromosome
    • mcols can be used to annotate regions
    • intra-range and inter-range operations are implemented efficiently
    • the locations of all human genes is given by genes(Homo.sapiens)
    • the GWAS catalog can be represented as a GRanges
    • findOverlaps can be used to identify common locations for diverse phenomena
  • GenomicFiles manages collections of BED or similar files
    • file-level (or sample-level) metadata are bound in with colData
    • regions of interest are bound in with rowRanges
    • rtracklayer::import extracts quantitative information and metadata
    • erma::stateProfile illustrates cell-type-specific variation in chromatin states near genes of interest
    • GenomicFiles can manage BAM collections (see also BamFileList)
    • summarizeOverlaps can obtain read counts in regions of interest
  • SummarizedExperiment coordinates information on samples, features, and assay outputs for many types of experiments