forked from compgenomr/book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path06-genomicIntervals.Rmd
938 lines (695 loc) · 45.4 KB
/
06-genomicIntervals.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
# Operations on Genomic Intervals and Genome Arithmetic {#genomicIntervals}
```{r,setupGenomicIntervals,echo=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE,out.width='65%',fig.width=5.1, cache=TRUE,warning=FALSE,message=FALSE)
#read_chunk('genomicIntervals_chunks.R')
```
A considerable time in computational genomics is spent on overlapping different
features of the genome. Each feature can be represented with a genomic interval
within the chromosomal coordinate system. In addition, each interval can carry
different sorts of information. An interval may for instance represent exon coordinates or a transcription factor binding site. On the other hand,
you can have base-pair resolution, continuous scores over the genome such as read coverage or
scores that could be associated with only certain bases such as in the case of CpG
methylation (See Figure \@ref(fig:gintsum) ).
Typically, you will need to overlap intervals on interest with other features of
the genome, again represented as intervals. For example, you may want to overlap
transcription factor binding sites with CpG islands or promoters to quantify what percentage of binding sites overlap with your regions of interest. Overlapping mapped reads from high-throughput sequencing experiments with genomic features such as exons, promoters, enhancers can also be classified as operations on genomic intervals. You can think of a million other ways that involves overlapping two sets of different features on the genome. This chapter aims to show how to do analysis involving operations on genomic intervals.
```{r,gintsum,fig.cap="Summary of genomic intervals with different kinds of information",fig.align = 'center',out.width='75%',echo=FALSE}
knitr::include_graphics("images/genomeIntervalSummary.png" )
```
## Operations on Genomic Intervals with GenomicRanges package
The [Bioconductor](http://bioconductor.org) project has a dedicated package called [**GenomicRanges**](http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html) to deal with genomic intervals. In this section, we will provide use cases involving operations on genomic intervals. The main reason we will stick to this package is that it provides tools to do overlap operations. However package requires that users operate on specific data types that are conceptually similar to a tabular data structure implemented in a way that makes overlapping and related operations easier. The main object we will be using is called `GRanges` object and we will also see some other related objects from the `GenomicRanges` package.
### How to create and manipulate a GRanges object
`GRanges` (from GenomicRanges package) is the main object that holds the genomic intervals and extra information about those intervals. Here we will show how to create one. Conceptually, it is similar to a data frame and some operations such as using **[ ]** notation to subset the table will work also on GRanges, but keep in mind that not everything that works for data frames will work on GRanges objects.
```{r,createGR}
library(GenomicRanges)
gr=GRanges(seqnames=c("chr1","chr2","chr2"),
ranges=IRanges(start=c(50,150,200),
end=c(100,200,300)),
strand=c("+","-","-")
)
gr
# subset like a data frame
gr[1:2,]
```
As you can see it looks a bit like a data frame. Also, note that the peculiar second argument “ranges” which basically contains start and end positions of the genomic intervals. However, you can not just give start and end positions you actually have to provide another object of IRanges. Do not let this confuse you, GRanges actually depends on another object that is very similar to itself called `IRanges` and you have to provide the “ranges” argument as an IRanges object. In its simplest for, an IRanges object can be constructed by providing start and end positions to IRanges() function. Think of it as something you just have to provide in order to construct the GRanges object.
GRanges can also contain other information about the genomic interval such as scores, names, etc. You can provide extra information at the time of the construction or you can add it later. Here is how you can do those:
```{r,createGRwMetadata}
gr=GRanges(seqnames=c("chr1","chr2","chr2"),
ranges=IRanges(start=c(50,150,200),
end=c(100,200,300)),
names=c("id1","id3","id2"),
scores=c(100,90,50)
)
# or add it later (replaces the existing meta data)
mcols(gr)=DataFrame(name2=c("pax6","meis1","zic4"),
score2=c(1,2,3))
gr=GRanges(seqnames=c("chr1","chr2","chr2"),
ranges=IRanges(start=c(50,150,200),
end=c(100,200,300)),
names=c("id1","id3","id2"),
scores=c(100,90,50)
)
# or appends to existing meta data
mcols(gr)=cbind(mcols(gr),
DataFrame(name2=c("pax6","meis1","zic4")) )
gr
# elementMetadata() and values() do the same things
elementMetadata(gr)
values(gr)
# you may also add metadata using the $ operator, as for data frames
gr$name3 = c("A","C", "B")
gr
```
### Getting genomic regions into R as GRanges objects
There are multiple ways you can read in your genomic features into R and create a GRanges object. Most genomic interval data comes as a tabular format that has the basic information about the location of the interval and some other information. We already showed how to read BED files as data frame. Now we will show how to convert it to GRanges object. This is one way of doing it, but there are more convenient ways described further in the text.
```{r,convertDataframe2gr}
# read CpGi data set
filePath=system.file("extdata",
"cpgi.hg19.chr21.bed",
package="compGenomRData")
cpgi.df = read.table(filePath, header = FALSE,
stringsAsFactors=FALSE)
# remove chr names with "_"
cpgi.df =cpgi.df [grep("_",cpgi.df[,1],invert=TRUE),]
cpgi.gr=GRanges(seqnames=cpgi.df[,1],
ranges=IRanges(start=cpgi.df[,2],
end=cpgi.df[,3]))
```
You may need to do some pre-processing before/after reading in the BED file. Below is an example of getting transcription start sites from BED files containing refseq transcript locations.
```{r,convertDataframe2grTSS}
# read refseq file
filePathRefseq=system.file("extdata",
"refseq.hg19.chr21.bed",
package="compGenomRData")
ref.df = read.table(filePathRefseq, header = FALSE,
stringsAsFactors=FALSE)
ref.gr=GRanges(seqnames=ref.df[,1],
ranges=IRanges(start=ref.df[,2],
end=ref.df[,3]),
strand=ref.df[,6],name=ref.df[,4])
# get TSS
tss.gr=ref.gr
# end of the + strand genes must be equalized to start pos
end(tss.gr[strand(tss.gr)=="+",]) =start(tss.gr[strand(tss.gr)=="+",])
# startof the - strand genes must be equalized to end pos
start(tss.gr[strand(tss.gr)=="-",])=end(tss.gr[strand(tss.gr)=="-",])
# remove duplicated TSSes ie alternative transcripts
# this keeps the first instance and removes duplicates
tss.gr=tss.gr[!duplicated(tss.gr),]
```
Another way of doing this is from a BED file is to use `readTranscriptfeatures()`
function from the `genomation` package. This function takes care of the steps described in the code chunk above.
Reading the genomic features as text files and converting to GRanges is not the only way to create GRanges object. With the help of the [rtracklayer](http://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) package we can directly import BED files.
```{r,importbed_rtracklayer,eval=FALSE}
require(rtracklayer)
# we are reading a BED file, the path to the file
# is stored in filePathRefseq variable
import.bed(filePathRefseq)
```
Next, we will show how to use other methods to automatically obtain the data in GRanges format from online databases. But you will not be able to use these methods for every data set so it is good to now how to read data from flat files as well. We will use rtracklayer package to download data from [UCSC browser](http://genome-euro.ucsc.edu/cgi-bin/hgGateway?hgsid=197855328_moRAG0Dqe1qlcOEtNDAt09E1e6ab). We will download CpG islands as GRanges objects. The `rtracklayer` workflow we show below works like using the UCSC table browser. You need to select which species you are working with, then you need to select which dataset you need to download and lastly you download the UCSC dataset or track as `GRanges` object.
```{r,importFromUCSC,eval=FALSE}
require(rtracklayer)
session <- browserSession("UCSC",url = 'http://genome-euro.ucsc.edu/cgi-bin/')
genome(session) <- "mm9"
## choose CpG island track on chr12
query <- ucscTableQuery(session, track="CpG Islands",table="cpgIslandExt",
range=GRangesForUCSCGenome("mm9", "chr12"))
## get the GRanges object for the track
track(query)
```
There is also an interface to Ensembl database called [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html).
This package will enable you to access and import all of the datasets included
in Ensembl. Another similar package is [AnnotationHub](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html).
. This package is an aggregator for different datasets from various sources.
Using AnnotationHub one can access data sets from UCSC browser, Ensembl browser
and data sets from genomics consortiums such as ENCODE and RoadmapEpigenomics.
Although, we do not provide examples, we invite curious readers to read the
vignettes of the aforementioned annotation packages.
#### Frequently used file formats and how to read them into R as a table
There are multiple file formats in genomics but some of them you will see more
frequently than others. We already mentioned some of them. Here is a list of files
and functions to read them into R as `GRanges` objects or something coercible to
GRanges objects.
**BED**: These are used and popularized by UCSC browser, and can hold a variety of
information including exon/intron structure of transcripts in a single line.
- `genomation::readBed()`
- `genomation::readTranscriptFeatures()` good for getting intron/exon/promoters from BED12 files
- `rtracklayer::import.bed()`
**GFF**: GFF format is a tabular text format for genomic features similar to BED. However,
it is a more flexible format than BED, which makes it harder to parse at times. Many gene annotation files are in this format.
- `genomation::gffToGranges()`
- `rtracklayer::impot.gff()`
**BAM**: BAM format is compressed and indexed tabular file format designed for sequencing
reads.
- `GenomicAlignments::readGAlignments`
- `Rsamtools::scanBam` returns a data frame with columns from SAM/BAM file.
**BigWig**: This is used to for storing scores associated with genomic intervals. It is an indexed format. Similar to BAM, this makes it easier to query and only necessary portions
of the file could be loaded into memory.
- `rtracklayer::import.bw()`
**Generic Text files**: This represents any text file with the minimal information of chromosome, start and end coordinates.
- `genomation::readGeneric()`
**Tabix/Bcf**: These are tabular file formats indexed and compressed similar to
BAM. The following functions return lists rather than tabular data structures. These
formats are mostly used to store genomic variation data such as SNPs and indels.
- `Rsamtools::scanTabix`
- `Rsamtools::scanBcf`
### Finding regions that do/do not overlap with another set of regions
This is one of the most common tasks in genomics. Usually, you have a set of regions that you are interested in and you want to see if they overlap with another set of regions or see how many of them overlap. A good example is transcription factor binding sites determined by [ChIP-seq](http://en.wikipedia.org/wiki/ChIP-sequencing) experiments. In these types of experiments and followed analysis, one usually ends up with genomic regions that are bound by transcription factors. One of the standard next questions would be to annotate binding sites with genomic annotations such as promoter,exon,intron and/or CpG islands. Below is a demonstration of how transcription factor binding sites can be annotated using CpG islands. First, we will get the subset of binding sites that overlap with the CpG islands. In this case, binding sites are ChIP-seq peaks.
We can find the subset of peaks that overlap with the CpG islands using the **subsetByoverlaps()** function. You will also see another way of converting **data frames** to **GRanges**.
```{r,findPeakwithCpGi}
library(genomation)
filePathPeaks=system.file("extdata", "wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz",
package="compGenomRData")
# read the peaks from a bed file
pk1.gr=readBroadPeak(filePathPeaks)
# get the peaks that overlap with CpG islands
subsetByOverlaps(pk1.gr,cpgi.gr)
```
For each CpG island, we can count the number of peaks that overlap with a given CpG island with countOverlaps().
```{r,countOverlaps}
counts=countOverlaps(pk1.gr,cpgi.gr)
head(counts)
```
The `findOverlaps()` function can be used to see one-to-one overlaps between peaks and CpG islands. It returns a matrix showing which peak overlaps with which CpGi island.
```{r,findOverlaps}
findOverlaps(pk1.gr,cpgi.gr)
```
Another interesting thing would be to look at the distances to nearest CpG islands for each peak. In addition, just finding the nearest CpG island could also be interesting. Often times, you will need to find nearest TSS or gene to your regions of interest, and the code below is handy for doing that.
```{r,findNearest,fig.cap="histogram of distances"}
# find nearest CpGi to each TSS
n.ind=nearest(pk1.gr,cpgi.gr)
# get distance to nearest
dists=distanceToNearest(pk1.gr,cpgi.gr,select="arbitrary")
dists
# histogram of the distances to nearest TSS
dist2plot=mcols(dists)[,1]
hist(log10(dist2plot),xlab="log10(dist to nearest TSS)",
main="Distances")
```
## Dealing with mapped high-throughput sequencing reads
The reads from sequencing machines are usually pre-proccessed and aligned to the genome with the help of specific bioinformatics tools. We have introduced the read processing , quality check and alignment methods in chapter \@ref(processingReads). In this section we will deal with mapped reads. Since each mapped read has a start and end position the genome, mapped reads can be thought as genomic intervals stored in a file. After mapping, the next task is to quantify the enrichment of those aligned reads in the regions of interest. You may want to count how many reads overlapping with your promoter set of interest or you may want to quantify RNA-seq reads overlapping with exons. This is similar to operations on genomic intervals which are described previously. If you can read all your alignments into the memory and create a `GRanges` object, you can apply the previously described operations. However, most of the time we can not read all mapped reads into the memory, so we have to use specialized tools to query and quantify alignments on a given set of regions. One of the most common alignment formats is SAM/BAM format, most aligners will produce SAM/BAM output or you will be able to convert your specific alignment format to SAM/BAM format. The BAM format is a binary version of the human readable SAM format. The SAM format has specific columns that contain different kind of information about the alignment such as mismatches, qualities etc. (see [http://samtools.sourceforge.net/SAM1.pdf] for SAM format specification).
### Counting mapped reads for a set of regions
Rsamtools package has functions to query BAM files. The function we will use in the first example is countBam which takes input of the BAM file and param argument. “param” argument takes a ScanBamParam object. The object is instantiated using **ScanBamParam()** and contains parameters for scanning the BAM file. The example below is a simple example where ScanBamParam() only includes regions of interest, promoters on chr21.
```{r,countBam}
promoter.gr=tss.gr
start(promoter.gr)=start(promoter.gr)-1000
end(promoter.gr) =end(promoter.gr)+1000
promoter.gr=promoter.gr[seqnames(promoter.gr)=="chr21"]
library(Rsamtools)
bamfilePath=system.file("extdata",
"wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam",
package="compGenomRData")
# get reads for regions of interest from the bam file
param <- ScanBamParam(which=promoter.gr)
counts=countBam(bamfilePath, param=param)
```
Alternatively, aligned reads can be read in using GenomicAlignments package (which on this occasion relies on RSamtools package).
```{r,readGAlignments}
library(GenomicAlignments)
alns <- readGAlignments(bamfilePath, param=param)
```
## Dealing with continuous scores over the genome
Most high-throughput data can be viewed as a continuous score over the bases of the genome. In case of RNA-seq or ChIP-seq experiments the data can be represented as read coverage values per genomic base position. In addition, other information (not necessarily from high-throughput experiments) can be represented this way. The GC content and conservation scores per base are prime examples of other data sets that can be represented as scores. This sort of data can be stored as a generic text file or can have special formats such as Wig (stands for wiggle) from UCSC, or the bigWig format is which is indexed binary format of the wig files. The bigWig format is great for data that covers large fraction of the genome with varying scores, because the file is much smaller than regular text files that have the same information and it can be queried easier since it is indexed.
In R/Bioconductor, the continuous data can also be represented in a compressed format, in a format called Rle vector, which stands for run-length encoded vector. This gives superior memory performance over regular vectors because repeating consecutive values are represented as one value in the Rle vector (See Figure \@ref(fig:Rle) ).
```{r,Rle,fig.cap="Rle encoding explained",fig.align = 'center',out.width='100%',echo=FALSE}
knitr::include_graphics("images/Rle_demo.png" )
```
Typically, for genome-wide data you will have a RleList object which is a list of Rle vectors per chromosome. You can obtain such vectors by reading the reads in and calling *coverage()* function from *GenomicRanges* package. Let's try that on the above data set.
```{r,getCoverageFromAln}
covs=coverage(alns) # get coverage vectors
covs
```
Alternatively, you can get the coverage from the Bam file directly. Below, we are getting the coverage directly from the Bam file for our previously defined promoters.
```{r,getCoverageFromBam}
covs=coverage(bamfilePath, param=param) # get coverage vectors
```
One of the most common ways of storing score data is, as mentioned, wig or bigWig format. Most of the ENCODE project data can be downloaded in bigWig format. In addition, conservation scores can also be downloaded as wig/bigWig format. You can import bigWig files into R using *import()* function from *rtracklayer* package. However, it is generally not advisable to read the whole bigWig file in memory as it was the case with BAM files. Usually, you will be interested in only a fraction of the genome, such as promoters, exons etc. So it is best you extract the data for those regions and read those into memory rather than the whole file. Below we read the a bigWig file only for promoters. The operation returns an *GRanges* object with score column which indicates the scores in the BigWig file per genomic region.
```{r,getRleFromBigWig}
library(rtracklayer)
# File from ENCODE ChIP-seq tracks
bwFile=system.file("extdata",
"wgEncodeHaibTfbsA549.chr21.bw",
package="compGenomRData")
bw.gr=import(bwFile, which=promoter.gr) # get coverage vectors
bw.gr
```
Following this we can create an `RleList` object from the GRanges with `coverage`
function.
```{r,BigWigCov}
cov.bw=coverage(bw.gr,weight = "score")
# or get this directly from
cov.bw=import(bwFile, which=promoter.gr,as = "RleList")
```
### Extracting subsections of Rle and RleList objects
Frequently, we will need to extract subsections of the Rle vectors or RleList objects.
We will need to do this to visualize that subsection or get some statistics out
of those sections. For example, we could be interested in average coverage per
base for the regions we are interested in. We have to extract those regions
from RleList object and apply summary statistics. Below, we show how to extract
subsections of RleList object. We are extracting promoter regions from ChIP-seq
read coverage RleList. Following that, we will plot the one of the promoters associated coverage values.
```{r,getViews,fig.cap="Coverage vector extracted from RleList via Views() function is plotted as a line plot."}
myViews=Views(cov.bw,as(promoter.gr,"IRangesList")) # get subsets of coverage
# there is a views object for each chromosome
myViews
myViews[[1]]
# get the coverage vector from the 5th view and plot
plot(myViews[[1]][[5]],type="l")
```
Next, we are interested in average coverage per base for the promoters using summary
functions that works on Views object.
```{r, viewMeans}
# get the mean of the views
head(
viewMeans(myViews[[1]])
)
# get the max of the views
head(
viewMaxs(myViews[[1]])
)
```
## Genomic intervals with more information: SummarizedExperiment class
As we have seen, genomic intervals can be mainly contained in a `GRanges` object.
It can also contain additional columns associated with each interval, here
you can save information such as read counts or other scores associated with the
interval. However,
genomics data is often have many layers. With `GRanges` you can have a table
associated with the intervals, but what happens if you have many tables and each
table has some metadata associated with it. In addition, rows and columns might
have additional annotation that can not be contained by row or column names.
For these cases, `SummarizedExperiment` class is ideal. It can hold multi-layered
tabular data associated with each genomic interval and the meta-data associated with
rows and columns, or associated with each table. For example, genomic intervals
associated with the SummarizedExperiment object can be gene locations, and
each tabular data structure can be RNA-seq read counts in a time course experiment.
Each table could represent different conditions in which experiments performed.
The SummarizedExperiment class is outlined in the figure below (Figure \@ref(fig:SumExpOv) ).
```{r,SumExpOv,fig.cap="Overview of SummarizedExperiment class and functions. Adapted from SummerizedExperiment package vignette",fig.align = 'center',out.width='100%',echo=FALSE}
knitr::include_graphics("images/Summarized.Experiment.png" )
```
### Create a SummarizedExperiment object
Here we show how to create a basic SummarizedExperiment object. We will first
create a matrix of read counts. This matrix will represent read counts from
a series RNA-seq experiments from different timepoints. Following that,
we create `GRanges` object to represent the locations of the genes, and a table
for column annotations. This will include the names for the columns and any
other value we want to represent. Finally, we will create a `SummarizedExperiment`
object by combining all those pieces.
```{r,sumExpCreate}
# simulate an RNA-seq read counts table
nrows <- 200
ncols <- 6
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
# create gene locations
rowRanges <- GRanges(rep(c("chr1", "chr2"), c(50, 150)),
IRanges(floor(runif(200, 1e5, 1e6)), width=100),
strand=sample(c("+", "-"), 200, TRUE),
feature_id=paste0("gene", 1:200))
# create table for the columns
colData <- DataFrame(timepoint=1:6,
row.names=LETTERS[1:6])
# create SummarizedExperiment object
se=SummarizedExperiment(assays=list(counts=counts),
rowRanges=rowRanges, colData=colData)
se
```
### Subset and manipulate the SummarizedExperiment object
Now that we have a SummarizedExperiment object, we can subset it and extract/change
parts of it.
#### Extracting parts of the object
`colData()` and `rowData()` extract the column associated and row associated
tables. `metaData()` extracts the meta-data table if there is any table associated.
```{r,extractSe}
colData(se) # extract column associated data
rowData(se) # extrac row associated data
```
To extract the main table or tables that contain the values of interest such
as read counts. We must use the `assays()` function. This returns a list of
DataFrame objects associated with the object.
```{r,assaysSe}
assays(se) # extract list of assays
```
You can use names with `$` or `[]` notaion to extract specific tables from the list.
```{r,assaysSe2,eval=FALSE}
assays(se)$counts # get the table named "counts"
assays(se)[[1]] # get the first table
```
#### subsetting
Subsetting is easy using `[ ]` notation. This is similar to the way we
subset data frames or matrices.
```{r,subsetSe1}
# subset the first five transcripts and first three samples
se[1:5, 1:3]
```
One can also use `$` operator to subset based on colData() columns. You can
extract certain samples or in our case time points.
```{r,subsetSe2,eval=FALSE}
se[, se$timepoint == 1]
```
In addition, as `SummarizedExperiment` objects are `GRanges` objects on streoids,
they support all of the `findOverlaps()` methods and associated functions that
work on `GRanges` objects.
```{r,subsetByolapSE}
# Subset for only rows which are in chr1:100,000-1,100,000
roi <- GRanges(seqnames="chr1", ranges=100000:1100000)
subsetByOverlaps(se, roi)
```
## Visualizing and summarizing genomic intervals
Data integration and visualization is corner stone of genomic data analysis. Below, we will
show different ways of integrating and visualizing genomic intervals. These methods
can be use to visualize large amounts of data in a locus-specific or multi-loci
manner.
### Visualizing intervals on a locus of interest
Often times, we will be interested in particular genomic locus and try to visualize
different genomic datasets over that locus. This is similar to looking at the data
over one of the genome browsers. Below we will display genes, GpG islands and read
coverage from a ChIP-seq experiment using Gviz package.For Gviz, we first need to
set the tracks to display. The tracks can be in various formats. They can be R
objects such as IRanges,GRanges and data.frame, or they can be in flat file formats
such as BigWig,BED and BAM. After the tracks are set, we can display them with
`plotTracks` function.
```{r,fig.cap="tracks visualized using Gviz"}
library(Gviz)
# set tracks to display
# set CpG island track
cpgi.track=AnnotationTrack(cpgi.gr,
name = "CpG")
# set gene track
# we will get this from EBI Biomart webservice
gene.track <- BiomartGeneRegionTrack(genome = "hg19",
chromosome = "chr21",
start = 27698681, end = 28083310,
name = "ENSEMBL")
# set track for ChIP-seq coverage
chipseqFile=system.file("extdata",
"wgEncodeHaibTfbsA549.chr21.bw",
package="compGenomRData")
cov.track=DataTrack(chipseqFile,type = "l",
name="coverage")
# call the display function plotTracks
track.list=list(cpgi.track,gene.track,cov.track)
plotTracks(track.list,from=27698681,to=28083310,chromsome="chr21")
```
### Summaries of genomic intervals on multiple loci
Looking at data one region at a time could be inefficient. One can summarize
different data sets over thousands of regions of interest and identify patterns.
This summaries can include different data types such as motifs, read coverage
and other scores associated with genomic intervals. The `genomation` package can
summarize and help identify patterns in the datasets. The datasets can have
different kinds of information and multiple file types can be used such as BED, GFF, BAM and bigWig. We will look at H3K4me3 ChIP-seq and DNAse-seq signals from H1 embryonic stem cell line. H3K4me3 is usually associated with promoters and regions with high DNAse-seq signal are associated with accessible regions, that means mostly regulatory regions. We will summarize those datasets around the transcription start sites (TSS) of genes on chromosome 20 of human hg19 assembly. We will first read the genes and extract the region around TSS, 500bp upstream and downstream. We will then create a matrix of ChIP-seq scores for those regions, each row will represent a region around a specific TSS and columns will be the scores per base. We will then plot, average enrichment values around the TSSes of genes on chromosome 20.
```{r,fig.cap="meta region plot using genomation"}
# get transcription start sites on chr20
library(genomation)
transcriptFile=system.file("extdata",
"refseq.hg19.chr20.bed",
package="compGenomRData")
feat=readTranscriptFeatures(transcriptFile,
remove.unusual = TRUE,
up.flank = 500, down.flank = 500)
prom=feat$promoters # get promoters from the features
# get for H3K4me3 values around TSSes
# we use strand.aware=TRUE so - strands will
# be reversed
H3K4me3File=system.file("extdata",
"H1.ESC.H3K4me3.chr20.bw",
package="compGenomRData")
sm=ScoreMatrix(H3K4me3File,prom,
type="bigWig",strand.aware = TRUE)
# look for the average enrichment
plotMeta(sm, profile.names = "H3K4me3", xcoords = c(-500,500),
ylab="H3K4me3 enrichment",dispersion = "se",
xlab="bases around TSS")
```
The pattern we see is expected, there is a dip just around TSS and signal is more
intense on the downstream of the TSS. We can also plot a heatmap where each row is a
region around TSS and color coded by enrichment. This can show us not only the
general pattern as in the meta-region
plot but also how many of the regions produce such a pattern.
```{r,fig.cap="Heatmap of enrichment of H3K4me2 around TSS"}
heatMatrix(sm,order=TRUE,xcoords = c(-500,500),xlab="bases around TSS")
```
Here we saw that about half of the regions do not have any signal. In addition it seems the multi-modal profile we have observed earlier is more complicated. Certain regions seems to have signal on both sides of the TSS, whereas others have signal mostly on the downstream side.
Normally, there would be more than one experiment or we can integrate datasets from
public repositories. In this case, we can see how different signals look like on the regions we are interested in. Now, we will also use DNAse-seq data and create a list of matrices with our datasets and plot the average profile of the signals from both datasets.
```{r,fig.cap= "Average profiles of DNAse and H3K4me3 ChIP-seq",out.width='60%'}
DNAseFile=system.file("extdata",
"H1.ESC.dnase.chr20.bw",
package="compGenomRData")
sml=ScoreMatrixList(c(H3K4me3=H3K4me3File,
DNAse=DNAseFile),prom,
type="bigWig",strand.aware = TRUE)
plotMeta(sml)
```
We should now look at the heatmaps side by side and we should also cluster the rows
based on their similarity. We will be using `multiHeatMatrix` since we have multiple ScoreMatrix objects in the list. In this case, we will also use `winsorize` argument to limit extreme values,
every score above 95th percentile will be equalized the the value of the 95th percentile. In addition, `heatMatrix` and `multiHeatMatrix` can cluster the rows.
Below, we will be using k-means clustering with 3 clusters.
```{r,multiHeatMatrix,fig.cap= "Heatmaps of H3K4me3 and DNAse data",out.width='60%'}
set.seed(1029)
multiHeatMatrix(sml,order=TRUE,xcoords = c(-500,500),
xlab="bases around TSS",winsorize = c(0,95),
matrix.main = c("H3K4me3","DNAse"),
column.scale=TRUE,
clustfun=function(x) kmeans(x, centers=3)$cluster)
```
This revealed a different picture than we have observed before. Almost half of the promoters have no signal for DNAse or H3K4me3; these regions are probably not active and associated genes are not expressed. For regions with H3K4me3 signal, there are two major patterns. One pattern where both downstream and upstream of the TSS are enriched. On the other pattern, mostly downstream of the TSS is enriched.
### Making karyograms and circos plots
Chromosomal karyograms and circos plots are beneficial for displaying data over the
whole genome of chromosomes of interest. Although,the information that can be
displayed over these large regions are usually not very clear and only large trends
can be discerned by eye, such as loss of methylation in large regions or genome-wide.
Below, we are showing how to use `ggbio` package for plotting.
This package has a slightly different syntax than base graphics. The syntax follows
[grammar of graphics](http://www.springer.com/de/book/9780387245447) logic. It is
a deconstructed way of thinking about the plot. You add your data and apply mappings
and transformations in order to achieve the final output. In ggbio, things are
relatively easy since a high-level function `autoplot` function will recognize
most of the datatypes and guess the most appropriate plot type. You can change
it is behavior by applying low-level functions. We first get the sizes of chromosomes
and make a karyogram template.
```{r,karyo1,fig.cap= "Karyogram example"}
library(ggbio)
data(ideoCyto, package = "biovizBase")
p <- autoplot(seqinfo(ideoCyto$hg19), layout = "karyogram")
p
```
Next, we would like to plot CpG islands on this Karyogram. We simply do this
by adding a layer with `layout_karyogram` function.
```{r,karyo2,fig.cap= "Karyogram of CpG islands over the human genome"}
# read CpG islands from a generic text file
CpGiFile=filePath=system.file("extdata",
"CpGi.hg19.table.txt",
package="compGenomRData")
cpgi.gr=genomation::readGeneric(CpGiFile,
chr = 1, start = 2, end = 3,header=TRUE,
keep.all.metadata =TRUE,remove.unusual=TRUE )
p + layout_karyogram(cpgi.gr)
```
Next, we would like to plot some data over the chromosomes. This could be ChIP-seq
signal
or any other signal over the genome, we will use CpG island scores from the data set
we read earlier. We will plot a point proportional to "obsExp" column in the data set. We use `ylim` argument to squish the chromosomal rectangles and plot on top of those. `aes` argument defines how the data is mapped to geometry. In this case,
it says the points will have x coordinate from CpG island start positions and y coordinate from obsExp score of CpG islands.
```{r,karyoCpG,fig.cap="Karyogram of CpG islands and their observed/expected scores over the human genome"}
p + layout_karyogram(cpgi.gr, aes(x= start, y = obsExp),
geom="point",
ylim = c(2,50), color = "red",
size=0.1,rect.height=1)
```
Another way to depict regions or quantitative signals on the chromosomes is circos plots. These are circular plots usually used for showing chromosomal rearrangements, but can also be used for depicting signals.`ggbio` package can produce all kinds of circos plots. Below, we will show how to use that for our CpG island score example.
```{r,"circosCpG",fig.cap="circos plot for CpG islands scores"}
# set the chromsome in a circle
# color set to white to look transparent
p <- ggplot() + layout_circle(ideoCyto$hg19, geom = "ideo", fill = "white",
colour="white",cytoband = TRUE,
radius = 39, trackWidth = 2)
# plot the scores as points
p <- p + layout_circle(cpgi.gr, geom = "point", grid=TRUE,
size = 0.01, aes(y = obsExp),color="red",
radius = 42, trackWidth = 10)
# set the chromosome names
p <- p + layout_circle(as(seqinfo(ideoCyto$hg19),"GRanges"),
geom = "text", aes(label = seqnames),
vjust = 0, radius = 55, trackWidth = 7,
size=3)
# display the plot
p
```
## Exercises
```
> dir()
[1] "GenomicInterval.exercises.html"
```
The data for the exercises is located at `GenomicIntervals_data/data` folder.
Run the following to see the data files.
```
dir("GenomicIntervals_data/data")
```
### Operations on Genomic Intervals with GenomicRanges package
####
Create a GRanges object using the information in the table below:
| chr | start | end |strand | score |
| :--- |:------| :-----| :-----|:-----|
| chr1 | 10000 | 10300 | + | 10 |
| chr1 | 11100 | 11500 | - | 20 |
| chr2 | 20000 | 20030 | + | 15 |
####
use `start()`, `end()`, `strand()`,`seqnames()` and `width()` functions on the GRanges
object you created. Figure out what they are doing. Can you get a subset of GRanges object for intervals that are only on + strand? If you can do that, try getting intervals that are on chr1. *HINT:* GRanges objects can be subset using [ ] operator similar to data.frames but you may need
to use `start()`, `end()` and `strand()`,`seqnames()` within the [ ].
```{r,eval=FALSE,echo=FALSE}
gr0=GRanges(seqnames=c("chr1","chr1","chr2"),
range=IRanges(start=c(10000,11100,2000),
end=c(10300,11500,20030)),
score=c(10,20,15))
start(gr0)
end(gr0)
strand(gr0)
seqnames(gr0)
width(gr0)
gr0[seqnames(gr0)=="chr1",]
```
####
Import mouse (mm9 assembly) CpG islands and refseq transcripts for chr12 from UCSC browser as `GRanges` objects using `rtracklayer` functions. HINT: Check the lecture material and modify the code there as necessary. If that somehow does not work, go to UCSC browser and download it as a BED file. The trackname for Refseq genes is "RefSeq Genes" and table name is "refGene".
```{r,eval=FALSE,echo=FALSE}
require(rtracklayer)
session <- browserSession("UCSC",url='http://genome-euro.ucsc.edu/cgi-bin/')
genome(session) <- "mm9"
## choose CpG island track on chr12
cpgi <- ucscTableQuery(session, track="CpG Islands",table="cpgIslandExt",
range=GRangesForUCSCGenome("mm9", "chr12"))
## get the GRanges object for the track
cpgi.gr=track(cpgi)
## choose CpG island track on chr12
query <- ucscTableQuery(session, track="RefSeq Genes",table="refGene",
range=GRangesForUCSCGenome("mm9", "chr12"))
## get the GRanges object for the track
ref.gr=track(query)
```
####
Following from the exercise above, get the promoters of Refseq transcripts (-1000bp and +1000 bp of the TSS) and calculate what percentage of them overlap with CpG islands. HINT: You have to get the promoter coordinates and use `findOverlaps()` or `subsetByOverlaps()` from `GenomicRanges` package. To get promoters, type `?promoters` on the R console and see how to use that function to get promoters or calculate their coordinates as shown in the lecture material.
```{r,eval=FALSE,echo=FALSE}
proms=promoters(ref.gr, upstream=1000, downstream=1000)
sub.cpgi=subsetByOverlaps(cpgi.gr,proms)
100*length(sub.cpgi)/length(cpgi.gr)
```
####
Plot the distribution of CpG island lengths for CpG islands that overlap with the
promoters.
```{r,eval=FALSE,echo=FALSE}
hist(width(sub.cpgi))
median(width(sub.cpgi))
```
####
Get canonical peaks for SP1 (peaks that are in both replicates) on chr21. Peaks for each replicate are located in `GenomicIntervals_data/data/wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz` and `GenomicIntervals_data/data/wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep2.broadPeak.gz` files. HINT: You need to use `findOverlaps()` or `subsetByOverlaps()` to get the subset of peaks that occur in both replicates. You can try to read *broadPeak.gz files using genomation function `readBroadPeak`, broadPeak is just an extended BED format.
EXTRA credit: Try use `coverage()` and slice`()` functions to get canonical peaks.
```{r,eval=FALSE}
library(genomation)
rep1=readBroadPeak("GenomicIntervals_data/data/wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz")
rep2=readBroadPeak("GenomicIntervals_data/data/wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep2.broadPeak.gz")
```
```{r,eval=FALSE,echo=FALSE}
subsetByOverlaps(rep1,rep2)
```
### Dealing with mapped high-throughput sequencing reads
####
Count the reads overlapping with canonical Sp1 peaks using the BAM file for one of the replicates: `GenomicIntervals_data/data/wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam`. **HINT**:
Use functions from `GenomicAlignments`, see [lecture notes](GenomicIntervals_data/lectures/genomicIntervalsTutorial.html).
```{r,eval=FALSE,echo=FALSE}
canon=subsetByOverlaps(rep1,rep2)
library(Rsamtools)
bamfile="GenomicIntervals_data/data/wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam"
# get reads for regions of interest from the bam file
param <- ScanBamParam(which=canon)
counts=countBam(bamfile, param=param)
```
### Dealing with contiguous scores over the genome
####
Extract Views object for the promoters on chr20 from `H1.ESC.H3K4me1.chr20.bw` file available at **CompGenomRData** package. Plot the first "View" as a line plot. **HINT**: see the code in the relevant section and adapt the code from there.
####
Make a histogram of the maximum signal for the Views in the object you extracted above. You can use any of the view summary functions or use lapply() and write your own summary function.
####
Get the genomic positions of maximum signal in each view and make a GRanges object. **HINT**: See ?viewRangeMaxs help page. Try to make a GRanges object out of the returned object.
### Visualizing and summarizing genomic intervals
####
Extract -500,+500 bp regions around TSSes on chr21, there are refseq files in the compGenomRData package or you can
pull the data out of UCSC browser. Use SP1 ChIP-seq data in the compGenomRData package, access the file path via `system.file() `command
(`wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam` ) to create an average profile of read coverage around TSSes. Following that, visualize the read coverage with a heatmap. **HINT**: All of these possible using `genomation` package functions. Check `help(ScoreMatrix)` to see how you can use bam files.
```{r,eval=FALSE,echo=FALSE}
transcriptFilechr21=system.file("extdata",
"refseq.hg19.chr21.bed",
package="compGenomRData")
gene.chr21=readTranscriptFeatures(transcriptFilechr21,
up.flank = 500,down.flank = 500)
prom.chr21=gene.chr21$promoters
bamFile=system.file("extdata",
"wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam",
package="compGenomRData")
sm=ScoreMatrix(bamFile,prom.chr21,
type="bam",strand.aware = TRUE)
plotMeta(sm)
```
####
Extract -500,+500 bp regions around TSSes on chr20. Use H3K4me3 (`H1.ESC.H3K4me3.chr20.bw`) and H3K27ac (`H1.ESC.H3K27ac.chr20.bw`) ChIP-seq enrichment data in the compGenomRData package and create heatmaps and average signal profiles for regions around the TSSes.
```{r,eval=FALSE,echo=FALSE}
transcriptFile=system.file("extdata",
"refseq.hg19.chr20.bed",
package="compGenomRData")
gene.chr20=readTranscriptFeatures(transcriptFile,up.flank = 500,down.flank = 500)
prom.chr20=gene.chr20$promoters
h3k4me3File=system.file("extdata",
"H1.ESC.H3K4me3.chr20.bw",
package="compGenomRData")
H3K27acFile=system.file("extdata",
"H1.ESC.H3K27ac.chr20.bw",
package="compGenomRData")
sml=ScoreMatrixList(c(h3k4me3File,
H3K27acFile),prom.chr20,
type="bigWig",strand.aware = TRUE)
# look for the average enrichment
plotMeta(sml, profile.names = c("H3K4me3","H3K27ac"), xcoords = c(-500,500),
ylab="log2enrichment",dispersion = "se",
xlab="bases around TSS")
multiHeatMatrix(sml,order=TRUE,winsorize = c(0,95),
matrix.main = c("H3K4me3","H3K27ac"))
```
####
Download P300 ChIP-seq peaks data from UCSC browser. The peaks are locations where P300 binds. The P300 binding marks enhancer regions in the genome. (HINT: group: "regulation", track: "Txn Factor ChIP", table:"wgEncodeRegTfbsClusteredV3", you need to filter the rows for "EP300" name )
Check enrichment of H3K4me3, H3K27ac and DNase-seq (`H1.ESC.dnase.chr20.bw`) experiments on chr20, use data from compGenomRData package. Make multi-heatmaps and metaplots, what is different from TSS profiles ?
```{r,eval=FALSE}
library(genomation)
peaks=readBed("~/Downloads/chip.peaks.bed")
p300.gr=peaks[mcols(peaks)$name=="EP300",]
rs.p300.gr=resize(p300.gr,width=1000,fix="center" )
rs.p300.gr=rs.p300.gr[seqnames(rs.p300.gr)=="chr20",]
dnaseFile=system.file("extdata",
"H1.ESC.dnase.chr20.bw",
package="compGenomRData")
sml=ScoreMatrixList(c(h3k4me3File,
H3K27acFile,dnaseFile),rs.p300.gr,
type="bigWig",strand.aware = FALSE)
plotMeta(sml)
multiHeatMatrix(sml,
winsorize = c(0,95),
matrix.main=c("H3K4me3","H3K27ac","DNase"),
clustfun=function(x) kmeans(x, centers=3)$cluster)
```
####
Cluster the rows of multi-heatmaps? Are there obvious clusters ? HINT: check arguments of multiHeatMatrix() function.
####
Visualize one of the -500,+500 bp regions around TSS using `Gviz` functions. You should visualize both H3K4me3 and H3K27ac and the gene models.
```{r,eval=FALSE,echo=FALSE}
library(Gviz)
H3K4me3.track=DataTrack("GenomicIntervals_data/data/H1.ESC.H3K4me3.chr20.bw",type ="l",
chromosome = "chr20",
name="H3K4me3")
H3K27ac.track=DataTrack("GenomicIntervals_data/data/H1.ESC.H3K27ac.chr20.bw",type = "l",
name="H3K27ac")
track.list=list(H3K4me3.track,H3K27ac.track)
prom.chr20[1,]
prom.chr20[2,]
prom.chr20[729,]
plotTracks(track.list,from=49547020,to=49548020,
chromosome="chr20")
```