10-bs-seq-analysis.Rmd

# DNA methylation analysis using bisulfite sequencing data {#bsseq}


```{r, bssetup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE,out.width='65%',fig.width=5.1, cache=TRUE,warning=FALSE,message=FALSE)
```

Epigenome consists of chemical modifications of DNA and histones. These modifications are shown to be associated with gene regulation in various settings (see Chapter \@ref(intro) for an intro). These modifications in turn have specific importance for cell type identification. There are many different ways of measuring such modifications. We have shown how histone modifications can be measured in a genome-wide manner in Chapter \@ref(chipseq) using ChIP-seq. In this chapter we will focus on the analysis of DNA methylation data using data from a technique called Bisulfite sequencing (BS-seq). We will introduce how to process data and data quality checks, as well as statistical analysis relevant for BS-seq data.

## What is DNA methylation ?
Cytosine methylation (5-methylcytosine, 5mC) is one of the main covalent base modifications in eukaryotic genomes, generally observed on CpG dinucleotides. Methylation can also rarely occur in non-CpG context, but this was mainly observed in human embryonic stem and neuronal cells [@Lister2009-sd; @Lister2013-vs]. DNA methylation is a part of the epigenetic regulation mechanism of gene expression. It is cell-type specific DNA modification. It is reversible but mostly remains stable through cell division.  There are roughly 28 million CpGs in the human genome, 60–80% are generally methylated. Less than 10% of CpGs occur in CG-dense regions that are termed CpG islands in the human genome [@Smith2013-jh]. It has been demonstrated that DNA methylation is also not uniformly distributed over the genome, but rather is associated with CpG density. In vertebrate genomes, cytosine bases are usually unmethylated in CpG-rich regions such as CpG islands and tend to be methylated in CpG-deficient regions. Vertebrate genomes are largely CpG deficient except at CpG islands. Conversely, invertebrates such as Drosophila melanogaster and Caenorhabditis elegans do not exhibit cytosine methylation and consequently do not have CpG rich and poor regions but rather a steady CpG frequency over their genomes [@Deaton2011-pm]. 

### How DNA methylation is set ?
DNA methylation is established by DNA methyltransferases DNMT3A and DNMT3B in combination with DNMT3L and maintained through cell division by the methyltransferase DNMT1 and associated proteins. DNMT3a and DNMT3b are in charge of the de novo methylation during early development. Loss of 5mC can be achieved passively by dilution during replication or exclusion of DNMT1 from the nucleus. Recent discoveries of ten-eleven translocation (TET) family of proteins and their ability to convert 5-methylcytosine (5mC) into 5-hydroxymethylcytosine (5hmC) in vertebrates provide a path for catalysed active DNA demethylation [@Tahiliani2009-ar]. Iterative oxidations of 5hmC catalysed by TET result in 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC). 5caC mark is excised from DNA by G/T mismatch-specific thymine-DNA glycosylase (TDG), which as a result returns cytosine residue back to its unmodified state [@He2011-pw]. Apart from these, mainly bacteria but possibly higher eukaryotes contain base modifications on bases other than cytosine, such as methylated adenine or guanine [@Clark2011-sc].

### How to measure DNA methylation with bisulfite-sequencing
One of the most reliable and popular ways to measure DNA methylation is high-throughput bisulfite sequencing. This method, and the related ones, allow measurement of DNA methylation at the single nucleotide resolution. The bisulfite conversion turns unmethylated Cs to Ts and methylated Cs remain intact. Then, the only thing to do is to align the reads with those C->T conversions and count C->T mutations to calculate fraction of methylated bases. In the end, we can get quantitative genome-wide measurements for DNA methylation. 

## Analyzing DNA methylation data
For the remainder of this chapter, we will explain how to do DNA methylation analysis using R. The analysis process is somewhat similar to the analysis patterns observed in other sequencing data analyses. The process can be chunked to four main parts with further sub-chunks:

1. Processing raw data
  - Quality check
  - Alignment and post-alignment processing
  - Methylation calling
  - Filtering bases
2. Exploratory analysis
  - Clustering
  - PCA
3. Finding interesting regions
  - Differential methylation
  - Methylation segmentation 
4. Annotating interesting regions
  - Nearest genes
  - Annotation with other genomic features
  - Integration with other quantitative genomics data

## Processing raw data and getting data into R
The rawest form of data that most users get is probably in the form of fastq files obtained from the sequencing experiments. We will describe necessary steps and the tools that can be used for raw data processing and if exists we will mention their R equivalents. However, the data processing is usually done outside of the R framework and for the following sections we will assume that the data processing is done and our analysis is starting from methylation call files.

Typical data processing step starts with data quality check. The fastq files are first ran through a quality check software that shows the quality of the sequencing run. We would typically use [fastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for this. However, there are several bioconductor packages that could be of use, such as [Rqc](https://bioconductor.org/packages/release/bioc/html/Rqc.html) and [QuasR](https://bioconductor.org/packages/release/bioc/html/QuasR.html). Following the quality check, provided everything is OK, the reads can be aligned to the genome. Before the alignment adapters or low quality ends of reads can be trimmed to increase number of alignments. Low quality ends mostly likely have poor basecalls which will lead to many mismatches. Reads with non-trimmed adapters will also not align to the genome. We would use adapter trimming tools such as [cutadapt]() or [flexbar]() for this purpose, although there are a bunch of them to be chosen from. Following this, reads are aligned to the genome with a bisulfite treatment aware aligner. For our own purposes, we use [Bismark](), however there are other equally accurate aligners, some are reviewed [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378906/). In addition, Bioconductor package [QuasR](https://bioconductor.org/packages/release/bioc/html/QuasR.html) can align BS-seq reads within the R framework.

After alignment, we need to call C->T conversions and calculate fraction/percentage of methylation. Most of the time aligners come with auxiliary tools that calculate per base methylation values. Normally, they output a tabular format containing the location of the Cs and methylation value and strand. Within R, [QuasR](https://bioconductor.org/packages/release/bioc/html/QuasR.html) and [methylKit](https://bioconductor.org/packages/release/bioc/html/methylKit.html) can call methylation values from BAM files albeit with some limitations. In essence, these methylation call files can be easily read into R and downstream analysis within R starts from that point. An important quality measure at this stage is to look at conversion rate. This simply means how many unmethylated Cs are converted to Ts. Since we expect, non-CpG methylation to be rare. We can simply count number of C->T conversions in non-CpG context and calculate conversion rate. The best way to do this would be via spike-in sequences where we expect no methylation at all. Since, non-CpG methylation is tissue specific, calculating conversion rate using non-CpG Cs might be misleading in some cases.

## Data filtering and exploratory analysis
We assume that we start the analysis in R with the methylation call files. We will read those files in and carry out exploratory analysis, we will show how to filter bases or regions from the data and in what circumstances we might need to do so. We will use [methylKit](https://bioconductor.org/packages/release/bioc/html/methylKit.html)[@Akalin2012-af] package for the bulk of the analysis. 

### Reading methylation call files
A typical methylation call file looks like this:

```{r, eval=TRUE, echo=FALSE}
tab <- read.table( system.file("extdata", "test1.myCpG.txt", package = "methylKit"),header=TRUE,nrows=5)
tab
#knitr::kable(tab)
```


Most of the time bisulfite sequencing experiments have test and control  samples. The test samples can be from a disease tissue while the control  samples can be from a healthy tissue. You can read a set of methylation call files that have test/control conditions giving `treatment` vector option. The treatment vector defines the sample groups and it is very important for the differential methylation analysis. For sake of subsequent analysis, file.list, sample.id and treatment option should have the same order. In the following example, first two files have the sample ids "test1" and "test2" and as determined by treatment vector they belong to the same group. The third and fourth files have sample ids "ctrl1" and "ctrl2" and they belong to the same group as indicated by the treatment vector. We will first get a list of file paths and have a look at the content.


```{r,message=FALSE}
library(methylKit)
file.list=list( system.file("extdata", 
                            "test1.myCpG.txt", package = "methylKit"),
                system.file("extdata",
                            "test2.myCpG.txt", package = "methylKit"),
                system.file("extdata", 
                            "control1.myCpG.txt", package = "methylKit"),
                system.file("extdata", 
                            "control2.myCpG.txt", package = "methylKit") )

file.list
```

As you can see `file.list` variable is a simple list of file paths. Each file contains methylation calls for a given sample. Now, we can read the files with `methRead()` function.
```{r read files}
# read the files to a methylRawList object: myobj
myobj=methRead(file.list,
           sample.id=list("test1","test2","ctrl1","ctrl2"),
           assembly="hg18",
           treatment=c(1,1,0,0),
           context="CpG"
           )
```

tab-separated bedgraph like formats from Bismark methylation caller can also be read in by methylkit. In those cases, we have to provide either `pipeline="bismarkCoverage"`  or `pipeline="bismarkCytosineReport"` to `methRead` function. In addition to the options we mentioned above,
any tab separated text file with a generic format can be read in using methylKit, 
such as methylation ratio files from [BSMAP](http://code.google.com/p/bsmap/).
See [here](http://zvfak.blogspot.com/2012/10/how-to-read-bsmap-methylation-ratio.html) for an example.

Before we move on, let us have a look at what kind of information is stored in `myobj`. This is technically a `methylRawList` object, which is essentially a list of `methylRaw` objects. These objects hold
the information for location of Cs, and methylated Cs and unmethylated Cs.
```{r}
## inside the methylRawList object
length(myobj)
head(myobj[[1]])
```

### Further quality check
It is always a good idea to check how the data looks like before proceeding further. For example, the methylation values should have bimodal distribution generally. This can be checked via 
`getMethylationStats` function. Normally, we should see a bimodal
distributions. Strong deviations from the bimodality may be due poor experimental quality, such as problems with bisulfite treatment.
```{r methStats}
getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
```

In addition, we might want to see coverage values. By default, methylkit handles bases with at least 10X coverage by that can be changed. The bases with unusually high coverage is usually alarming. It might indicate a PCR bias issue in the experimental procedure. The general coverage statistics can be checked with 
`getCoverageStats` function. 

```{r coverageStats}
getCoverageStats(myobj[[2]],plot=TRUE,both.strands=FALSE)

```

It might be useful to filter samples based on coverage. Particularly, if our samples are suffering from PCR bias it would be useful to discard bases with very high read coverage. Furthermore, we would also like to discard bases that have low read coverage, a high enough read coverage will increase the power of the statistical tests. The code below filters a `methylRawList` and discards bases that have coverage below 10X and also discards the bases that have more than 99.9th percentile of coverage in each sample.

```{r filterCovMeth}
filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL,
                                      hi.count=NULL,hi.perc=99.9)
```


### Merging samples into a single table
When we first read the files, each file is stored as its own entity. If we want compare samples in any way, we need to make a unified data structure that contains CpGs that are covered in most samples. The `unite` function creates a new object using the CpGs covered in each sample. This means 
```{r uniteMeth}
## we use :: notation to make sure unite() function from methylKit is called
meth=methylKit::unite(myobj, destrand=FALSE)
```

Let us take a look at the data content of methylBase object:

```{r headMeth}
head(meth)
```

By default, `unite` function produces bases/regions covered in all samples. That requirement can be relaxed using "min.per.group" option in `unite` function.

```{r,eval=FALSE}
# creates a methylBase object, 
# where only CpGs covered with at least 1 sample per group will be returned

# there were two groups defined by the treatment vector, 
# given during the creation of myobj: treatment=c(1,1,0,0)
meth.min=unite(myobj,min.per.group=1L)
```

### Filtering CpGs
We might need to filter the CpGs further before exploratory analysis or even before the downstream analysis such as differential methylation . For exploratory analysis, it is of general interest to see how samples relate to each other and we might want to remove CpGs that are not variable before doing that. Or we might remove Cs that are potentially C->T mutations. First, we show how to
filter based on variation. Below, we extract percent methylation values from CpGs as a matrix. Calculate standard deviation for each CpG and filter based on standard deviation. We also plot the the distribution of per CpG standard deviations.
```{r methVar}
pm=percMethylation(meth) # get percent methylation matrix
mds=matrixStats::rowSds(pm) # calculate standard deviation of CpGs
head(meth[mds>20,])

hist(mds,col="cornflowerblue",xlab="Std. dev. per CpG")
```

Now, let's assume we know the locations of C->T mutations. These locations should be removed from the analysis as they do not represent
bisulfite treatment associated conversions. Mutation locations are 
stored in a `GRanges` object, and we can use that to remove CpGs 
overlapping with mutations. In order to do overlap operation, we will convert the methylKit object to a `GRanges` object and do the filtering with `%over%` function within `[ ]`. The returned object will still be a methylKit object.
```{r snps}
library(GenomicRanges)
mut=GRanges(seqnames=c("chr21","chr21"),
            ranges=IRanges(start=c(9853296, 9853326),
                           end=c( 9853296,9853326)))

# select CpGs that do not overlap with mutations
sub.meth=meth[! as(meth,"GRanges") %over% mut,]
nrow(meth)
nrow(sub.meth)
```

### Clustering samples
Clustering is used for grouping data points by their similarity. It is a general concept that can be achieved by many different algorithms. In the context of DNA methylation we are trying to find samples that are similar to each other. For example, if we sequenced three heart samples and 4 liver samples, we would expect liver samples will be more similar to each other than heart samples on the DNA methylation space. 

The following function will cluster the samples and draw a dendrogram.
It will use correlation distance which is $1-\rho$ , where $\rho$ is the correlation coefficient between two pairs of samples. The cluster tree will be drawn using the "ward" method. This specific variant uses a "bottom up" approach: each data point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. In Ward's method, two clusters are merged if the variance is minimized compared to other possible merge operations. This bottom up approach helps build the dendrogram showing the relationship between clusters. 
```{r}
clusterSamples(meth, dist="correlation", method="ward", plot=TRUE)
```

Setting the `plot=FALSE` will return a dendrogram object which can be manipulated by users or fed in to other user functions that can work with dendrograms.

```{r,message=FALSE}
hc = clusterSamples(meth, dist="correlation", method="ward", plot=FALSE)
```

### Principal component analysis 
Principal component analysis (PCA) is a mathematical transformation of possibly(correlated) variables into a number of uncorrelated variables called principal components. The resulting components from this transformation is defined in such a way that the first principal component has the highest variance and accounts for as most of the variability in the data. The following function will plot a scree plot for importance of components.

```{r}
PCASamples(meth, screeplot=TRUE)
```

We can also plot PC1 and PC2 axis and a scatter plot of our samples on those axis which will reveal how they cluster within these new dimensions. Similar to clustering dendrogram, we would like to see samples that are similar to be close to each other on the scatter plot. If they are not, it might indicate problems with the experiment such as batch effects.

```{r}
pc=PCASamples(meth,obj.return = TRUE, adj.lim=c(1,1))
```

In this case, we also returned an object from the plotting function. this is the output of R `prcomp` function, which includes loadings and eigen vectors which might be useful. You can also do your own PCA analysis using `percMethylation` and `prcomp`. In the case above, the methylation matrix is transponsed. This allows us to compare distances between samples on the PCA scatterplot. 

## Extracting interesting regions: segmentation and differential methylation
When analyzing DNA methylation data, we usually look for regions that are different than the rest of the methylome or different from a reference methylome. These regions are so called "interesting regions". They usually mark important genomic features that are related to gene regulation which in turn defines the cell type. Therefore, it is a general interest to find such regions and analyze them further to understand our biological sample or to answer specific research questions. Below we will describe two ways of defining "regions of interest".


### Differential methylation
Once methylation proportions per base are obtained, generally, the differences between methylation profiles are considered next. When there are multiple sample groups where each group defines a separate biological entity or treatment, it is usually of interest to locate bases or regions with different methylation proportions across the sample groups. The bases or regions with different methylation proportions across samples are called differentially methylated CpG sites (DMCs) and differentially methylated regions (DMRs). They have been shown to play a role in many different diseases due to their association with epigenetic control of gene regulation. In addition, DNA methylation profiles can be highly tissue-specific due to their role in gene regulation [@Schubeler2015-ai]. DNA methylation is highly informative when studying normal and diseased cells, because it can also act as a biomarker. For example, the presence of large-scale abnormally methylated genomic regions is a hallmark feature of many types of cancers [@Ehrlich2002-hv]. Because of aforementioned reasons, investigating differential methylation is usually one of the primary goals of doing bisulfite sequencing.

#### Fisher's exact test
Differential DNA methylation is usually calculated by comparing the proportion of methylated Cs in a test sample relative to a control. In simple comparisons between such pairs of samples (i.e. test and control), methods such as Fisher’s Exact Test can be used. if there are replicates, replicates can be pooled within groups to a single sample per group. This strategy, however, does not take into account biological variability between replicates. We will now show how to compare pairs of samples via `calculateDiffMeth` function in `methylKit`. When there are only one sample per sample group, `calculateDiffMeth` automatically applies Fisher's exact test. We will not extract one sample from each group and run `calculateDiffMeth`, which will automatically run Fisher's exact test.
```{r fishers,eval=FALSE}
getSampleID(meth)
new.meth=reorganize(meth,sample.ids=c("test1","ctrl1"),treatment=c(1,0))
dmf=calculateDiffMeth(new.meth)
```
As mentioned, we can also pool the samples from the same group by adding up the number of Cs and Ts per group. This way even if we have replicated experiments we treat them as single experiments, and can apply Fisher's exact test. We will now pool the samples and apply
```{r pool}
pooled.meth=pool(meth,sample.ids=c("test","control"))
dm.pooledf=calculateDiffMeth(pooled.meth)

```

`calculateDiffMeth` function returns the P-values for all bases or regions in the input methylBase object. We need to filter to get differentially methylated CpGs. This can be done via `getMethlyDiff` function or simple filtering via `[ ]` notation. Below we show how to filter the `methylDiff` object output by `calculateDiffMeth()` function in order to get differentially methylated CpGs. The function arguments defines cutoff values for  the methylation difference between groups and Q-value. In these cases, we require a methylation difference of 25% and Q-value of at least $0.01$.

```{r filter}
# get differentially methylated bases/regions with specific cutoffs
all.diff=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="all")

# get hyper-methylated
hyper=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="hyper")

# get hypo-methylated
hypo=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="hypo")

#using [ ] notation
hyper2=dm.pooledf[dm.pooledf$qvalue < 0.01 & dm.pooledf$meth.diff > 25,]
```

#### Logistic regression based tests
Regression-based methods are generally used to model methylation levels in relation to the sample groups and variation between replicates. Differences between currently available regression methods stem from the choice of distribution to model the data and the variation associated with it. In the simplest case, linear regression can be used to model methylation per given CpG or loci across sample groups. The model fits regression coefficients to model the expected methylation proportion values for each CpG site across sample groups. Hence, the null hypothesis of the model coefficients being zero could be tested using t-statistics. However, linear regression based methods might produce fitted methylation levels outside the range [0,1] unless the values are transformed before regression. An alternative is logistic regression, which can deal with data strictly bounded between 0 and 1 and with non-constant variance, such as methylation proportion/fraction values. In the logistic regression, it is assumed that fitted values have variation $np(1-p)$, where $p$ is the fitted methylation proportion for a given sample and n is the read coverage. If the observed variance is larger or smaller than assumed by the model, one speaks of under- or over-dispersion. This over/under-dispersion can be corrected by calculating a scaling factor and using that factor to adjust the variance estimates as in $np(1-p)s$, where $s$ is the scaling factor. MethylKit can apply logistic regression to test the methylation difference with or without the over-dispersion correction. In this case, Chi-square or F-test can be used to compare the difference in the deviances of the null model and the alternative model. The null model assumes there is no relationship between sample groups and methylation, and the alternative model assumes that there is a relationship where sample groups are predictive of methylation values for a given CpG or region for which the model is constructed. Next, we are going to use the logistic regression based model with over-dispersion correction and Chi-square test. 
```{r logReg}
dm.lr=calculateDiffMeth(meth,overdispersion = "MN",test ="Chisq")
```

#### Betabinomial distribution based tests
More complex regression models use beta binomial distribution and are particularly useful for better modeling the variance. Similar to logistic regression, their observation follows binomial distribution (number of reads), but methylation proportion itself can vary across samples, according to a beta distribution. It can deal with fitting values in [0,1] range and performs better when there is greater variance than expected by the simple logistic model. In essence, these models have a different way of calculating a scaling factor when there is over-dispersion in the model. Further enhancements are made to these models by using the Empirical Bayes methods that can better estimate hyper parameters of the beta distribution (variance-related parameters) by borrowing information between loci or regions within the genome to aid with inference about each individual loci or region. We are now going to use a beta-binomial based model called DSS[@Feng2014-pd] to calculate differential methylation.
```{r dss}
dm.dss=calculateDiffMethDSS(meth)
```

#### Differential methylation for regions rather than base-pairs
Until now, we worked on differentially methylated cytosines. However,
working with base-pair resolution data has its problems. Not all the CpGs will be covered in all samples, if covered they may have low coverage which reduces the power of the tests. Instead of base-pairs, we can choose to work with regions. So, it might be desirable to summarize methylation information over pre-defined regions rather than doing base-pair resolution analysis. `methylKit` provides functionality to do such analysis. We can either tile the whole genome to tiles with predefined length, or we can use pre-defined regions such as promoters or CpG islands. This kind of regional analysis is carried out by adding up C and T counts from each covered cytosine and returning a total C and T count for each region.


The function below tiles the genome with windows 1000bp length and 1000bp step-size and summarizes the methylation information on those tiles. In this case, it returns a `methylRawList` object which can be fed into `unite` and `calculateDiffMeth` functions consecutively to get differentially methylated regions.
```{r,warning=FALSE}
tiles=tileMethylCounts(myobj,win.size=1000,step.size=1000)
head(tiles[[1]],3)
```

In addition, if we are interested in particular regions, we can also get those regions as methylKit objects after summarizing the methylation information as described above. The code below summarizes the methylation information over a given set of promoter regions and outputs a `methylRaw` or `methylRawList` object depending on the input. We are using the output of 
`genomation` functions used above to provide the locations of promoters. For regional summary  functions, we need to 
provide regions of interest as GRanges object.

```{r, eval=TRUE}
library(genomation)

# read the gene BED file
gene.obj=readTranscriptFeatures(system.file("extdata", "refseq.hg18.bed.txt", 
                                           package = "methylKit"))
promoters=regionCounts(myobj,gene.obj$promoters)

head(promoters[[1]])
```

In addition, it is possible to cluster DMCs based on their proximity and direction of differential methylation. This can be achieved by `methSeg` function in methylKit. We will see more about `methSeg` function in the following section.
But it can take the output of `getMethylDiff` function therefore can work on DMCs to get differentially methylated regions.

#### Adding covariates
Covariates can be included in the analysis as well in methylKit. The `calculateDiffMeth` function will then try to 
separate the influence of the covariates from the 
treatment effect via the logistic regression model. In this case, we will test
if full model (model with treatment and covariates) is better than the model with
the covariates only. If there is no effect due to the treatment (sample groups),
the full model will not explain the data better than the model with covariates
only. In `calculateDiffMeth`, this is achieved by
supplying the `covariates` argument in the format of a `data.frame`.
Below, we simulate methylation data and add make a `data.frame` for the age.
The data frame can include more columns, and those columns can also be 
`factor` variables. The row order of the data.frame should match the order
of samples in the `methylBase` object. Below we are showing an example
of this using a simulated data set where methylation values of CpGs will be affected by the age of the sample.

```{r,eval=FALSE}

covariates=data.frame(age=c(30,80,34,30,80,40))
sim.methylBase=dataSim(replicates=6,sites=1000,
                        treatment=c(rep(1,3),rep(0,3)),
                        covariates=covariates,
                        sample.ids=c(paste0("test",1:3),paste0("ctrl",1:3)))

my.diffMeth3=calculateDiffMeth(sim.methylBase,
                               covariates=covariates,
                               overdispersion="MN",
                               test="Chisq",mc.cores=1)

                        
```

### Methylation segmentation
The analysis of methylation dynamics is not exclusively restricted to differentially methylated regions across samples, apart from this there is also an interest in examining the methylation profiles within the same sample. Usually, depressions in methylation profiles pinpoint regulatory regions like gene promoters that co-localize with CG-dense CpG islands. On the other hand, many gene-body regions are extensively methylated and CpG-poor [@Bock2012-oh]. These observations would describe a bimodal model of either hyper- or hypomethylated regions dependent on the local density of CpGs [@Lovkvist2016-ky]. However, given the detection of CpG-poor regions with locally reduced levels of methylation (on average 30%) in pluripotent embryonic stem cells and in neuronal progenitors in both mouse and human, a different model seems also reasonable [@Stadler2011-iu]. These low-methylated regions (LMRs) are located distal to promoters, have little overlap with CpG islands and associated with enhancer marks such as p300 binding sites and H3K27ac enrichment. 

Now we are going to try to segment portion for the H1 human embryonic stem cell line. MethylKit uses change-point analysis to segment the methylome. In change-point analysis, the change-points of a genome-wide methylation signal are recorded and the genome is partitioned into regions between consecutive change points. CpGs in each segment is similar to eachoter more than the following segment. 
After segmentation, methylKit function `methSeg` identifies segments that are further clustered into segment classes using a mixture modeling approach. This clustering is based on only the average methylation level of the segments and allows the detection of distinct methylome features comparable to unmethylated regions (UMRs), lowly methylated regions (LMRs) and fully methylated regions (FMRs) mentioned at Stadler et. al (2012). The code snippet below reads the methylation data from H1 cell line as a `GRanges` object, and runs the segmentation with potentially up to classes of segments. Mixture modelling determines the optimal number of segments using a statistic called bayesian information criterion (BIC). BIC is a statistic based on model likelihood and helps us select the model that fits the data better. We have set the number of segment classes to try using `G=1:4` argument.The `minSeg` arguments are related to minimum number of CpGs in the segments.
```{r }
# read methylation data

methFile=system.file("extdata","H1.chr21.chr22.rds",
                     package="compGenomRData")
mbw=readRDS(methFile)

# segment the methylation data
res=methSeg(mbw,minSeg=10,G=1:4,
            join.neighbours = TRUE)

```

In this case, we know that BIC does not improve much after 4 segment classes. Now, we will not have a look at the characteristics of the segment classes. We are going to plot mean methylation value and the length of the segment as a scatter plot.
```{r segplot}
# plot 
plot(res$seg.mean,
     log10(width(res)),pch=20,
     col=scales::alpha(rainbow(4)[as.numeric(res$seg.group)], 0.2),
     ylab="log10(length)",
     xlab="methylation proportion")

```

The highly methylated segment classes that have more than 70% methylation are usually longer, median length is 17889 bp. The segment class that has the lowest methylation values have median length of 1376 bp and the shortest segment class has low to medium methylation level, has median length of 412 bp.


### Working with large files 
We might want to perform differential methylation analysis in R using whole genome methylation data of multiple samples. The problem is that for genome-wide experiments, file sizes can easily range from hundreds of megabytes to gigabytes and processing multiple instances of those files in memory (RAM) might become unfeasible unless we have access to a high performance cluster (HPC) with extensive RAM. If we want to use a desktop computer or laptop with limited RAM, we either need to restrict our analysis to a subset of the data or use packages that can handle this situation. 

The methylKit package provides the capability of dealing large files and high number of samples by exploiting flat file databases to substitute in-memory objects. The internal data apart from meta information has a tabular structure storing chromosome, start/end position, strand information of the associated CpG base just like many other biological formats like BED, GFF or SAM. By exporting this tabular data into a TAB-delimited file and making sure it is accordingly position-sorted it can be indexed using the generic [tabix tool](http://www.htslib.org/doc/tabix.html). In general tabix indexing is a generalization of BAM indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query [@Li2011-wc]. MethylKit relies on [Rsamtools] (http://bioconductor.org/packages/release/bioc/html/Rsamtools.html) which implements tabix functionality for R and this way internal methylKit objects can be efficiently stored as compressed file on the disk and still be fast accessed. Another advantage is that existing compressed files can be loaded in interactive sessions, allowing the backup and transfer of intermediate analysis results.

methylKit provides the capability for storing objects in tabix format within various functions. Every methylKit object has their tabix-based flat-file database equivalent. For example, when reading a methylation call file the `dbtype` argument can be provided, this will create tabix based objects.
```{r tabix,eval=FALSE}
 myobj=methRead( file.list,
               sample.id=list("test1","test2","ctrl1","ctrl2"),
               assembly="hg18",treatment=c(1,1,0,0),
               dbtype="tabix") 
```
The advantage of tabix-based objects is of course saving memory and more efficient parallelization for differential methylation calculation. However, since the data is written to a file and indexed whenever a new object is created, working with tabix-based objects will be slower at certain steps of the analysis compared to in-memory objects.

## Annotation of DMRs/DMCs and segments
The regions of interest obtained through differential methylation or segmentation analysis often need to be integrated with genome annotation datasets. Without this type of integration, differential methylation or segmentation results will be hard to interpret in biological terms. The most common annotation task is to see where regions of interest land in relation to genes and gene parts and regulatory regions: Do they mostly occupy promoter, intronic or exonic regions? Do they overlap with repeats? Do they overlap with other epigenomic markers or long-range regulatory regions? These questions are not specific to methylation −nearly all regions of interest obtained via genome-wide studies have to deal with such questions. Thus, there are already multiple software tools that can produce such annotations. One is the Bioconductor package [genomation](http://bioconductor.org/packages/release/bioc/html/genomation.html)[@Akalin2015-yk]. It can be used to annotate DMRs/DMCs and it can also be used to integrate methylation proportions over the genome with other quantitative information and produce meta-gene plots or heatmaps. Below, we are reading a BED file for transcripts and using that to annotate DMCs with promoter/intron/exon/intergenic annotation.`genomation::readTranscriptFeatures` function reads a BED12 file, calculates the coordinates of promoters, exons and introns and the subsequent function uses that information for annotation.

```{r}
library(genomation)

# read the gene BED file
transcriptBED=system.file("extdata", "refseq.hg18.bed.txt", 
                                           package = "methylKit")
gene.obj=readTranscriptFeatures(transcriptBED)
#
# annotate differentially methylated CpGs with 
# promoter/exon/intron using annotation data
#
annotateWithGeneParts(as(all.diff,"GRanges"),gene.obj)
```

Similarly, we can read the CpG island annotation and annotate our differentially methylated bases/regions with them.

```{r}
# read the shores and flanking regions and name the flanks as shores 
# and CpG islands as CpGi
cpg.file=system.file("extdata", "cpgi.hg18.bed.txt", 
                                        package = "methylKit")
cpg.obj=readFeatureFlank(cpg.file,
                           feature.flank.name=c("CpGi","shores"))
#
# convert methylDiff object to GRanges and annotate
diffCpGann=annotateWithFeatureFlank(as(all.diff,"GRanges"),
                                    cpg.obj$CpGi,cpg.obj$shores,
                         feature.name="CpGi",flank.name="shores")
```

## Other R packages that can be used for methylation analysis
- [DSS](http://bioconductor.org/packages/release/bioc/html/genomation.html) beta-binomial models with Empirical Bayes for moderating dispersion.
- [BSseq](http://bioconductor.org/packages/release/bioc/html/BSseq.html) Regional differential methylation analysis using smoothing and linear regression based tests.
- [BiSeq](http://bioconductor.org/packages/release/bioc/html/BiSeq.html) Regional differential methylation analysis using beta-binomial models
- [MethylSeekR](http://bioconductor.org/packages/release/bioc/html/MethylSeekR.html): Methylome segmentation using HMM and cutoffs.
- [QuasR](http://bioconductor.org/packages/release/bioc/html/QuasR.html): Methylation aware alignment and methylation calling. As well as fastQC-like fastq raw data quality check features.

## Exercises

### Exercise 1
The main objective of this exercise is getting differential methylated cytosines between two groups of samples: IDH-mut (AML patients with IDH mutations) vs NBM (normal bone marrow samples).

- Download methylation call files from GEO. These files are readable by methlKit using default `methRead` arguments. 

samples     Link      
-------     ------  
IDH1_rep1    [link](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM919990&format=file&file=GSM919990%5FIDH%2Dmut%5F1%5FmyCpG%2Etxt%2Egz) 
IDH1_rep2    [link](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM919991&format=file&file=GSM919991%5FIDH%5Fmut%5F2%5FmyCpG%2Etxt%2Egz) 
NBM_rep1     [link](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM919982&format=file&file=GSM919982%5FNBM%5F1%5FmyCpG%2Etxt%2Egz)
NBM_rep2     [link](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM919984&format=file&file=GSM919984%5FNBM%5F2%5FRep1%5FmyCpG%2Etxt%2Egz)


- Find differentially methylated cytosines. Use chr1 and chr2 only if you need to save time. You can subset it after you download the files either in R or unix. The files are for hg18 assembly of human genome.
- Describe the differential methylation trend, what is the main effect ? 
- Annotate differentially methylated cytosines (DMCs) promoter/intron/exon ?
- Which genes are the nearest to DMCs ? Can you do gene set analysis either in R or via web-based tools ?

Example code for reading a file:
```{r, eval=FALSE}
library(methylKit)
m=methRead("~/Downloads/GSM919982_NBM_1_myCpG.txt.gz",sample.id = "idh",assembly="hg18")
```

### Exercise 2
The main objective of this exercise is to learn how to do methylome segmentation and the downstream analysis for annotation and data integration.

- Download the human embryonic stem-cell (H1 Cell Line) methylation bigWig files from [Roadmap Epigenomics website](http://egg2.wustl.edu/roadmap/web_portal/processed_data.html#MethylData). It may take a while to understand how the website is structured and which bigWig file to use. That is part of the exercise. The files you will download are for hg19 assembly unless stated otherwise.
- Do segmentation on hESC methylome. you can only do chr1 if it takes too much time. 
- Annotate segments, what kind of gene-based features each segment class overlaps with (promoter/exon/intron)
- For each segment type, annotate the segments with chromHMM annotations from Roadmap Epigenome database available (here)[https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state], the specific file you should use is (here)[https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E003_15_coreMarks_mnemonics.bed.gz]. This is a bed file with chromHMM annotations. chromHMM annotations are parts of the genome identified by a hidden-markov-model based machine-learning algorithm. The segments correspond to active promoters, enhancers, active transcription, insulators. etc. The chromHMM model uses histone modification ChIP-seq and potentially other ChIP-seq data sets to annotate the genome.

## References