forked from compgenomr/book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path02-intro2R.Rmd
1325 lines (1009 loc) · 51.8 KB
/
02-intro2R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Introduction to R for Genomic Data Analysis {#Rintro}
The aim of computational genomics is to provide biological interpretation and insights from high
dimensional genomics data. Generally speaking, it is similar to any other kind
of data analysis endeavor but often times doing computational genomics will require domain specific knowledge and tools.
As new high-throughput experimental techniques are on the rise, data analysis
capabilities are sought-after features for researchers. The aim of this chapter is to first familiarize the readers with data analysis steps and then provide basics of R programming within the context of genomic data analysis. R is a free statistical programming language that is popular among researchers and data miners to build software and analyze data. Although
basic R programming tutorials are easily accessible, we are aiming to introduce
the subject with the genomic context in the background. The examples and
narrative will always be from real-life situations when you try to analyze
genomic data with R. We believe tailoring material to the context of genomics
makes a difference when learning this programming language for sake of analyzing
genomic data.
## Steps of (genomic) data analysis
Regardless of the analysis type, the data analysis has a common pattern. We will
discuss this general pattern and how it applies to genomics problems. The data analysis steps typically include data collection, quality check and cleaning, processing, modeling, visualization and reporting. Although, one expects to go through these steps in a linear fashion, it is normal to go back and repeat the steps with different parameters or tools. In practice, data analysis requires going through the same steps over and over again in order to be able to do a combination of the following: a) answering other related questions, b) dealing with data quality issues that are later realized, and, c) including new data sets to the analysis.
We will now go through a brief explanation of the steps within the context of genomic data analysis.
### Data collection
Data collection refers to any source, experiment or survey that provides data for the data analysis question you have. In genomics, data collection is done by high-throughput assays introduced in chapter \@ref(intro). One can also use publicly available data sets and specialized databases also mentioned in chapter \@ref(intro). How much data and what type of data you should collect depends on the question you are trying to answer and the technical and biological variability of the system you are studying.
### Data quality check and cleaning
In general, data analysis almost always deals with imperfect data. It is
common to have missing values or measurements that are noisy. Data quality check
and cleaning aims to identify any data quality issue and clean it from the dataset.
High-throughput genomics data is produced by technologies that could embed
technical biases into the data. If we were to give an example from sequencing,
the sequenced reads do not have the same quality of bases called. Towards the
ends of the reads, you could have bases that might be called incorrectly. Identifying those low quality bases and removing them will improve read mapping step.
### Data processing
This step refers to processing the data to a format that is suitable for
exploratory analysis and modeling. Often times, the data will not come in ready
to analyze format. You may need to convert it to other formats by transforming
data points (such as log transforming, normalizing etc), or subset the data set
with some arbitrary or pre-defined condition. In terms of genomics, processing
includes multiple steps. Following the sequencing analysis example above,
processing will include aligning reads to the genome and quantification over genes or regions of interest. This is simply counting how many reads are covering your regions of interest. This quantity can give you ideas about how much a gene is expressed if your experimental protocol was RNA sequencing. This can be followed by some normalization to aid the next step.
### Exploratory data analysis and modeling
This phase usually takes in the processed or semi-processed data and applies machine-learning or statistical methods to explore the data. Typically, one needs to see relationship between variables measured, relationship between samples based on the variables measured. At this point, we might be looking to see if the samples group as expected by the experimental design, are there outliers or any other anomalies ? After this step you might want to do additional clean up or re-processing to deal with anomalies.
Another related step is modeling. This generally refers to modeling your variable of interest based on other variables you measured. In the context of genomics, it could be that you are trying to predict disease status of the patients from expression of genes you measured from their tissue samples. Then your variable of interest is the disease status and . This is generally called predictive modeling and could be solved with regression based or any other machine-learning methods. This kind of approach is generally called "predictive modeling".
Statistical modeling would also be a part of this modeling step, this can cover predictive modeling as well where we use statistical methods such as linear regression. Other analyses such as hypothesis testing, where we have an expectation and we are trying to confirm that expectation is also related to statistical modeling. A good example of this in genomics is the differential gene expression analysis. This can be formulated as comparing two data sets, in this case expression values from condition A and condition B, with the expectation that condition A and condition B has similar expression values. You will see more on this in chapter \@ref(stats).
### Visualization and reporting
Visualization is necessary for all the previous steps more or less. But in the final phase, we need final figures, tables and text that describes the outcome of your analysis. This will be your report. In genomics, we use common data visualization methods as well as specific visualization methods developed or popularized by genomic data analysis. You will see many popular visualization methods in chapters \@ref(stats) and \@ref{genomicIntervals}
### Why use R for genomics ?
R, with its statistical analysis
heritage, plotting features and rich user-contributed packages is one of the
best languages for the task of analyzing genomic data.
High-dimensional genomics datasets are usually suitable to
be analyzed with core R packages and functions. On top of that, Bioconductor and CRAN have an
array of specialized tools for doing genomics-specific analysis. Here is a list of computational genomics tasks that can be completed using R.
#### Data cleanup and processing
Most of general data clean up, such as removing incomplete columns and values, reorganizing and transforming data, these tasks can be achieved using R. In addition, with the help of packages R can connect to databases in various formats such as mySQL, mongoDB, etc., and query and get the data to R environment using database specific tools.
On top of these, genomic data specific processing and quality check can be achieved via R/Bioconductor packages. For example, sequencing read quality checks and even HT-read alignments can be achieved via R packages.
#### General data anaylsis and exploration
Most genomics data sets are suitable for application of general data analysis tools. In some cases, you may need to preprocess the data to get it to a state that is suitable for application of such tools. Here is a non-exhaustive list of what kind of things can be done via R.
- unsupervised data analysis: clustering (k-means, hierarchical), matrix factorization
(PCA, ICA etc.)
- supervised data analysis: generalized linear models, support vector machines, random forests
#### Genomics-specific data analysis methods
R/Bioconductor gives you access to multitude of other bioinformatics specific algorithms. Here are some of the things you can do.
- Sequence analysis: TF binding motifs, GC content and CpG counts of a given DNA sequence
- Differential expression (or arrays and sequencing based measurements)
- Gene set/Pathway analysis: What kind of genes are enriched in my gene set
- Genomic Interval operations such as Overlapping CpG islands with transcription start sites, and filtering based on overlaps
- Overlapping aligned reads with exons and counting aligned reads per gene
#### Visualization
Visualization is an important part of all data analysis techniques including computational genomics. Again, you can use core visualization techniques in R and also genomics specific ones with the help of specific packages. Here are some of the things you can do with R.
- Basic plots: Histograms, scatter plots, bar plots, box plots, heatmaps
- ideograms and circos plots for genomics provides visualization of different features over the whole genome.
- meta-profiles of genomic features, such as read enrichment over all promoters
- Visualization of quantitative assays for given locus in the genome
## Getting started with R
Download and install R http://cran.r-project.org/ and RStudio http://www.rstudio.com/ if you do not have them already. Rstudio is optional but it is a great tool if you are just starting to learn R.
You will need specific data sets to run the codes in this document. Download the data.zip[URL to come] and extract it to your directory of choice. The folder name should be “data” and your R working directory should be level above the data folder. That means in your R console, when you type “dir(“data”)” you should be able to see the contents of the data folder. You can change your working directory by *setwd()* command and get your current working directory with *getwd()* command in R. In RStudio, you can click on the top menu and change the location of your working directory via user interface.
### Installing packages
R packages are add-ons to base R that help you achieve additional tasks that are not directly supported by base R. It is by the action of these extra functionality that R excels as a tool for computational genomics. Bioconductor project (http://bioconductor.org/) is a dedicated package repository for computational biology related packages. However main package repository of R, called CRAN, has also computational biology related packages. In addition, R-Forge(http://r-forge.r-project.org/), GitHub(https://github. com/), and googlecode(http://code.google.com) are other locations where R packages might be hosted.
You can install CRAN packages using install.packages(). (# is the comment character in R)
```{r,eval=FALSE}
# install package named "randomForests" from CRAN
install.packages("randomForests")
```
You can install bioconductor packages with a specific installer script
```{r,eval=FALSE}
# get the installer package
source("http://bioconductor.org/biocLite.R")
# install bioconductor package "rtracklayer"
biocLite("rtracklayer")
```
You can install packages from github using *install_github()* function from **devtools**
```{r,eval=FALSE}
library(devtools)
install_github("hadley/stringr")
```
Another way to install packages are from the source.
```{r,eval=FALSE}
# download the source file
download.file("http://goo.gl/3pvHYI",
destfile="methylKit_0.5.7.tar.gz")
# install the package from the source file
install.packages("methylKit_0.5.7.tar.gz",
repos=NULL,type="source")
# delete the source file
unlink("methylKit_0.5.7.tar.gz")
```
You can also update CRAN and Bioconductor packages.
```{r,eval=FALSE}
# updating CRAN packages
update.packages()
# updating bioconductor packages
source("http://bioconductor.org/biocLite.R")
biocLite("BiocUpgrade")
```
### Installing packages in custom locations
If you will be using R on servers or computing clusters rather than your personal computer it is unlikely that you will have administrator access to install packages. In that case, you can install packages in custom locations by telling R where to look for additional packages. This is done by setting up an .Renviron file in your home directory and add the following line:
```
R_LIBS=~/Rlibs
```
This tells R that “Rlibs” directory at your home directory will be the first choice of locations to look for packages and install packages (The directory name and location is up to you above is just an example). You should go and create that directory now. After that, start a fresh R session and start installing packages. From now on, packages will be installed to your local directory where you have read-write access.
### Getting help on functions and packages
You can get help on functions by help() and help.search() functions. You can list the functions in a package with ls() function
```{r,eval=FALSE}
library(MASS)
ls("package:MASS") # functions in the package
ls() # objects in your R enviroment
# get help on hist() function
?hist
help("hist")
# search the word "hist" in help pages
help.search("hist")
??hist
```
#### More help needed?
In addition, check package vignettes for help and practical understanding of the functions. All Bionconductor packages have vignettes that walk you through example analysis. Google search will always be helpful as well, there are many blogs and web pages that have posts about R. R-help, Stackoverflow and R-bloggers are usually source of good and reliable information.
## Computations in R
R can be used as an ordinary calculator, some say it is an over-grown calculator. Here are some examples. Remember **#** is the comment character. The comments give details about the operations in case they are not clear.
```r{}
2 + 3 * 5 # Note the order of operations.
log(10) # Natural logarithm with base e
5^2 # 5 raised to the second power
3/2 # Division
sqrt(16) # Square root
abs(3-7) # Absolute value of 3-7
pi # The number
exp(2) # exponential function
# This is a comment line
```
## Data structures
R has multiple data structures. If you are familiar with excel you can think of a single excel sheet as a table and data structures as building blocks of that table. Most of the time you will deal with tabular data sets or you will want to transform your raw data to a tabular data set, and you will try to manipulate this tabular data set in some way. For example, you may want to take sub-sections of the table or extract all the values in a column. For these and similar purposes, it is essential to know what are the common data structures in R and how they can be used. R deals with named data structures, this means you can give names to data structures and manipulate or operate on them using those names. It will be clear soon what we mean by this if "named data structures" does not ring a bell.
### Vectors
Vectors are one the core R data structures. It is basically a list of elements of the same type (numeric,character or logical). Later you will see that every column of a table will be represented as a vector. R handles vectors easily and intuitively. You can create vectors with c() function, however that is not the only way. The operations on vectors will propagate to all the elements of the vectors.
```{r}
x<-c(1,3,2,10,5) #create a vector named x with 5 components
x = c(1,3,2,10,5)
x
y<-1:5 #create a vector of consecutive integers y
y+2 #scalar addition
2*y #scalar multiplication
y^2 #raise each component to the second power
2^y #raise 2 to the first through fifth power
y #y itself has not been unchanged
y<-y*2
y #it is now changed
r1<-rep(1,3) # create a vector of 1s, length 3
length(r1) #length of the vector
class(r1) # class of the vector
a<-1 # this is actually a vector length one
```
### Matrices
A matrix refers to a numeric array of rows and columns. You can think of it as a stacked version of vectors where each row or column is a vector. One of the easiest ways to create a matrix is to combine vectors of equal length using *cbind()*, meaning 'column bind'.
```{r}
x<-c(1,2,3,4)
y<-c(4,5,6,7)
m1<-cbind(x,y);m1
t(m1) # transpose of m1
dim(m1) # 2 by 5 matrix
```
You can also directly list the elements and specify the matrix:
```{r}
m2<-matrix(c(1,3,2,5,-1,2,2,3,9),nrow=3)
m2
```
Matrices and the next data structure **data frames** are tabular data structures. You can subset them using **[]** and providing desired rows and columns to subset. Figure \@ref(fig:slicingDataFrames) shows how that works conceptually.
```{r,slicingDataFrames,fig.cap="slicing/subsetting of a matrix and a data frame",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/slicingDataFrames.png" )
```
### Data Frames
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). A data frame can be constructed by data.frame() function. For example, we illustrate how to construct a data frame from genomic intervals or coordinates.
```{r}
chr <- c("chr1", "chr1", "chr2", "chr2")
strand <- c("-","-","+","+")
start<- c(200,4000,100,400)
end<-c(250,410,200,450)
mydata <- data.frame(chr,start,end,strand)
#change column names
names(mydata) <- c("chr","start","end","strand")
mydata # OR this will work too
mydata <- data.frame(chr=chr,start=start,end=end,strand=strand)
mydata
```
There are a variety of ways to extract the elements of a data frame. You can extract certain columns using column numbers or names, or you can extract certain rows by using row numbers. You can also extract data using logical arguments, such as extracting all rows that has a value in a column larger than your threshold.
```{r}
mydata[,2:4] # columns 2,3,4 of data frame
mydata[,c("chr","start")] # columns chr and start from data frame
mydata$start # variable start in the data frame
mydata[c(1,3),] # get 1st and 3rd rows
mydata[mydata$start>400,] # get all rows where start>400
```
### Lists
An ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.
```{r}
# example of a list with 4 components
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred",
mynumbers=c(1,2,3),
mymatrix=matrix(1:4,ncol=2),
age=5.3)
w
```
You can extract elements of a list using the **[[]]** convention using either its position in the list or its name.
```{r}
w[[3]] # 3rd component of the list
w[["mynumbers"]] # component named mynumbers in list
w$age
```
### Factors
Factors are used to store categorical data. They are important for statistical modeling since categorical variables are treated differently in statistical models than continuous variables. This ensures categorical data treated accordingly in statistical models.
```{r}
features=c("promoter","exon","intron")
f.feat=factor(features)
```
Important thing to note is that when you are reading a data.frame with read.table() or creating a data frame with **data.frame()** character columns are stored as factors by default, to change this behavior you need to set **stringsAsFactors=FALSE** in **read.table()** and/or **data.frame()** function arguments.
## Data types
There are four common data types in R, they are **numeric**, **logical**, **character** and **integer**. All these data types can be used to create vectors natively.
```{r}
#create a numeric vector x with 5 components
x<-c(1,3,2,10,5)
x
#create a logical vector x
x<-c(TRUE,FALSE,TRUE)
x
# create a character vector
x<-c("sds","sd","as")
x
class(x)
# create an integer vector
x<-c(1L,2L,3L)
x
class(x)
```
## Reading and writing data
Most of the genomics data sets are in the form of genomic intervals associated with a score. That means mostly the data will be in table format with columns denoting chromosome, start positions, end positions, strand and score. One of the popular formats is BED format used primarily by UCSC genome browser but most other genome browsers and tools will support BED format. We have all the annotation data in BED format. In R, you can easily read tabular format data with read.table() function.
```{r}
enhancerFilePath=system.file("extdata",
"subset.enhancers.hg18.bed",
package="compGenomRData")
cpgiFilePath=system.file("extdata",
"subset.cpgi.hg18.bed",
package="compGenomRData")
# read enhancer marker BED file
enh.df <- read.table(enhancerFilePath, header = FALSE)
# read CpG island BED file
cpgi.df <- read.table(cpgiFilePath, header = FALSE)
# check first lines to see how the data looks like
head(enh.df)
head(cpgi.df)
```
You can save your data by writing it to disk as a text file. A data frame or matrix can be written out by using write.table() function. Now let us write out cpgi.df, we will write it out as a tab-separated file, pay attention to the arguments.
```{r,tidy=FALSE,eval=FALSE}
write.table(cpgi.df,file="cpgi.txt",quote=FALSE,
row.names=FALSE,col.names=FALSE,sep="\t")
```
You can save your R objects directly into a file using save() and saveRDS() and load them back in with load() and readRDS(). By using these functions you can save any R object whether or not they are in data frame or matrix classes.
```{r,eval= FALSE}
save(cpgi.df,enh.df,file="mydata.RData")
load("mydata.RData")
# saveRDS() can save one object at a type
saveRDS(cpgi.df,file="cpgi.rds")
x=readRDS("cpgi.rds")
head(x)
```
One important thing is that with `save()` you can save many objects at a time and when they are loaded into memory with `load(` they retain their variable names. For example, in the above code when you use `load("mydata.RData")` in a fresh R session, an object names “cpg.df” will be created. That means you have to figure out what name you gave it to the objects before saving them. On the contrary to that, when you save an object by `saveRDS()` and read by `readRDS()` the name of the object is not retained, you need to assign the output of `readRDS()` to a new variable (“x” in the above code chunk).
## Plotting in R
R has great support for plotting and customizing plots. We will show only a few below. Let us sample 50 values from normal distribution and plot them as a histogram.
```{r,out.width='65%',fig.width=6.5}
# sample 50 values from normal distribution
# and store them in vector x
x<-rnorm(50)
hist(x) # plot the histogram of those values
```
We can modify all the plots by providing certain arguments to the plotting function. Now let's give a title to the plot using **'main'** argument. We can also change the color of the bars using **'col'** argument. You can simply provide the name of the color. Below, we are using **'red'** for the color. See Figure below for the result this chunk.
```{r,out.width='65%',fig.width=6.5}
hist(x,main="Hello histogram!!!",col="red")
```
Next, we will make a scatter plot. Scatter plots are one the most common plots you will encounter in data analysis. We will sample another set of 50 values and plotted those against the ones we sampled earlier. Scatterplot shows values of two variables for a set of data points. It is useful to visualize relationships between two variables. It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show density of the points with different colors. We will show examples of those that in following chapters. The scatter plot from our sampling experiment is shown in the figure. Notice that, in addition to main we used **“xlab”** and **“ylab”** arguments to give labels to the plot. You can customize the plots even more than this. See **?plot** and **?par** for more arguments that can help you customize the plots.
```{r,out.width='65%',fig.width=5.5}
# randomly sample 50 points from normal distribution
y<-rnorm(50)
#plot a scatter plot
# control x-axis and y-axis labels
plot(x,y,main="scatterplot of random samples",
ylab="y values",xlab="x values")
```
we can also plot boxplots for vectors x and y. Boxplots depict groups of numerical data through their quartiles. The edges of the box denote 1st and 3rd quartile, and the line that crosses the box is the median. Whiskers usually are defined using interquantile range:
*lowerWhisker=Q1-1.5[IQR] and upperWhisker=Q1+1.5[IQR]*
In addition, outliers can be depicted as dots. In this case, outliers are the values that remain outside the whiskers.
```{r,out.width='65%',fig.width=5.5}
boxplot(x,y,main="boxplots of random samples")
```
Next up is bar plot which you can plot by **barplot()** function. We are going to plot four imaginary percentage values and color them with two colors, and this time we will also show how to draw a legend on the plot using **legend()** function.
```{r,,out.width='65%',fig.width=5.5,tidy=FALSE}
perc=c(50,70,35,25)
barplot(height=perc,
names.arg=c("CpGi","exon","CpGi","exon"),
ylab="percentages",main="imagine %s",
col=c("red","red","blue","blue"))
legend("topright",legend=c("test","control"),
fill=c("red","blue"))
```
## Saving plots
If you want to save your plots to an image file there are couple of ways of doing that. Normally, you will have to do the following:
1. Open a graphics device
2. Create the plot
3. Close the graphics device
```{r,eval=FALSE}
pdf("mygraphs/myplot.pdf",width=5,height=5)
plot(x,y)
dev.off()
```
Alternatively, you can first create the plot then copy the plot to a graphic device.
```{r,eval=FALSE}
plot(x,y)
dev.copy(pdf,"mygraphs/myplot.pdf",width=7,height=5)
dev.off()
```
## Functions and control structures (for, if/else etc.)
### User defined functions
Functions are useful for transforming larger chunks of code to re-usable pieces of code. Generally, if you need to execute certain tasks with variable parameters then it is time you write a function. A function in R takes different arguments and returns a definite output, much like mathematical functions. Here is a simple function takes two arguments, x and y, and returns the sum of their squares.
```{r}
sqSum<-function(x,y){
result=x^2+y^2
return(result)
}
# now try the function out
sqSum(2,3)
```
Functions can also output plots and/or messages to the terminal. Here is a function that prints a message to the terminal:
```{r}
sqSumPrint<-function(x,y){
result=x^2+y^2
cat("here is the result:",result,"\n")
}
# now try the function out
sqSumPrint(2,3)
```
Sometimes we would want to execute a certain part of the code only if certain condition is satisfied. This condition can be anything from the type of an object (Ex: if object is a matrix execute certain code), or it can be more complicated such as if object value is between certain thresholds. Let us see how they can be used3. They can be used anywhere in your code, now we will use it in a function.
```{r,eval=FALSE}
cpgi.df <- read.table("intro2R_data/data/subset.cpgi.hg18.bed", header = FALSE)
# function takes input one row
# of CpGi data frame
largeCpGi<-function(bedRow){
cpglen=bedRow[3]-bedRow[2]+1
if(cpglen>1500){
cat("this is large\n")
}
else if(cpglen<=1500 & cpglen>700){
cat("this is normal\n")
}
else{
cat("this is short\n")
}
}
largeCpGi(cpgi.df[10,])
largeCpGi(cpgi.df[100,])
largeCpGi(cpgi.df[1000,])
```
### Loops and looping structures in R
When you need to repeat a certain task or a execute a function multiple times, you can do that with the help of loops. A loop will execute the task until a certain condition is reached. The loop below is called a “for-loop” and it executes the task sequentially 10 times.
```{r}
for(i in 1:10){ # number of repetitions
cat("This is iteration") # the task to be repeated
print(i)
}
```
The task above is a bit pointless, normally in a loop, you would want to do something meaningful. Let us calculate the length of the CpG islands we read in earlier. Although this is not the most efficient way of doing that particular task, it serves as a good example for looping. The code below will be execute hundred times, and it will calculate the length of the CpG islands for the first 100 islands in
the data frame (by subtracting the end coordinate from the start coordinate).
**Note:**If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it for thousands of times.
```{r}
# this is where we will keep the lenghts
# for now it is an empty vector
result=c()
# start the loop
for(i in 1:100){
#calculate the length
len=cpgi.df[i,3]-cpgi.df[i,2]+1
#append the length to the result
result=c(result,len)
}
# check the results
head(result)
```
#### apply family functions instead of loops
R has other ways of repeating tasks that tend to be more efficient than using loops. They are known as the **“apply”** family of functions, which include *apply, lapply, mapply and tapply* (and some other variants). All of these functions apply a given function to a set of instances and returns the result of those functions for each instance. The differences between them is that they take different type of inputs. For example apply works on data frames or matrices and applies the function on each row or column of the data structure. *lapply* works on lists or vectors and applies a function which takes the list element as an argument. Next we will demonstrate how to use **apply()** on a matrix. The example applies the sum function on the rows of a matrix, it basically sums up the values on each row of the matrix, which is conceptualized in Figure \@ref(fig:applyConcept).
```{r,applyConcept,fig.cap="apply concept in R",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/apply.png" )
```
```{r}
mat=cbind(c(3,0,3,3),c(3,0,0,0),c(3,0,0,3),c(1,1,0,0),c(1,1,1,0),c(1,1,1,0))
result<-apply(mat,1,sum)
result
# OR you can define the function as an argument to apply()
result<-apply(mat,1,function(x) sum(x))
result
```
Notice that we used a second argument which equals to 1, that indicates that rows of the matrix/ data frame will be the input for the function. If we change the second argument to 2, this will indicate that columns should be the input for the function that will be applied. See Figure \@ref(fig:applyConcept2) for the visualization of apply() on columns.
```{r,applyConcept2,fig.cap="apply function on columns",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/apply2.png" )
```
```{r}
result<-apply(mat,2,sum)
result
```
Next, we will use **lapply()**, which applies a function on a list or a vector. The function that will be applied is a simple function that takes the square of a given number.
```{r}
input=c(1,2,3)
lapply(input,function(x) x^2)
```
**mapply()** is another member of apply family, it can apply a function on an unlimited set of vectors/lists, it is like a version of lapply that can handle multiple vectors as arguments. In this case, the argument to the mapply() is the function to be applied and the sets of parameters to be supplied as arguments of the function. This conceptualized Figure \@ref(fig:mapplyConcept), the function to be applied is a function that takes to arguments and sums them up. The arguments to be summed up are in the format of vectors, Xs and Ys. `mapply()` applies the summation function to each pair in Xs and Ys vector. Notice that the order of the input function and extra arguments are different for mapply.
```{r,mapplyConcept,fig.cap="mapply concept",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/mapply.png" )
```
```{r}
Xs=0:5
Ys=c(2,2,2,3,3,3)
result<-mapply(function(x,y) sum(x,y),Xs,Ys)
result
```
#### apply family functions on multiple cores
If you have large data sets apply family functions can be slow (although probably still better than for loops). If that is the case, you can easily use the parallel versions of those functions from parallel package. These functions essentially divide your tasks to smaller chunks run them on separate CPUs and merge the results from those parallel operations. This concept is visualized at Figure below , `mcapply` runs the summation function on three different processors. Each processor executes the summation function on a part of the data set, and the results are merged and returned as a single vector that has the same order as the input parameters Xs and Ys.
```{r,mcapplyConcept,fig.cap="mcapplyconcept",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/mcmapply.png" )
```
#### Vectorized Functions in R
The above examples have been put forward to illustrate functions and loops in R because functions using sum() are not complicated and easy to understand. You will probably need to use loops and looping structures with more complicated functions. In reality, most of the operations we used do not need the use of loops or looping structures because there are already vectorized functions that can achieve the same outcomes, meaning if the input arguments are R vectors the output will be a vector as well, so no need for loops or vectorization.
For example, instead of using mapply() and sum() functions we can just use + operator and sum up Xs and Ys.
```{r}
result=Xs+Ys
result
```
In order to get the column or row sums, we can use the vectorized functions colSums() and rowSums().
```{r}
colSums(mat)
rowSums(mat)
```
However, remember that not every function is vectorized in R, use the ones that are. But sooner or later, apply family functions will come in handy.
## Exercises
### Computations in R
1. Sum 2 and 3, use +
```{r,eval=FALSE,echo=FALSE}
2+3
```
2. Take the square root of 36, use sqrt()
```{r,echo=FALSE,eval=FALSE}
sqrt(36)
```
3. Take the log10 of 1000, use function `log10()`
```{r,echo=FALSE,eval=FALSE}
log10(1000)
```
4. Take the log2 of 32, use function `log2()`
```{r,echo=FALSE,eval=FALSE}
log2(32)
```
5. Assign the sum of 2,3 and 4 to variable x
```{r,echo=FALSE,eval=FALSE}
x = 2+3+4
x <- 2+3+4
```
6. Find the absolute value of `5 - 145` using `abs()` function
```{r,echo=FALSE,eval=FALSE}
abs(5-145)
```
7. Calculate the square root of 625, divide it by 5 and assign it to variable `x`.
Ex: `y= log10(1000)/5`, the previous statement takes log10 of 1000, divides it
by 5 and assigns the value to variable y
```{r,echo=FALSE,eval=FALSE}
x = sqrt(625)/5
```
8. Multiply the value you get from previous exercise with 10000, assign it variable x
Ex: `y=y*5`, multiplies y with 5 and assigns the value to y.
**KEY CONCEPT:** results of computations or arbitrary values can be stored in variables we can re-use those variables later on and over-write them with new values
```{r,echo=FALSE,eval=FALSE}
x2 = x*10000
```
### Data structures in R
10. Make a vector of 1,2,3,5 and 10 using `c()`, assign it to `vec` variable.
Ex: `vec1=c(1,3,4)` makes a vector out of 1,3,4.
```{r,echo=FALSE,eval=FALSE}
c(1:5,10)
vec1=c(1,2,3,4,5,10)
```
11. Check the length of your vector with length().
Ex: `length(vec1)` should return 3
```{r,echo=FALSE,eval=FALSE}
length(vec1)
```
12. Make a vector of all numbers between 2 and 15.
Ex: `vec=1:6` makes a vector of numbers between 1 and 6, assigns to vec variable
```{r,echo=FALSE,eval=FALSE}
vec=2:15
```
13. Make a vector of 4s repeated 10 times using `rep()` function. Ex: `rep(x=2,times=5)` makes a vector of 2s repeated 5 times
```{r,echo=FALSE,eval=FALSE}
rep(x=4,times=10)
rep(4,10)
```
14. Make a logical vector with TRUE, FALSE values of length 4, use `c()`.
Ex: `c(TRUE,FALSE)`
```{r,echo=FALSE,eval=FALSE}
c(TRUE,FALSE,FALSE,TRUE,FALSE)
c(TRUE,TRUE,FALSE,TRUE,FALSE)
```
15. Make a character vector of gene names PAX6,ZIC2,OCT4 and SOX2.
Ex: `avec=c("a","b","c")` a makes a character vector of a,b and c
```{r,echo=FALSE,eval=FALSE}
c("PAX6","ZIC2","OCT4","SOX2")
```
16. Subset the vector using `[]` notation, get 5th and 6th elements.
Ex: `vec1[1]` gets the first element. `vec1[c(1,3)]` gets 1st and 3rd elements
```{r,echo=FALSE,eval=FALSE}
vec1[c(5,6)]
```
17. You can also subset any vector using a logical vector in `[]`. Run the following:
```{r, eval=FALSE}
myvec=1:5
myvec[c(TRUE,TRUE,FALSE,FALSE,FALSE)] # the length of the logical vector should be equal to length(myvec)
myvec[c(TRUE,FALSE,FALSE,FALSE,TRUE)]
```
18. `==,>,<, >=, <=` operators create logical vectors. See the results of the following operations:
```{r,eval=FALSE}
myvec > 3
myvec == 4
myvec <= 2
myvec != 4
```
19. Use `>` operator in `myvec[ ]` to get elements larger than 2 in `myvec` whic is described above
```{r,echo=FALSE,eval=FALSE}
myvec[ myvec > 2 ]
myvec[ myvec == 2 ]
myvec[ myvec != 2 ]
```
20. make a 5x3 matrix (5 rows, 3 columns) using `matrix()`.
Ex: matrix(1:6,nrow=3,ncol=2) makes a 3x2 matrix using numbers between 1 and 6
```{r,echo=FALSE,eval=FALSE}
mat=matrix(1:15,nrow=5,ncol=3)
```
21. What happens when you use `byrow = TRUE` in your matrix() as an additional argument?
Ex: `mat=matrix(1:6,nrow=3,ncol=2,byrow = TRUE)`
```{r,echo=FALSE,eval=FALSE}
mat=matrix(1:15,nrow=5,ncol=3,byrow = TRUE)
```
22. Extract first 3 columns and first 3 rows of your matrix using `[]` notation.
```{r,echo=FALSE,eval=FALSE}
mat[1:3,1:3]
```
23. Extract last two rows of the matrix you created earlier.
Ex: `mat[2:3,]` or `mat[c(2,3),]` extracts 2nd and 3rd rows.
```{r,echo=FALSE,eval=FALSE}
mat[4:5,]
mat[c(nrow(mat)-1,nrow(mat)),]
tail(mat,n=1)
tail(mat,n=2)
```
24. Extract the first two columns and run `class()` on the result.
```{r,echo=FALSE,eval=FALSE}
class(mat[,1:2])
```
25. Extract first column and run `class()` on the result, compare with the above exercise.
```{r,echo=FALSE,eval=FALSE}
class(mat[,1])
```
26. Make a data frame with 3 columns and 5 rows, make sure first column is sequence
of numbers 1:5, and second column is a character vector.
Ex: `df=data.frame(col1=1:3,col2=c("a","b","c"),col3=3:1) # 3x3 data frame`.
Remember you need to make 3x5 data frame
```{r,echo=FALSE,eval=FALSE}
df=data.frame(col1=1:5,col2=c("a","2","3","b","c"),col3=5:1)
```
27. Extract first two columns and first two rows.
**HINT:** Same notation as matrices
```{r,echo=FALSE,eval=FALSE}
df[,1:2]
df[1:2,]
df[1:2,1:2]
```
28. Extract last two rows of the data frame you made.
**HINT:** Same notation as matrices
```{r,echo=FALSE,eval=FALSE}
df[,4:5]
```
29. Extract last two columns using column names of the data frame you made.
```{r,echo=FALSE,eval=FALSE}
df[,c("col2","col3")]
```
30. Extract second column using column names.
You can use `[]` or `$` as in lists, use both in two different answers.
```{r,echo=FALSE,eval=FALSE}
df$col2
df[,"col2"]
class(df["col1"])
class(df[,"col1"])
```
31. Extract rows where 1st column is larger than 3.
**HINT:** you can get a logical vector using `>` operator
,logical vectors can be used in `[]` when subsetting.
```{r,echo=FALSE,eval=FALSE}
df[df$col1 >3 , ]
```
32. Extract rows where 1st column is larger than or equal to 3.
```{r,echo=FALSE,eval=FALSE}
df[df$col1 >= 3 , ]
```
33. Convert data frame to the matrix. **HINT:** use `as.matrix()`.
Observe what happens to numeric values in the data frame.
```{r,echo=FALSE,eval=FALSE}
class(df[,c(1,3)])
as.matrix(df[,c(1,3)])
as.matrix(df)
```
34. Make a list using `list()` function, your list should have 4 elements
the one below has 2.
Ex: `mylist= list(a=c(1,2,3),b=c("apple,"orange"))`
```{r,echo=FALSE,eval=FALSE}
mylist= list(a=c(1,2,3),
b=c("apple","orange"),
c=matrix(1:4,nrow=2),
d="hello")
```
35. Select the 1st element of the list you made using `$` notation.
Ex: `mylist$a` selects first element named "a"
```{r,echo=FALSE,eval=FALSE}
mylist$a
```
36. Select the 4th element of the list you made earlier using `$` notation.
```{r,echo=FALSE,eval=FALSE}
mylist$d
```
37. Select the 1st element of your list using `[ ]` notation.
Ex: `mylist[1]` selects first element named "a", you get a list with one element.
Ex: `mylist["a"]` selects first element named "a", you get a list with one element.
```{r,echo=FALSE,eval=FALSE}
mylist[1] # -> still a list
mylist[[1]] # not a list
mylist["a"]
mylist[["a"]]
```
38. select the 4th element of your list using '[ ]' notation.
```{r,echo=FALSE,eval=FALSE}
mylist[4]
mylist[[4]]
```
39. Make a factor using factor(), with 5 elements.
Ex: `fa=factor(c("a","a","b"))`
```{r,echo=FALSE,eval=FALSE}
fa=factor(c("a","a","b","c","c"))
```
40. Convert a character vector to factor using `as.factor()`.
First, make a character vector using `c()` then use `as.factor()`.
```{r,echo=FALSE,eval=FALSE}
my.vec=c("a","a","b","c","c")
fa=as.factor(my.vec)
fa
```
41. Convert the factor you made above to character using `as.character()`.
```{r,echo=FALSE,eval=FALSE}
fa
as.character(fa)
as.numeric()
as.factor()
```
### Reading in and writing data out in R
42. Read CpG island (CpGi) data from the compGenomRData package `CpGi.table.hg18.txt`, this is a tab-separated file, store it in a variable called "cpgi". Use `cpgtFilePath=system.file("extdata",
"CpGi.table.hg18.txt",
package="compGenomRData")`
to get the file path within the installed `compGenomRData` package.
```{r,echo=FALSE,eval=FALSE}
cpgi=read.table(file="intro2R_data/data/CpGi.table.hg18.txt",header=TRUE,sep="\t")
cpgi=read.table("intro2R_data/data/CpGi.table.hg18.txt",header=TRUE,sep="\t")
```
43. Use `head()` on CpGi to see first few rows.
```{r,echo=FALSE,eval=FALSE}
head(cpgi)
tail(cpgi)
```
43. Why doesn't the following work? see 'sep' argument at `help(read.table)`.
```{r}
cpgtFilePath=system.file("extdata",
"CpGi.table.hg18.txt",
package="compGenomRData")
cpgtFilePath
cpgiSepComma=read.table(cpgtFilePath,header=TRUE,sep=",")
head(cpgiSepComma)
```
43. What happens when `stringsAsFactors=FALSE` ?
`cpgiHF=read.table("intro2R_data/data/CpGi.table.hg18.txt",header=FALSE,sep="\t",
stringsAsFactors=FALSE)`
```{r,echo=FALSE,eval=FALSE}
head(cpgiHF)
head(cpgi)
class(cpgiHF$V2)
class(cpgiHF$V2)
```
43. Read only first 10 rows of the CpGi table.
```{r,echo=FALSE,eval=FALSE}
cpgi10row=read.table("intro2R_data/data/CpGi.table.hg18.txt",header=TRUE,sep="\t",nrow=10)
cpgi10row
```
44. Use `cpgtFilePath=system.file("extdata","CpGi.table.hg18.txt",package="compGenomRData")` to get the file path, then use
`read.table()` with argument `header=FALSE`. Use `head()` to see the results.
```{r,echo=FALSE,eval=FALSE}
df=read.table("intro2R_data/data/CpGi.table.hg18.txt",header=FALSE,sep="\t")
head(df)
```
44. Write CpG islands to a text file called "my.cpgi.file.txt". Write the file
to your home folder, you can use `file="~/my.cpgi.file.txt"` in linux. `~/` denotes
home folder.
```{r,echo=FALSE,eval=FALSE}
write.table(cpgi,file="~/my.cpgi.file.txt")
```
45. Same as above but this time make sure use `quote=FALSE`,`sep="\t"` and `row.names=FALSE` arguments.
Save the file to "my.cpgi.file2.txt" and compare it with "my.cpgi.file.txt"
```{r,echo=FALSE,eval=FALSE}
write.table(cpgi,file="~/my.cpgi.file2.txt",quote=FALSE,sep="\t",row.names=FALSE)
```
46. Write out the first 10 rows of 'cpgi' data frame.
**HINT:** use subsetting for data frames we learned before.
```{r,echo=FALSE,eval=FALSE}
write.table(cpgi[1:10,],file="~/my.cpgi.fileNrow10.txt",quote=FALSE,sep="\t")
```
44. Write the first 3 columns of 'cpgi' data frame.
```{r,echo=FALSE,eval=FALSE}
dfnew=cpgi[,1:3]
write.table(dfnew,file="~/my.cpgi.fileCol3.txt",quote=FALSE,sep="\t")
```
45. Write CpG islands only on chr1. **HINT:** use subsetting with `[]`, feed a logical vector using `==` operator.
```{r,echo=FALSE,eval=FALSE}
write.table(cpgi[cpgi$chrom == "chr1",],file="~/my.cpgi.fileChr1.txt",
quote=FALSE,sep="\t",row.names=FALSE)
head(cpgi[cpgi$chrom == "chr1",])
```
46. Read two other data sets "rn4.refseq.bed" and "rn4.refseq2name.txt"
with `header=FALSE`, assign them to df1 and df2 respectively.
They are again included in the compGenomRData package, and you
can use `system.file()` function to get the file paths.
```{r,echo=FALSE,eval=FALSE}
df1=read.table("intro2R_data/data/rn4.refseq.bed",sep="\t",header=FALSE)
df2=read.table("intro2R_data/data/rn4.refseq2name.txt",sep="\t",header=FALSE)
```
47. Use `head()` to see what is inside of the the data frames above.
```{r,echo=FALSE,eval=FALSE}
head(df1)
head(df2)
```
48. Merge data sets using `merge()` and assign the results to variable named 'new.df', and use `head()` to see the results.
```{r,echo=FALSE,eval=FALSE}
new.df=merge(df1,df2,by.x="V4",by.y="V1")
head(new.df)
```
### Plotting in R
Please **run the following** for the rest of the exercises.
```{r}
set.seed(1001)
x1=1:100+rnorm(100,mean=0,sd=15)
y1=1:100