02-intro2R.Rmd

# Introduction to R for Genomic Data Analysis {#Rintro}

The aim of computational genomics is to provide biological interpretation and insights from high 
dimensional genomics data. Generally speaking, it is similar to any other kind 
of data analysis endeavor but often times doing computational genomics will require domain specific knowledge and tools.

As new high-throughput experimental techniques are on the rise, data analysis 
capabilities are sought-after features for researchers. The aim of this chapter is to first familiarize the readers with data analysis steps and then provide basics of R programming within the context of genomic data analysis. R is a free statistical programming language that is popular among researchers and data miners to build software and analyze data. Although 
basic R programming tutorials are easily accessible, we are aiming to introduce 
the subject with the genomic context in the background. The examples and 
narrative will always be from real-life situations when you try to analyze 
genomic data with R. We believe tailoring material to the context of genomics 
makes a difference when learning this programming language for sake of analyzing
genomic data.

## Steps of (genomic) data analysis
Regardless of the analysis type, the data analysis has a common pattern. We will
discuss this general pattern and how it applies to genomics problems. The data analysis steps typically include data collection, quality check and cleaning, processing, modeling, visualization and reporting. Although, one expects to go through these steps in a linear fashion, it is normal to go back and repeat the steps with different parameters or tools. In practice, data analysis requires going through the same steps over and over again in order to be able to do a combination of the following: a)  answering other related questions, b) dealing with data quality issues that are later realized, and, c) including new data sets to the analysis. 

We will now go through a brief explanation of the steps within the context of genomic data analysis.

### Data collection
Data collection refers to any source, experiment or survey that provides data for the data analysis question you have. In genomics, data collection is done by high-throughput assays introduced in chapter \@ref(intro). One can also use publicly available data sets and specialized databases also mentioned in chapter \@ref(intro). How much data and what type of data you should collect depends on the question you are trying to answer and the technical and biological variability of the system you are studying.

### Data quality check and cleaning
In general, data analysis almost always deals with imperfect data. It is 
common to have missing values or measurements that are noisy. Data quality check
and cleaning aims to identify any data quality issue and clean it from the dataset.

High-throughput genomics data is produced by technologies that could embed 
technical biases into the data. If we were to give an example from sequencing,
the sequenced reads do not have the same quality of bases called. Towards the
ends of the reads, you could have bases that might be called incorrectly. Identifying those low quality bases and removing them will improve read mapping step. 

### Data processing 
This step refers to processing the data to a format that is suitable for 
exploratory analysis and modeling. Often times, the data will not come in ready
to analyze format. You may need to convert it to other formats by transforming 
data points (such as log transforming, normalizing etc), or subset the data set
with some arbitrary or pre-defined condition. In terms of genomics, processing
includes multiple steps. Following the sequencing analysis example above, 
processing will include aligning reads to the genome and quantification over genes or regions of interest. This is simply counting how many reads are covering your regions of interest. This quantity can give you ideas about how much a gene is expressed if your experimental protocol was RNA sequencing. This can be followed by some normalization to aid the next step. 

### Exploratory data analysis and modeling
This phase usually takes in the processed or semi-processed data and applies machine-learning or statistical methods to explore the data. Typically, one needs to see relationship between variables measured, relationship between samples based on the variables measured. At this point, we might be looking to see if the samples group as expected by the experimental design, are there outliers or any other anomalies ? After this step you might want to do additional clean up or re-processing to deal with anomalies. 

Another related step is modeling. This generally refers to modeling your variable of interest based on other variables you measured. In the context of genomics, it could be that you are trying to predict disease status of the patients from expression of genes you measured from their tissue samples. Then your variable of interest is the disease status and . This is generally called predictive modeling and could be solved with regression based or any other machine-learning methods. This kind of approach is generally called "predictive modeling".


Statistical modeling would also be a part of this modeling step, this can cover predictive modeling as well where we use statistical methods such as linear regression. Other analyses such as hypothesis testing, where we have an expectation and we are trying to confirm that expectation is also related to statistical modeling. A good example of this in genomics is the differential gene expression analysis. This can be formulated as comparing two data sets, in this case expression values from condition A and condition B, with the expectation that condition A and condition B has similar expression values. You will see more on this in chapter \@ref(stats).
 
### Visualization and reporting
Visualization is necessary for all the previous steps more or less. But in the final phase, we need final figures, tables and text that describes the outcome of your analysis. This will be your report. In genomics, we use common data visualization methods as well as specific visualization methods developed or popularized by genomic data analysis. You will see many popular visualization methods in chapters \@ref(stats) and \@ref{genomicIntervals}

### Why use R for genomics ?
R, with its statistical analysis
heritage, plotting features and rich user-contributed packages is one of the 
best languages for the task of analyzing genomic data.
High-dimensional genomics datasets are usually suitable to 
be analyzed with core R packages and functions. On top of that, Bioconductor and CRAN have an 
array of specialized tools for doing genomics-specific analysis. Here is a list of computational genomics tasks that can be completed using R.

#### Data cleanup and processing

Most of general data clean up, such as removing incomplete columns and values, reorganizing and transforming data, these tasks can be achieved using R. In addition, with the help of packages R can connect to databases in various formats such as mySQL, mongoDB, etc., and query and get the data to R environment using database specific tools. 

On top of these, genomic data specific processing and quality check can be achieved via R/Bioconductor packages. For example, sequencing read quality checks and even HT-read alignments can be achieved via R packages. 

#### General data anaylsis and exploration

Most genomics data sets are suitable for application of general data analysis tools. In some cases, you may need to preprocess the data to get it to a state that is suitable for application of such tools. Here is a non-exhaustive list of what kind of things can be done via R.

 - unsupervised data analysis: clustering (k-means, hierarchical), matrix factorization 
(PCA, ICA etc.)
 - supervised data analysis: generalized linear models, support vector machines, random forests

#### Genomics-specific data analysis methods
 R/Bioconductor gives you access to multitude of other bioinformatics specific algorithms. Here are some of the things you can do.

 - Sequence analysis: TF binding motifs, GC content and CpG counts of a given DNA sequence
 - Differential expression (or arrays and sequencing based measurements)
 - Gene set/Pathway analysis: What kind of genes are enriched in my gene set
 - Genomic Interval operations such as Overlapping CpG islands with transcription start sites, and filtering based on overlaps
 - Overlapping aligned reads with exons and counting aligned reads per gene
 
#### Visualization
Visualization is an important part of all data analysis techniques including computational genomics. Again, you can use core visualization techniques in R and also genomics specific ones with the help of specific packages. Here are some of the things you can do with R.

 - Basic plots: Histograms, scatter plots, bar plots, box plots, heatmaps
 - ideograms and circos plots for genomics provides visualization of different features over the whole genome.
 - meta-profiles of genomic features, such as read enrichment over all promoters
 - Visualization of quantitative assays for given locus in the genome

## Getting started with R
Download and install R http://cran.r-project.org/ and RStudio http://www.rstudio.com/ if you do not have them already. Rstudio is optional but it is a great tool if you are just starting to learn R.
You will need specific data sets to run the codes in this document. Download the data.zip[URL to come] and extract it to your directory of choice. The folder name should be “data” and your R working directory should be level above the data folder. That means in your R console, when you type “dir(“data”)” you should be able to see the contents of the data folder. You can change your working directory by *setwd()* command and get your current working directory with *getwd()* command in R. In RStudio, you can click on the top menu and change the location of your working directory via user interface.


### Installing packages
R packages are add-ons to base R that help you achieve additional tasks that are not directly supported by base R. It is by the action of these extra functionality that R excels as a tool for computational genomics. Bioconductor project (http://bioconductor.org/) is a dedicated package repository for computational biology related packages. However main package repository of R, called CRAN, has also computational biology related packages. In addition, R-Forge(http://r-forge.r-project.org/), GitHub(https://github. com/), and googlecode(http://code.google.com) are other locations where R packages might be hosted.
You can install CRAN packages using install.packages(). (# is the comment character in R)
```{r,eval=FALSE}
# install package named "randomForests" from CRAN
install.packages("randomForests")
```
You can install bioconductor packages with a specific installer script
```{r,eval=FALSE}
# get the installer package
source("http://bioconductor.org/biocLite.R")
# install bioconductor package "rtracklayer"
biocLite("rtracklayer")
```
You can install packages from github using *install_github()* function from **devtools**
```{r,eval=FALSE}
library(devtools)
install_github("hadley/stringr")
```
Another way to install packages are from the source.
```{r,eval=FALSE}
# download the source file
download.file("http://goo.gl/3pvHYI",
               destfile="methylKit_0.5.7.tar.gz")
# install the package from the source file
install.packages("methylKit_0.5.7.tar.gz",
                 repos=NULL,type="source")
# delete the source file
unlink("methylKit_0.5.7.tar.gz")
```
You  can also update CRAN and Bioconductor packages.
```{r,eval=FALSE}
# updating CRAN packages
update.packages()

# updating bioconductor packages
source("http://bioconductor.org/biocLite.R")
biocLite("BiocUpgrade")
```

### Installing packages in custom locations
If you will be using R on servers or computing clusters rather than your personal computer it is unlikely that you will have administrator access to install packages. In that case, you can install packages in custom locations by telling R where to look for additional packages. This is done by setting up an .Renviron file in your home directory and add the following line:
```
R_LIBS=~/Rlibs
```

This tells R that “Rlibs” directory at your home directory will be the first choice of locations to look for packages and install packages (The directory name and location is up to you above is just an example). You should go and create that directory now. After that, start a fresh R session and start installing packages. From now on, packages will be installed to your local directory where you have read-write access.

### Getting help on functions and packages
You can get help on functions by help() and help.search() functions. You can list the functions in a package with ls() function

```{r,eval=FALSE}
library(MASS)
ls("package:MASS") # functions in the package
ls() # objects in your R enviroment
# get help on hist() function
?hist
help("hist")
# search the word "hist" in help pages
help.search("hist")
??hist

```
#### More help needed?
In addition, check package vignettes for help and practical understanding of the functions. All Bionconductor packages have vignettes that walk you through example analysis. Google search will always be helpful as well, there are many blogs and web pages that have posts about R. R-help, Stackoverflow and R-bloggers are usually source of good and reliable information.


## Computations in R
R can be used as an ordinary calculator, some say it is an over-grown calculator. Here are some examples. Remember **#** is the comment character. The comments give details about the operations in case they are not clear.
```r{}
2 + 3 * 5       # Note the order of operations.
log(10)        # Natural logarithm with base e
5^2            # 5 raised to the second power
3/2            # Division
sqrt(16)      # Square root
abs(3-7)      # Absolute value of 3-7
pi             # The number
exp(2)        # exponential function
# This is a comment line


```

## Data structures
R has multiple data structures. If you are familiar with excel you can think of a single excel sheet as a table and data structures as building blocks of that table. Most of the time you will deal with tabular data sets or you will want to transform your raw data to a tabular data set, and you will try to manipulate this tabular data set in some way. For example, you may want to take sub-sections of the table or extract all the values in a column. For these and similar purposes, it is essential to know what are the common data structures in R and how they can be used. R deals with named data structures, this means you can give names to data structures and manipulate or operate on them using those names. It will be clear soon what we mean by this if "named data structures" does not ring a bell.

### Vectors
Vectors are one the core R data structures. It is basically a list of elements of the same type (numeric,character or logical). Later you will see that every column of a table will be represented as a vector. R handles vectors easily and intuitively. You can create vectors with c() function, however that is not the only way. The operations on vectors will propagate to all the elements of the vectors.

```{r}
x<-c(1,3,2,10,5)    #create a vector named x with 5 components
x = c(1,3,2,10,5)  
x
y<-1:5              #create a vector of consecutive integers y
y+2                 #scalar addition
2*y                 #scalar multiplication
y^2                 #raise each component to the second power
2^y                 #raise 2 to the first through fifth power
y                   #y itself has not been unchanged
y<-y*2
y                   #it is now changed
r1<-rep(1,3)        # create a vector of 1s, length 3
length(r1)           #length of the vector
class(r1)            # class of the vector
a<-1                # this is actually a vector length one
```
### Matrices
A matrix refers to a numeric array of rows and columns. You can think of it as a stacked version of vectors where each row or column is a vector. One of the easiest ways to create a matrix is to combine vectors of equal length using *cbind()*, meaning 'column bind'.

```{r}
x<-c(1,2,3,4)
y<-c(4,5,6,7)
m1<-cbind(x,y);m1
t(m1)                # transpose of m1
dim(m1)              # 2 by 5 matrix
```
You can also directly list the elements and specify the matrix:
```{r}
m2<-matrix(c(1,3,2,5,-1,2,2,3,9),nrow=3)
m2
```
Matrices and the next data structure **data frames** are tabular data structures. You can subset them using **[]** and providing desired rows and columns to subset. Figure  \@ref(fig:slicingDataFrames) shows how that works conceptually.

```{r,slicingDataFrames,fig.cap="slicing/subsetting of a matrix and a data frame",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/slicingDataFrames.png" )
```

### Data Frames
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). A data frame can be constructed by data.frame() function. For example, we illustrate how to construct a data frame from genomic intervals or coordinates.

```{r}
chr <- c("chr1", "chr1", "chr2", "chr2")
strand <- c("-","-","+","+")
start<- c(200,4000,100,400)
end<-c(250,410,200,450)
mydata <- data.frame(chr,start,end,strand)
#change column names
names(mydata) <- c("chr","start","end","strand")
mydata # OR this will work too
mydata <- data.frame(chr=chr,start=start,end=end,strand=strand)
mydata
```
There are a variety of ways to extract the elements of a data frame. You can extract certain columns using column numbers or names, or you can extract certain rows by using row numbers. You can also extract data using logical arguments, such as extracting all rows that has a value in a column larger than your threshold.

```{r}
mydata[,2:4] # columns 2,3,4 of data frame
mydata[,c("chr","start")] # columns chr and start from data frame
mydata$start # variable start in the data frame
mydata[c(1,3),] # get 1st and 3rd rows
mydata[mydata$start>400,] # get all rows where start>400
```

### Lists
An ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.
```{r}
# example of a list with 4 components
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred",
       mynumbers=c(1,2,3),
       mymatrix=matrix(1:4,ncol=2),
       age=5.3)
w
```
You can extract elements of a list using the **[[]]** convention using either its position in the list or its name.
```{r}
w[[3]] # 3rd component of the list
w[["mynumbers"]] # component named mynumbers in list
w$age
```

### Factors
Factors are used to store categorical data. They are important for statistical modeling since categorical variables are treated differently in statistical models than continuous variables. This ensures categorical data treated accordingly in statistical models.
```{r}
features=c("promoter","exon","intron")
f.feat=factor(features)
```
Important thing to note is that when you are reading a data.frame with read.table() or creating a data frame with **data.frame()** character columns are stored as factors by default, to change this behavior you need to set **stringsAsFactors=FALSE** in **read.table()** and/or **data.frame()** function arguments.


## Data types
There are four common data types in R, they are **numeric**, **logical**, **character** and **integer**. All these data types can be used to create vectors natively.
```{r}
#create a numeric vector x with 5 components
x<-c(1,3,2,10,5)
x
#create a logical vector x
x<-c(TRUE,FALSE,TRUE)
x
# create a character vector
x<-c("sds","sd","as")
x
class(x)
# create an integer vector
x<-c(1L,2L,3L)
x
class(x)
```


## Reading and writing data
Most of the genomics data sets are in the form of genomic intervals associated with a score. That means mostly the data will be in table format with columns denoting chromosome, start positions, end positions, strand and score. One of the popular formats is BED format used primarily by UCSC genome browser but most other genome browsers and tools will support BED format. We have all the annotation data in BED format. In R, you can easily read tabular format data with read.table() function.
```{r}
enhancerFilePath=system.file("extdata",
                "subset.enhancers.hg18.bed",
                package="compGenomRData")
cpgiFilePath=system.file("extdata",
                "subset.cpgi.hg18.bed",
                package="compGenomRData")
# read enhancer marker BED file
enh.df <- read.table(enhancerFilePath, header = FALSE) 

# read CpG island BED file
cpgi.df <- read.table(cpgiFilePath, header = FALSE) 

# check first lines to see how the data looks like
head(enh.df)
head(cpgi.df)
```

You can save your data by writing it to disk as a text file. A data frame or matrix can be written out by using write.table() function. Now let us write out cpgi.df, we will write it out as a tab-separated file, pay attention to the arguments.
```{r,tidy=FALSE,eval=FALSE}
write.table(cpgi.df,file="cpgi.txt",quote=FALSE,
            row.names=FALSE,col.names=FALSE,sep="\t")
```
You can save your R objects directly into a file using save() and saveRDS() and load them back in with load() and readRDS(). By using these functions you can save any R object whether or not they are in data frame or matrix classes.
```{r,eval= FALSE}
save(cpgi.df,enh.df,file="mydata.RData")
load("mydata.RData")
# saveRDS() can save one object at a type
saveRDS(cpgi.df,file="cpgi.rds")
x=readRDS("cpgi.rds")
head(x)
```
One important thing is that with `save()` you can save many objects at a time and when they are loaded into memory with `load(` they retain their variable names. For example, in the above code when you use `load("mydata.RData")` in a fresh R session, an object names “cpg.df” will be created. That means you have to figure out what name you gave it to the objects before saving them. On the contrary to that, when you save an object by `saveRDS()` and read by `readRDS()` the name of the object is not retained, you need to assign the output of `readRDS()` to a new variable (“x” in the above code chunk).


## Plotting in R
R has great support for plotting and customizing plots. We will show only a few below. Let us sample 50 values from normal distribution and plot them as a histogram.

```{r,out.width='65%',fig.width=6.5}
# sample 50 values from normal distribution
# and store them in vector x
x<-rnorm(50)
hist(x) # plot the histogram of those values
```

We can modify all the plots by providing certain arguments to the plotting function. Now let's give a title to the plot using **'main'** argument. We can also change the color of the bars using **'col'** argument. You can simply provide the name of the color. Below, we are using **'red'** for the color. See Figure below for the result this chunk.

```{r,out.width='65%',fig.width=6.5}
hist(x,main="Hello histogram!!!",col="red")
```

Next, we will make a scatter plot. Scatter plots are one  the most common plots you will encounter in data analysis. We will sample another set of 50 values and plotted those against the ones we sampled earlier. Scatterplot shows values of two variables for a set of data points. It is useful to visualize relationships between two variables. It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show density of the points with different colors. We will show examples of those that in following chapters. The scatter plot from our sampling experiment is shown in the figure. Notice that, in addition to main we used **“xlab”** and **“ylab”** arguments to give labels to the plot. You can customize the plots even more than this. See **?plot** and **?par** for more arguments that can help you customize the plots.

```{r,out.width='65%',fig.width=5.5}
# randomly sample 50 points from normal distribution
y<-rnorm(50)
#plot a scatter plot
# control x-axis and y-axis labels
plot(x,y,main="scatterplot of random samples",
        ylab="y values",xlab="x values")
```

we can also plot boxplots for vectors x and y. Boxplots depict groups of numerical data through their quartiles. The edges of the box denote 1st and 3rd quartile, and the line that crosses the box is the median. Whiskers usually are defined using interquantile range:

*lowerWhisker=Q1-1.5[IQR] and upperWhisker=Q1+1.5[IQR]*

In addition, outliers can be depicted as dots. In this case, outliers are the values that remain outside the whiskers.

```{r,out.width='65%',fig.width=5.5}
 boxplot(x,y,main="boxplots of random samples")
```
Next up is bar plot which you can plot by **barplot()** function. We are going to plot four imaginary percentage values and color them with two colors, and this time we will also show how to draw a legend on the plot using **legend()** function.

```{r,,out.width='65%',fig.width=5.5,tidy=FALSE}
perc=c(50,70,35,25)
barplot(height=perc,
        names.arg=c("CpGi","exon","CpGi","exon"),
        ylab="percentages",main="imagine %s",
        col=c("red","red","blue","blue"))
legend("topright",legend=c("test","control"),
       fill=c("red","blue"))
```

## Saving plots
If you want to save your plots to an image file there are couple of ways of doing that. Normally, you will have to do the following:
 1. Open a graphics device
 2. Create the plot
 3. Close the graphics device

```{r,eval=FALSE}
pdf("mygraphs/myplot.pdf",width=5,height=5)
plot(x,y)
dev.off()
```
 Alternatively, you can first create the plot then copy the plot to a graphic device.
```{r,eval=FALSE}
plot(x,y)
dev.copy(pdf,"mygraphs/myplot.pdf",width=7,height=5)
dev.off()
```

## Functions and control structures (for, if/else etc.)

### User defined functions
Functions are useful for transforming larger chunks of code to re-usable pieces of code. Generally, if you need to execute certain tasks with variable parameters then it is time you write a function. A function in R takes different arguments and returns a definite output, much like mathematical functions. Here is a simple function takes two arguments, x and y, and returns the sum of their squares.

```{r}
sqSum<-function(x,y){
result=x^2+y^2
return(result)
}
# now try the function out
sqSum(2,3)
```


Functions can also output plots and/or messages to the terminal. Here is a function that prints a message to the terminal:
```{r}
sqSumPrint<-function(x,y){
result=x^2+y^2
cat("here is the result:",result,"\n")
}
# now try the function out
sqSumPrint(2,3)
```

Sometimes we would want to execute a certain part of the code only if certain condition is satisfied. This condition can be anything from the type of an object (Ex: if object is a matrix execute certain code), or it can be more complicated such as if object value is between certain thresholds. Let us see how they can be used3. They can be used anywhere in your code, now we will use it in a function.

```{r,eval=FALSE}
cpgi.df <- read.table("intro2R_data/data/subset.cpgi.hg18.bed", header = FALSE)
# function takes input one row
# of CpGi data frame
largeCpGi<-function(bedRow){
 cpglen=bedRow[3]-bedRow[2]+1
 if(cpglen>1500){
	cat("this is large\n")
 }
 else if(cpglen<=1500 & cpglen>700){
	cat("this is normal\n")
 }
 else{
	cat("this is short\n")
 }
}
largeCpGi(cpgi.df[10,])
largeCpGi(cpgi.df[100,])
largeCpGi(cpgi.df[1000,])
```

### Loops and looping structures in R
When you need to repeat a certain task or a execute a function multiple times, you can do that with the help of loops. A loop will execute the task until a certain condition is reached. The loop below is called a “for-loop” and it executes the task sequentially 10 times.
```{r}
for(i in 1:10){ # number of repetitions
cat("This is iteration") # the task to be repeated
print(i)
}
```
The task above is a bit pointless, normally in a loop, you would want to do something meaningful. Let us calculate the length of the CpG islands we read in earlier. Although this is not the most efficient way of doing that particular task, it serves as a good example for looping. The code below will be execute hundred times, and it will calculate the length of the CpG islands for the first 100 islands in
the data frame (by subtracting the end coordinate from the start coordinate).


**Note:**If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it for thousands of times.

```{r}
# this is where we will keep the lenghts
# for now it is an empty vector
result=c()
# start the loop
for(i in 1:100){
    #calculate the length
	len=cpgi.df[i,3]-cpgi.df[i,2]+1
    #append the length to the result
	result=c(result,len)
}
# check the results
head(result)
```

#### apply family functions instead of loops
R has other ways of repeating tasks that tend to be more efficient than using loops. They are known as the **“apply”** family of functions, which include *apply, lapply, mapply and tapply* (and some other variants). All of these functions apply a given function to a set of instances and returns the result of those functions for each instance. The differences between them is that they take different type of inputs. For example apply works on data frames or matrices and applies the function on each row or column of the data structure. *lapply* works on lists or vectors and applies a function which takes the list element as an argument. Next we will demonstrate how to use **apply()** on a matrix. The example applies the sum function on the rows of a matrix, it basically sums up the values on each row of the matrix, which is conceptualized in Figure  \@ref(fig:applyConcept).


```{r,applyConcept,fig.cap="apply concept in R",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/apply.png" )
```

```{r}
mat=cbind(c(3,0,3,3),c(3,0,0,0),c(3,0,0,3),c(1,1,0,0),c(1,1,1,0),c(1,1,1,0))
result<-apply(mat,1,sum)
result
# OR you can define the function as an argument to apply()
result<-apply(mat,1,function(x) sum(x))
result
```
Notice that we used a second argument which equals to 1, that indicates that rows of the matrix/ data frame will be the input for the function. If we change the second argument to 2, this will indicate that columns should be the input for the function that will be applied. See Figure \@ref(fig:applyConcept2) for the visualization of apply() on columns.


```{r,applyConcept2,fig.cap="apply function on columns",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/apply2.png" )
```

```{r}
result<-apply(mat,2,sum)
result
```
Next, we will use **lapply()**, which applies a function on a list or a vector. The function that will be applied is a simple function that takes the square of a given number.

```{r}
input=c(1,2,3)
lapply(input,function(x) x^2)
```
**mapply()** is another member of apply family, it can apply a function on an unlimited set of vectors/lists, it is like a version of lapply that can handle multiple vectors as arguments. In this case, the argument to the mapply() is the function to be applied and the sets of parameters to be supplied as arguments of the function. This conceptualized Figure \@ref(fig:mapplyConcept), the function to be applied is a function that takes to arguments and sums them up. The arguments to be summed up are in the format of vectors, Xs and Ys. `mapply()` applies the summation function to each pair in Xs and Ys vector. Notice that the order of the input function and extra arguments are different for mapply.

```{r,mapplyConcept,fig.cap="mapply concept",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/mapply.png" )
```

```{r}
Xs=0:5
Ys=c(2,2,2,3,3,3)
result<-mapply(function(x,y) sum(x,y),Xs,Ys)
result
```

#### apply family functions on multiple cores
If you have large data sets apply family functions can be slow (although probably still better than for loops). If that is the case, you can easily use the parallel versions of those functions from parallel package. These functions essentially divide your tasks to smaller chunks run them on separate CPUs and merge the results from those parallel operations. This concept is visualized at Figure below , `mcapply` runs the summation function on three different processors. Each processor executes the summation function on a part of the data set, and the results are merged and returned as a single vector that has the same order as the input parameters Xs and Ys.

```{r,mcapplyConcept,fig.cap="mcapplyconcept",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/mcmapply.png" )
```

#### Vectorized Functions in R
The above examples have been put forward to illustrate functions and loops in R because functions using sum() are not complicated and easy to understand. You will probably need to use loops and looping structures with more complicated functions. In reality, most of the operations we used do not need the use of loops or looping structures because there are already vectorized functions that can achieve the same outcomes, meaning if the input arguments are R vectors the output will be a vector as well, so no need for loops or vectorization.

For example, instead of using mapply() and sum() functions we can just use + operator and sum up Xs and Ys.
```{r}
result=Xs+Ys
result
```
In order to get the column or row sums, we can use the vectorized functions colSums() and rowSums().
```{r}
colSums(mat)
rowSums(mat)
```
However, remember that not every function is vectorized in R, use the ones that are. But sooner or later, apply family functions will come in handy.


## Exercises

### Computations in R


1. Sum 2 and 3, use +

```{r,eval=FALSE,echo=FALSE}
2+3
```


2. Take the square root of 36, use sqrt()

```{r,echo=FALSE,eval=FALSE}
sqrt(36)
```

3. Take the log10 of 1000, use function `log10()`

```{r,echo=FALSE,eval=FALSE}
log10(1000)
```

4. Take the log2 of 32, use function `log2()`


```{r,echo=FALSE,eval=FALSE}
log2(32)
```

5. Assign the sum of 2,3 and 4 to variable x

```{r,echo=FALSE,eval=FALSE}
x =  2+3+4
x <- 2+3+4
```


6. Find the absolute value of  `5 - 145` using `abs()` function
```{r,echo=FALSE,eval=FALSE}
abs(5-145)
```

7. Calculate the square root of 625, divide it by 5 and assign it to variable `x`.
Ex: `y= log10(1000)/5`, the previous statement takes log10 of 1000, divides it 
by 5 and assigns the value to variable y

```{r,echo=FALSE,eval=FALSE}
x = sqrt(625)/5
 
```

8. Multiply the value you get from previous exercise with 10000, assign it variable x
Ex: `y=y*5`, multiplies y with 5 and assigns the value to y.

**KEY CONCEPT:** results of computations or arbitrary values can be stored in variables we can re-use those variables later on and over-write them with new values

```{r,echo=FALSE,eval=FALSE}
x2 = x*10000
```

### Data structures in R 


10. Make a vector of 1,2,3,5 and 10 using `c()`, assign it to `vec` variable.
Ex: `vec1=c(1,3,4)` makes a vector out of 1,3,4.

```{r,echo=FALSE,eval=FALSE}
c(1:5,10)
vec1=c(1,2,3,4,5,10)
```

11. Check the length of your vector with length(). 
Ex: `length(vec1)` should return 3

```{r,echo=FALSE,eval=FALSE}
length(vec1)
```

12. Make a vector of all numbers between 2 and 15.
Ex: `vec=1:6` makes a vector of numbers between 1 and 6, assigns to vec variable

```{r,echo=FALSE,eval=FALSE}
vec=2:15 
```

13. Make a vector of  4s repeated 10 times using `rep()` function. Ex: `rep(x=2,times=5)` makes a vector of 2s repeated 5 times

```{r,echo=FALSE,eval=FALSE}
rep(x=4,times=10)
rep(4,10)
```

14. Make a logical vector with TRUE, FALSE values of length 4, use `c()`.
Ex: `c(TRUE,FALSE)`

```{r,echo=FALSE,eval=FALSE}
c(TRUE,FALSE,FALSE,TRUE,FALSE)
c(TRUE,TRUE,FALSE,TRUE,FALSE)
``` 

15. Make a character vector of gene names PAX6,ZIC2,OCT4 and SOX2.
Ex: `avec=c("a","b","c")` a makes a character vector of a,b and c

```{r,echo=FALSE,eval=FALSE}
c("PAX6","ZIC2","OCT4","SOX2")
```

16. Subset the vector using `[]` notation, get 5th and 6th elements. 
Ex: `vec1[1]` gets the first element. `vec1[c(1,3)]` gets 1st and 3rd elements

```{r,echo=FALSE,eval=FALSE}
vec1[c(5,6)]
```

17. You can also subset any vector using a logical vector in `[]`. Run the following:

```{r, eval=FALSE}
myvec=1:5
myvec[c(TRUE,TRUE,FALSE,FALSE,FALSE)] # the length of the logical vector should be equal to length(myvec) 
myvec[c(TRUE,FALSE,FALSE,FALSE,TRUE)]
``` 

18. `==,>,<, >=, <=` operators create logical vectors. See the results of the following operations:

```{r,eval=FALSE}
myvec > 3
myvec == 4
myvec <= 2
myvec != 4
```

19.  Use `>` operator in `myvec[ ]` to get elements larger than 2 in `myvec` whic is described above
```{r,echo=FALSE,eval=FALSE}
myvec[ myvec > 2 ] 
myvec[ myvec == 2 ] 
myvec[ myvec != 2 ] 
```


20. make a 5x3 matrix (5 rows, 3 columns) using `matrix()`.
Ex: matrix(1:6,nrow=3,ncol=2) makes a 3x2 matrix using numbers between 1 and 6

```{r,echo=FALSE,eval=FALSE}

mat=matrix(1:15,nrow=5,ncol=3)

```

21. What happens when you use `byrow = TRUE` in your matrix() as an additional argument?
Ex: `mat=matrix(1:6,nrow=3,ncol=2,byrow = TRUE)`

```{r,echo=FALSE,eval=FALSE}
mat=matrix(1:15,nrow=5,ncol=3,byrow = TRUE)
```

22. Extract first 3 columns and first 3 rows of your matrix using `[]` notation.

```{r,echo=FALSE,eval=FALSE}
mat[1:3,1:3]
```

23. Extract last two rows of the matrix you created earlier.

Ex: `mat[2:3,]` or `mat[c(2,3),]` extracts 2nd and 3rd rows.

```{r,echo=FALSE,eval=FALSE}
mat[4:5,] 
mat[c(nrow(mat)-1,nrow(mat)),] 
tail(mat,n=1)
tail(mat,n=2)
```

24. Extract the first two columns and run `class()` on the result.

```{r,echo=FALSE,eval=FALSE}
class(mat[,1:2])
```

25. Extract first column and run `class()` on the result, compare with the above exercise.

```{r,echo=FALSE,eval=FALSE}
class(mat[,1])
```

26. Make a data frame with 3 columns and 5 rows, make sure first column is sequence
of numbers 1:5, and second column is a character vector.

Ex: `df=data.frame(col1=1:3,col2=c("a","b","c"),col3=3:1) # 3x3 data frame`.

Remember you need to make 3x5 data frame

```{r,echo=FALSE,eval=FALSE}
df=data.frame(col1=1:5,col2=c("a","2","3","b","c"),col3=5:1)
```

27. Extract first two columns and first two rows. 

**HINT:** Same notation as matrices

```{r,echo=FALSE,eval=FALSE}
df[,1:2]

df[1:2,]
df[1:2,1:2]
```

28. Extract last two rows of the data frame you made.

**HINT:** Same notation as matrices

```{r,echo=FALSE,eval=FALSE}
df[,4:5]
```

29. Extract last two columns using column names of the data frame you made.

```{r,echo=FALSE,eval=FALSE}
df[,c("col2","col3")]
```

30. Extract second column using column names.
You can use `[]` or `$` as in lists, use both in two different answers.

```{r,echo=FALSE,eval=FALSE}
df$col2
df[,"col2"]
class(df["col1"])
class(df[,"col1"])
```

31. Extract rows where 1st column is larger than 3.
**HINT:** you can get a logical vector using `>` operator
,logical vectors can be used in `[]` when subsetting.

```{r,echo=FALSE,eval=FALSE}
df[df$col1 >3 , ]
```

32. Extract rows where 1st column is larger than or equal to 3.

```{r,echo=FALSE,eval=FALSE}
df[df$col1 >= 3 , ]
```

33. Convert data frame to the matrix. **HINT:** use `as.matrix()`.

Observe what happens to numeric values in the data frame.

```{r,echo=FALSE,eval=FALSE}
class(df[,c(1,3)])
as.matrix(df[,c(1,3)])
as.matrix(df)
```

34. Make a list using `list()` function, your list should have 4 elements
the one below has 2.

Ex: `mylist= list(a=c(1,2,3),b=c("apple,"orange"))`

```{r,echo=FALSE,eval=FALSE}
mylist= list(a=c(1,2,3),
            b=c("apple","orange"),
            c=matrix(1:4,nrow=2),
            d="hello")
```

35. Select the 1st element of the list you made using `$` notation.

Ex: `mylist$a` selects first element named "a"

```{r,echo=FALSE,eval=FALSE}
mylist$a
```

36. Select the 4th element of the list you made earlier using `$` notation.

```{r,echo=FALSE,eval=FALSE}
mylist$d
```

37. Select the 1st element of your list using `[ ]` notation. 

Ex: `mylist[1]` selects first element named "a", you get a list with one element.

Ex: `mylist["a"]` selects first element named "a", you get a list with one element.

```{r,echo=FALSE,eval=FALSE}
mylist[1] # -> still a list
mylist[[1]] # not a list

mylist["a"] 
mylist[["a"]] 
```

38. select the 4th element of your list using '[ ]' notation.

```{r,echo=FALSE,eval=FALSE}
mylist[4] 
mylist[[4]]
```

39. Make a factor using factor(), with 5 elements.
Ex: `fa=factor(c("a","a","b"))`

```{r,echo=FALSE,eval=FALSE}
fa=factor(c("a","a","b","c","c"))
```

40. Convert a character vector to factor using `as.factor()`.
First, make a character vector using `c()` then use `as.factor()`.

```{r,echo=FALSE,eval=FALSE}
my.vec=c("a","a","b","c","c")
fa=as.factor(my.vec)
fa
```

41. Convert the factor you made above to character using `as.character()`.

```{r,echo=FALSE,eval=FALSE}
fa
as.character(fa)
as.numeric()
as.factor()
```

### Reading in and writing data out in R

42. Read CpG island (CpGi) data from the compGenomRData package `CpGi.table.hg18.txt`, this is a tab-separated file, store it in a variable called "cpgi". Use `cpgtFilePath=system.file("extdata",
                "CpGi.table.hg18.txt",
                package="compGenomRData")`
to get the file path within the installed `compGenomRData` package.

```{r,echo=FALSE,eval=FALSE}
cpgi=read.table(file="intro2R_data/data/CpGi.table.hg18.txt",header=TRUE,sep="\t")
cpgi=read.table("intro2R_data/data/CpGi.table.hg18.txt",header=TRUE,sep="\t")
```

43. Use `head()` on CpGi to see first few rows.

```{r,echo=FALSE,eval=FALSE}
head(cpgi)
tail(cpgi)
```


43. Why doesn't the following work? see 'sep' argument at `help(read.table)`.

```{r}
cpgtFilePath=system.file("extdata",
                "CpGi.table.hg18.txt",
                package="compGenomRData")
cpgtFilePath
cpgiSepComma=read.table(cpgtFilePath,header=TRUE,sep=",")
head(cpgiSepComma)
```

43. What happens when `stringsAsFactors=FALSE` ?
`cpgiHF=read.table("intro2R_data/data/CpGi.table.hg18.txt",header=FALSE,sep="\t",
                  stringsAsFactors=FALSE)`
                  
```{r,echo=FALSE,eval=FALSE}                  
head(cpgiHF)
head(cpgi)
class(cpgiHF$V2)
class(cpgiHF$V2)
```

43. Read only first 10 rows of the CpGi table.

```{r,echo=FALSE,eval=FALSE}
cpgi10row=read.table("intro2R_data/data/CpGi.table.hg18.txt",header=TRUE,sep="\t",nrow=10)
cpgi10row
```

44. Use `cpgtFilePath=system.file("extdata","CpGi.table.hg18.txt",package="compGenomRData")` to get the file path, then use
`read.table()` with argument `header=FALSE`. Use `head()` to see the results.

```{r,echo=FALSE,eval=FALSE}
df=read.table("intro2R_data/data/CpGi.table.hg18.txt",header=FALSE,sep="\t")
head(df)
```

44. Write CpG islands to a text file called "my.cpgi.file.txt". Write the file
to your home folder, you can use `file="~/my.cpgi.file.txt"` in linux. `~/` denotes
home folder.

```{r,echo=FALSE,eval=FALSE}
write.table(cpgi,file="~/my.cpgi.file.txt")
```

45. Same as above but this time make sure use `quote=FALSE`,`sep="\t"` and `row.names=FALSE` arguments. 
Save the file to "my.cpgi.file2.txt" and compare it with "my.cpgi.file.txt"

```{r,echo=FALSE,eval=FALSE}
write.table(cpgi,file="~/my.cpgi.file2.txt",quote=FALSE,sep="\t",row.names=FALSE)
```

46. Write out the first 10 rows of 'cpgi' data frame.
**HINT:** use subsetting for data frames we learned before.

```{r,echo=FALSE,eval=FALSE}
write.table(cpgi[1:10,],file="~/my.cpgi.fileNrow10.txt",quote=FALSE,sep="\t")
```

44. Write the first 3 columns of 'cpgi' data frame.

```{r,echo=FALSE,eval=FALSE}
dfnew=cpgi[,1:3]
write.table(dfnew,file="~/my.cpgi.fileCol3.txt",quote=FALSE,sep="\t")
```

45. Write CpG islands only on chr1. **HINT:** use subsetting with `[]`, feed a logical vector using `==` operator.

```{r,echo=FALSE,eval=FALSE}
write.table(cpgi[cpgi$chrom == "chr1",],file="~/my.cpgi.fileChr1.txt",
            quote=FALSE,sep="\t",row.names=FALSE)
head(cpgi[cpgi$chrom == "chr1",])
```

46. Read two other data sets "rn4.refseq.bed" and "rn4.refseq2name.txt"
with `header=FALSE`, assign them to df1 and df2 respectively.
They are again included in the compGenomRData package, and you
can use `system.file()` function to get the file paths.

```{r,echo=FALSE,eval=FALSE}
df1=read.table("intro2R_data/data/rn4.refseq.bed",sep="\t",header=FALSE)
df2=read.table("intro2R_data/data/rn4.refseq2name.txt",sep="\t",header=FALSE)
```

47. Use `head()` to see what is inside of the the data frames above.
```{r,echo=FALSE,eval=FALSE}
head(df1)
head(df2)
```

48. Merge data sets using `merge()` and assign the results to variable named 'new.df', and use `head()` to see the results.

```{r,echo=FALSE,eval=FALSE}
new.df=merge(df1,df2,by.x="V4",by.y="V1")
head(new.df)
```


### Plotting in R


Please **run the following** for the rest of the exercises. 
```{r}
set.seed(1001)
x1=1:100+rnorm(100,mean=0,sd=15)
y1=1:100
```

44. Make a scatter plot using `x1` and `y1` vectors generated above.

```{r,echo=FALSE,eval=FALSE}
plot(x1,y1)
```


44. Use `main` argument to give a title to `plot()` as in `plot(x,y,main="title")`

```{r,echo=FALSE,eval=FALSE}
plot(x1,y1,main="scatter plot")
```

45. Use `xlab` argument to set a label to x-axis.Use `ylab` argument to set a label to y-axis.
```{r,echo=FALSE,eval=FALSE}
plot(x1,y1,main="scatter plot",xlab="x label")
```


46. See what `mtext(side=3,text="hi there")` does. **HINT:** mtext stands for margin text.

```{r,echo=FALSE,eval=FALSE}
plot(x1,y1,main="scatter plot",xlab="x label",ylab="y label")
mtext(side=3,text="hi there") 
```


47. See what `mtext(side=2,text="hi there")` does.check your plot after execution.

```{r,echo=FALSE,eval=FALSE}
mtext(side=2,text="hi there") 
mtext(side=4,text="hi there") 
```

48. You can use `paste()` as 'text' argument in mtext() try that, you need to re-plot.
your plot first. **HINT:** `mtext(side=3,text=paste(...))`
See what `paste()` is used for.

```{r}
paste("Text","here")
myText=paste("Text","here")
myText
```
Use *mtext()*  and *paste()* to put a margin text on the plot.

```{r,echo=FALSE,eval=FALSE}
mtext(side=3,text=paste("here","here"))
```

44. `cor()` calculates correlation between two vectors.
Pearson correlation is a measure of the linear correlation (dependence) 
between two variables X and Y.

```{r}
corxy=cor(x1,y1) # calculates pearson correlation
```

44. Try use `mtext()`,`cor()` and `paste()` to display correlation coefficient on your scatterplot ?

```{r,echo=FALSE,eval=FALSE}
plot(x1,y1,main="scatter")
corxy=cor(x1,y1) 
#mtext(side=3,text=paste("Pearson Corr.",corxy))
mtext(side=3,text=paste("Pearson Corr.",round(corxy,3) ) )

plot(x1,y1)
mtext(side=3,text=paste("Pearson Corr.",round( cor(x1,y1)  ,3)  ) )
```

44. Change the colors of your plot using `col` argument.
Ex: `plot(x,y,col="red")`

```{r,echo=FALSE,eval=FALSE}
plot(x1,y1,col="red")
```

44. Use `pch=19` as an argument in your `plot()` command.

```{r,echo=FALSE,eval=FALSE}
plot(x1,y1,col="red",pch=19)
```

44. Use `pch=18` as an argument to your `plot()` command.

```{r,echo=FALSE,eval=FALSE}
plot(x1,y1,col="red",pch=18)
?points
```

44. Make histogram of x1 with `hist()` function.Histogram is a graphical representation of the data distribution.

```{r,echo=FALSE,eval=FALSE}
hist(x1)
```


44. You can change colors with 'col', add labels with 'xlab', 'ylab', and add a 'title' with 'main' arguments. Try all these in a histogram.
 
 ```{r,echo=FALSE,eval=FALSE}
hist(x1,main="title")
```

44. Make boxplot of y1 with `boxplot()`.

```{r,echo=FALSE,eval=FALSE}
boxplot(y1,main="title")
```

44. Make boxplots of `x1` and `y1` vectors in the same plot.

```{r,echo=FALSE,eval=FALSE}
boxplot(x1,y1,ylab="values",main="title")
```

44. In boxplot use horizontal = TRUE  argument

```{r,echo=FALSE,eval=FALSE}
boxplot(x1,y1,ylab="values",main="title",horizontal=TRUE)
```

44. make multiple plots with par(mfrow=c(2,1))
    - run par(mfrow=c(2,1)) 
    - make a boxplot 
    - make a histogram

```{r,echo=FALSE,eval=FALSE}
par( mfrow=c(1,2) )
hist(x1)
boxplot(y1)
```

44. Do the same as above but this time with par(mfrow=c(1,2))

```{r,echo=FALSE,eval=FALSE}
par(mfrow=c(2,2))
hist(x1)
boxplot(y1)
```

44. Save your plot using "Export" button in Rstudio


44. Save your plot by running :
```{r,eval=FALSE}
dev.copy(pdf,file="plot.file.pdf");dev.off()
```

44. Save your plot running :
```{r,eval=FALSE}
dev.copy(png,filename="plot.file.png");dev.off()
```

44. Another way to save the plot is the following

  1. Open a graphics device
  2. Create the plot
  3. Close the graphics device

```{r,echo=FALSE,eval=FALSE}
pdf("~/myplot1.pdf", width = 5, height = 5) # 1.
plot(x1, y1) # 2.
dev.off()   # 3.
```

44. **EXTRA: Making color density scatterplot**. You can make a scatter plot showing density
of points rather than points themselves. If you use points it looks like this:

```{r}

x2=1:1000+rnorm(1000,mean=0,sd=200)
y2=1:1000
plot(x2,y2,pch=19,col="blue")
```

If you use smoothScatter() function, you get the densities. 
```{r}
smoothScatter(x2,y2,colramp=colorRampPalette(c("white","blue", "green","yellow","red"))) 
```
 
Now, plot with `colramp=heat.colors` argument and then use a custom color scale using the following argument.
```
colramp = colorRampPalette(c("white","blue", "green","yellow","red")))
```

```{r,echo=FALSE,eval=FALSE}
smoothScatter(x2,y2,colramp = heat.colors )
smoothScatter(x2,y2,
               colramp = colorRampPalette(c("white","blue", "green","yellow","red")))
```

### Functions and control structures (for, if/else etc.)
Read CpG island data as shown below for the rest of the exercises.

```{r,eval=TRUE,eval=TRUE}
cpgtFilePath=system.file("extdata",
                "CpGi.table.hg18.txt",
                package="compGenomRData")
cpgi=read.table(cpgtFilePath,header=TRUE,sep="\t")
head(cpgi)
```

44. Check values at perGc column using a histogram.
'perGc' stands for GC percent => percentage of C+G nucleotides

```{r,echo=FALSE,eval=FALSE}
hist(cpgi$perGc) # most values are between 60 and 70
```

44. Make a boxplot for 'perGc' column

```{r,echo=FALSE,eval=FALSE}
boxplot(cpgi$perGc) 
```

44. Use if/else structure to decide if given GC percent high, low or medium.
If it is low, high, or medium. low < 60, high>75, medium is between 60 and 75
use greater or less than operators <  or  > . 
Fill in the values in the in code below, where it is written 'YOU_FILL_IN'

```{r,echo=TRUE,eval=FALSE}

GCper=65

  # check if GC value is lower than 60, 
  # assign "low" to result
  if('YOU_FILL_IN'){
    result="low"
    cat("low")
  }
  else if('YOU_FILL_IN'){  # check if GC value is higher than 75, assign "high" to result
    result="high"
    cat("high")
  }else{ # if those two conditions fail then it must be "medium"
    result="medium"
  }

result

```

44. Write a function that takes a value of GC percent and decides
if it is low, high, or medium. low < 60, high>75, medium is between 60 and 75.
Fill in the values in the in code below, where it is written 'YOU_FILL_IN'


```
GCclass<-function(my.gc){
  
  YOU_FILL_IN
  
  return(result)
}
GCclass(10) # should return "low"
GCclass(90) # should return "high"
GCclass(65) # should return "medium"
```

```{r,echo=FALSE,eval=FALSE}
GCclass<-function(my.gc){
  
  result="low"# set initial value
  
  if(my.gc < 60){ # check if GC value is lower than 60, assign "low" to result
    result="low"
  }
  else if(my.gc > 75){  # check if GC value is higher than 75, assign "high" to result
    result="high"
  }else{ # if those two conditions fail then it must be "medium"
    result="medium"
  }
  return(result)
}
GCclass(10) # should return "low"
GCclass(90) # should return "high"
GCclass(65) # should return "medium"
```

44. Use a for loop to get GC percentage classes for gcValues below. Use the function
you wrote above.

```
gcValues=c(10,50,70,65,90)
for( i in YOU_FILL_IN){
  YOU_FILL_IN
}
```

```{r,echo=FALSE,eval=FALSE}
gcValues=c(10,50,70,65,90)
for( i in gcValues){
 
  print(GCclass(i) )
}
```

44. Use `lapply` to get to get GC percentage classes for gcValues. Example: 

```{r,eval=FALSE}
vec=c(1,2,4,5)
power2=function(x){ return(x^2)  }
    lapply(vec,power2)
```

```{r,echo=FALSE,eval=FALSE}
s=lapply(gcValues,GCclass)
```


44. Use sapply to get values to get GC percentage classes for gcValues
```{r,echo=FALSE,eval=FALSE}
sapply(gcValues,GCclass)
```

44. Is there a way to decide on the GC percentage class of given vector of GCpercentages
without using if/else structure and loops ? if so, how can you do it?
**HINT:** subsetting using < and > operators

```{r,echo=FALSE,eval=FALSE}
result=rep("low",length(gcValues) )
result[gcValues > 75]="high"
result[gcValues < 75 & gcValues > 60 ] = "medium"
```