Ranbow is a haplotype assembler for polyploid genomes. It has been developed for the haplotype assembly of the hexaploid sweet potato genome, which is highly heterozygous. Ranbow can also be applied to other polyploid genomes. After a first phasing, Ranbow utilizes the assembled haplotypes to improve the accuracy of variant calling results and to infer the evolutionary history of the organism´s genome. Ranbow has three main modes of function:
ranbow hap
: for haplotyping
ranbow eval
: for evaluating of the assemble haplotypes by gold standard (long) reads
ranbow phylo
: for the phylogenetic analysis
You can download the source code and the toy example here as well: https://www.molgen.mpg.de/4054787/RanbowHaplotper.zip
You can also see the Ranbow webpage: https://www.molgen.mpg.de/ranbow
The code is implemented in python 2.7.13 and can be run from linux shell. To run Ranbow the following libraries and tools should be available:
python libraries and tools
- pysam 0.10.0
- numpy 1.12.1
- samtools 0.1.19-44428cd.
We generated a random list of scaffolds from sweet potato genome in order to make a data input as a toy example. Then, the fasta, bam and vcf files were extracted. Data files for selected scaffolds are named as ‘sel.bam’, ’sel.vcf’, and ‘sel.fasta’. These files are generated by “toy_data.sh” which is available in toy folder as well.
It worth to mention that Ranbow has the feature of being applied on the whole or selected region of the scaffolds. This feature helps for better parallelization as well as phasing targeted interval in the genome.
Ranbow hap has three modes of sub-function, for simplicity we refer to them as different modes, namely index, hap, and collect. The mode of choice is declared by -mode parameter in command line.
python2.7 ranbow.py hap -mode index
python2.7 ranbow.py hap -mode hap
python2.7 ranbow.py hap -mode collect
python2.7 ranbow.py hap -mode modVCF
The “index” mode generates the index files in order to have random access to the chromosomes and scaffolds in case of for parallelization. In “hap” mode, Ranbow construct haplotypes from mapped reads. The “collect” mode collects the haplotypes assembled with different processors and generates the final fasta files, bam files ( aligned haplotypes) and also hap formated file . The bam files are already sorted and indexed and are ready to be loaded in IGV browser for downstream analysis. The hap formatted files include haplotypes information such as name, start position, haplotype sequence, number of error correction, haplotype quality, and supporting fragments. These fields are explained in more details as follows.
Haplotype name: The name includes scaffold/chromosome name, block index and file open index Haplotype sequence: sequence of alleles in which the alleles are coded to numbers in [0, ploidy) interval. For sweet potato they are coded with “0” to “5”. Number of error correction: Each haplotype has a number of supporting fragments. This field represents the number of mismatch among these fragments. Quality of haplotype: The supporting fragment coverage of each position in the haplotype. Length of quality equals to the length of haplotype. Supporting fragments: A full list of supporting fragments.
To run Ranbow, the parameters in the following table need to be adjusted. Some of these parameters have to be passed from the command line (mentioned in following table, third column), while the others could also be passed via parameter file as well. The compulsory parameters are as follows:
Parameter | Compulsory on modes | Command line and/or parameter file | Description |
---|---|---|---|
-mode |
Index, hap, collect, and modVCF | C | This parameter indicates the mode function |
-par |
Index, hap, collect, and modVCF | C | Most of the parameter can be adjusted in this file and be passed by this parameter |
-ploidy |
hap, modVCF | CF | The ploidy of the organism |
-noProcessor |
hap | CF | Number of available processors |
-bamFile |
Index, hap | CF | Sorted bam file of reads |
-refFile |
Index, hap | CF | Fasta file |
-vcfFile |
Index, hap, and modVCF | CF | Vcf file |
-selectedScf |
hap | CF | The haplotyping can be run on a group of scaffolds. This file indicates list of scaffolds and/or regions that are going to |
-outputFolderBase |
Index, hap, collect, modVCF | CF | The result of haplotyping will be collected in the following folder. |
-processorIndex |
hap | C | It is possible to run Ranbow on different cores in one machine and also in different machines. All scaffolds are distributed to different sets according to their sizes. This assignment is done to minimize the maximum running time of all individual processors. Each set is assigned to one processor. -processorIndex is an index referring to one processor and should be in the following range: [0, -noProcessor ) |
-WinLen |
not compulsory, usage in hap mode | C | sliding window length, Default:8, suggested for Illumina: Default, suggested for long and linked read: 5 to speed up the process of haplotyping |
*C: command line, F: parameter file
Here is the content of parameter file used in this toy example:
-ploidy 6
-noProcessor 4
-bamFile path to folder~>RANBOW/toy/sel.bam
-refFile path to folder~>RANBOW/toy/sel.fasta
-vcfFile path to folder~>RANBOW/toy/sel.vcf
-selectedScf path to folder~>RANBOW/toy/scaffolds.list
-outputFolderBase path to folder~>/RANBOW/toy/result
For simplicity, data files are stored in one folder (named RANBOW/toy) and the all parameters are adjusted in hap.params file. The parameter file is passed to through the -par
argument.
python2.7 ranbow.py hap -par RANBOW/toy/hap.params
So here is the list of files in the folder:
RANBOW/toy > ls -lh
total 30M
-rw-r----- 1 moeinzad moeinzad 426 Mar 31 10:33 hap.params
-rw-r----- 1 moeinzad moeinzad 641 Mar 30 09:51 scaffolds.list
-rw-r----- 1 moeinzad moeinzad 29M Mar 30 11:21 sel.bam
-rw-r----- 1 moeinzad moeinzad 133K Mar 30 11:21 sel.fasta
-rw-r----- 1 moeinzad moeinzad 1.1M Mar 30 11:21 sel.vcf
-rw-r----- 1 moeinzad moeinzad 405 Mar 30 11:18 toy_data.sh
It generates index files for bam, vcf, and fasta if not exist:
python2.7 ranbow.py hap -mode index -par RANBOW/toy/hap.params
Then the index files are generated and listed as follows:
RANBOW/toy > ls -lht
total 30M
drwxr-x--- 2 moeinzad moeinzad 10 Mar 31 10:58 result
-rw-r----- 1 moeinzad moeinzad 844 Mar 31 10:58 sel.vcf.index
-rw-r----- 1 moeinzad moeinzad 225K Mar 31 10:58 sel.bam.bai
-rw-r----- 1 moeinzad moeinzad 1.2K Mar 31 10:58 sel.fasta.fai
-rw-r----- 1 moeinzad moeinzad 426 Mar 31 10:33 hap.params
-rw-r----- 1 moeinzad moeinzad 29M Mar 30 11:21 sel.bam
-rw-r----- 1 moeinzad moeinzad 1.1M Mar 30 11:21 sel.vcf
-rw-r----- 1 moeinzad moeinzad 133K Mar 30 11:21 sel.fasta
-rw-r----- 1 moeinzad moeinzad 405 Mar 30 11:18 toy_data.sh
-rw-r----- 1 moeinzad moeinzad 641 Mar 30 09:51 scaffolds.list
The -noProcessor
should be adjusted according to the number of available processors. For example if the code is running on a cluster with 200 cores, -noProcessor
can be set to 200. Then, 200 independent jobs will be executed with different set of scaffolds. These 200 jobs are independent and can be run on different machines as well. For the sake of simplicity in the toy example, we run the code utilizing 3 cores.
The -processorIndex
is a compulsory parameter in command line if the number of processors is more than 1 (-noProcessor
> 1). Otherwise -noProcessor
is set to 1 and -processorIndex
is set to 0 by default , meaning that the output is generated with one processor and the result is going to be generated in a folder named 0. Here the standard output for the first processor (-processorIndex
= 0 and -noProcessor
= 3) is shown as follows.
py1thon2_7_13 ranbow.py hap -mode hap -par hap.params -processorIndex 0
('scaffold6228|size14266', 1, 14266, 'scaffold6228|size14266')
Run ranbow - single mode
scaffold6228|size14266 readBAM 0:00:04.010655 Ranbow_single 0:00:02.986332 single2file 0:00:00.060936
('scaffold12138|size4302', 1, 4302, 'scaffold12138|size4302')
Run ranbow - single mode
scaffold12138|size4302 readBAM 0:00:02.606135 Ranbow_single 0:00:00.253924 single2file 0:00:00.031988
('scaffold15599|size3549', 1, 3549, 'scaffold15599|size3549')
Run ranbow - single mode
scaffold15599|size3549 readBAM 0:00:00.834673 Ranbow_single 0:00:00.043864 single2file 0:00:00.006803
('scaffold18910|size3076', 1, 3076, 'scaffold18910|size3076')
Run ranbow - single mode
scaffold18910|size3076 readBAM 0:00:02.195878 Ranbow_single 0:00:00.220671 single2file 0:00:00.027786
('scaffold22064|size2651', 1, 2651, 'scaffold22064|size2651')
Run ranbow - single mode
scaffold22064|size2651 readBAM 0:00:01.230129 Ranbow_single 0:00:00.155012 single2file 0:00:00.017257
('scaffold25118|size2247', 1, 2247, 'scaffold25118|size2247')
Run ranbow - single mode
scaffold25118|size2247 readBAM 0:00:03.558237 Ranbow_single 0:00:00.663766 single2file 0:00:00.047705
('scaffold27129|size1948', 1, 1948, 'scaffold27129|size1948')
Run ranbow - single mode
scaffold27129|size1948 readBAM 0:00:01.748883 Ranbow_single 0:00:00.245888 single2file 0:00:00.018874
('scaffold29117|size1608', 1, 1608, 'scaffold29117|size1608')
Run ranbow - single mode
scaffold29117|size1608 readBAM 0:00:00.878690 Ranbow_single 0:00:00.033398 single2file 0:00:00.006976
The running time for each part of haplotyping is reported in the standard output. The haplotyping result will appear in the subfolder of -outputFolderBase
named as -processorIndex
. For example, since we generated haplotypes for -processorIndex 0
the only generated folder is called 0
RANBOW/toy/result > ls
0
This folder contains six files as follows:
RANBOW/toy/result/0 > ls -lh
total 964K
-rw-rw---- 1 moeinzad moeinzad 470526 Apr 8 15:35 ranbow.single.hap.sam
-rw-rw---- 1 moeinzad moeinzad 324221 Apr 8 15:35 ranbow.single.frag
-rw-rw---- 1 moeinzad moeinzad 234516 Apr 8 15:35 ranbow.single.hap.fasta
-rw-rw---- 1 moeinzad moeinzad 42188 Apr 8 15:35 ranbow.single.hap
The hap format shows the connectivity between alleles in segments, the quality of alleles in assembled haplotype (the number supporting reads for the allele), and in general all information regarding the assembled haplotype in coded-allele space. Coded-allele space means the shared sequence between alleles are ignored and the alleles are decoded to numbers. The fasta and the bam files for the aligned haplotypes are generated in this folder as well.
To run the code in parallel on one machine the following command can be executed:
for i in {0..3}
do
python2.7 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done
The standard outputs for each processor is collected in “0.log”, “1.log”, and “2.log” files.
It is also possible to run Ranbow partly in one machine and partly in the other. Following our toy example, set 0 (-processorIndex = 0
) and set 1 are executed in machine A and set 2 is executed in machine B.
Machine A:
for i in {0..1}
do
python2.7 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done
Machine B:
for i in {2..2}
do
python2.7 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done
The three generated folder are listed as follows:
RANBOW/toy/result > ls -lh
total 12K
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:00 0
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 1
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 2
The suggested parameters in the previous section and the default paramters are adjusted for Illumina
We suggest to use the same command suggested in previous section plus 5 for -WinLen to speed up the process
python ranbow.py hap -mode hap -par params.text -WinLen 5
When all jobs get finished, -mode collect
can be executed to collect the haplotypes from different machines:
python2.7 ranbow.py hap -mode collect -par hap.params
number of folders: 3
[samopen] SAM header is present: 28 sequences.
[samopen] SAM header is present: 28 sequences.
Then the generated files are as follows:
RANBOW/toy/result > ls -lht
total 1.2M
-rw-rw---- 1 moeinzad moeinzad 172K Apr 8 15:30 ranbow.single.hap.bam
-rw-rw---- 1 moeinzad moeinzad 2.3K Apr 8 15:30 ranbow.single.hap.bam.bai
-rw-rw---- 1 moeinzad moeinzad 330K Apr 8 15:30 ranbow.single.hap
-rw-rw---- 1 moeinzad moeinzad 874K Apr 8 15:30 ranbow.single.hap.fasta
-rw-rw---- 1 moeinzad moeinzad 2.2M Apr 8 15:30 ranbow.single.frag
-rw-rw---- 1 moeinzad moeinzad 47K Apr 8 15:30 ranbow.single.hap.igv.sam
-rw-rw---- 1 moeinzad moeinzad 5.9M Apr 8 15:30 ranbow.single.frag.igv.sam
-rw-rw---- 1 moeinzad moeinzad 3.4K Apr 8 15:30 ranbow.single.hap.igv.fa
-rw-rw---- 1 moeinzad moeinzad 36K Apr 8 15:30 ranbow.single.hap.igvtype.sam
For this toy example the generated single.hap.bam files are ready to be loaded for IGV viewer for further analysis.
The following command modifies and corrects the variants with the aid of assembled haplotypes.
hap/RANBOW > mypy ranbow.py hap -mode modVCF -par params.txt
modified VCF file:
/hap/RANBOW/toy/sel.mod.vcf
log of modified SNPs:
/hap/RANBOW/toy/result/ranbow.modVCF.log
Modified SNPs: 853 Deleted SNPs: 49
The detailed information of the modification can be found in:
/hap/RANBOW/toy/result/ranbow.modVCF.log
To evaluate the haplotyping accuracy, we recruit the Roche 454 trimmed reads mapped to the assembly. This file or other gold standard mapped read files has to be passed with -bamFileEval
parameter. The Ranbow eval has also this option of being executed in different machine in parallel. For obtaining the evaluation result the following steps needs to be done.
python2.7 ranbow.py eval -par hap.params -mode index
mode run in parallel :
python2.7 ranbow.py eval -par hap.params -mode run -processorIndex i
mode collect:
python2.7 ranbow.py eval -par hap.params -mode collect
the parameters are:
Parameter | Compulsory on modes | Command line and/or parameter file | Description |
---|---|---|---|
-mode |
Index, run, and collect | C | This parameter indicates the mode function |
-par |
Index, run, and collect | C | Most of the parameter can be adjusted in this file and be passed by this parameter |
-ploidy |
run | CF | The ploidy of the organism |
-noProcessor |
run | CF | Number of available processors |
-bamFileEval |
Index, hap | CF | Sorted bam file of gold standard reads |
-refFile |
Index, hap | CF | Fasta file |
-vcfFile |
Index, hap, and modVCF | CF | Vcf file |
-selectedScf |
hap | CF | The haplotyping can be run on a group of scaffolds. This file indicates list of scaffolds and/or regions that are going to be haplotyped. |
-outputFolderBase |
Index, hap, collect, modVCF | CF | The result of haplotyping will be collected in the following folder. |
-processorIndex |
hap | C | It is possible to run Ranbow on different cores in one machine and also in different machines. All scaffolds are distributed to different sets according to their sizes. This assignment is done to minimize the maximum running time of all individual processors. Each set is assigned to one processor. -processorIndex is an index referring to one processor and should be in the following range: [0, -noProcessor ) |
*C: command line, F: parameter file
After running the above commands the following files will be generated:
toy/result > ls -lt eval/
total 28
-rw-r----- 1 moeinzad moeinzad 4561 Apr 16 18:33 result.sing
drwxr-x--- 2 moeinzad moeinzad 66 Apr 16 18:33 0
drwxr-x--- 2 moeinzad moeinzad 66 Apr 16 18:33 1
drwxr-x--- 2 moeinzad moeinzad 66 Apr 16 18:33 2
drwxr-x--- 2 moeinzad moeinzad 66 Apr 16 18:33 3
The result.sing file is the evaluation of haplotypes which are assembled from reads. The two columns indicate the “Match”, “Mismatch” of the overlaps between phased haplotypes and 454 reads. In order to investigate more the overlaps, one can take a look at the file named /*/ranbow.single.eval
. The following lines are selected from /0/ranbow.single.eval
file as an example. These lines show the similarity and dissimilarity of the overlaps between 454 and assembled haplotypes in one block. The number in parenthesis show the similarity and dissimilarity in the overlap. The arrow indicates which haplotype is assumed to be sampled from the same chromosomes that the 454 read is also sampled. For instance (8, 1) indicate 8 allele matches and 1 allele mismatches between 454 reads and the second assembled haplotype.
454
17 -------101010100-------
illumina
10 00000000000000000000011 (5, 4)
10 1000000101010110001110- (8, 1) <---
12 --0001000001010001101-- (7, 2)
12 --01-110100111010010--- (4, 5)
11 -1101100011101101010011 (6, 3)
11 -0010100110001110010--- (4, 5)
To infer the evolutionary event of the organism with the aid of haplotypes, Ranbow phylo can be employed to extract and tune the alignments from all phased regions and report the topology of the evolutionary phylogenetic tree. The Ranbow phylo can be executed in parallel mode as well. Therefore the files need to be indexed first.
python2.7 ranbow.py phylo -par hap.params -mode index
mode run in parallel :
python2.7 ranbow.py phylo -par hap.params -mode run -processorIndex i
mode collect:
python2.7 ranbow.py phylo -par hap.params -mode collect
then then collect mode is developed in order to report the final statistics for different phylogenetic tree topologies.
hap/RANBOW > mypy ranbow.py phylo -par params.txt -mode collect
7 ((((()))))
7 (((()())))
29 (((())()))
13 (((()))())
43 ((()())())
26 ((())(()))
Moreover Ranbow phylo -mode
tree calculates the mutation rates in the branch trees reports the relative branch lengths. (only works for hexaploid)
hap/RANBOW > mypy ranbow.py phylo -par params.txt -mode tree
Tree topology: (((A-B)I (C-D)J ) M (E-F) N)
Branch Name Mean relative distance SD relative distance
A: 0.674418604651 0.813288765118
B: 0.674418604651 0.813288765118
I: 0.639534883721 0.721961003548
C: 0.732558139535 1.05295176025
D: 0.732558139535 1.05295176025
J: 0.581395348837 0.604539345654
M: 0.886627906977 1.60414025402
E: 0.779069767442 0.891186337505
F: 0.779069767442 0.891186337505
N: 1.0523255814 1.2325032897
Mutation rate for A-B and C-D branches: 0.00701902094469
Mutation rate for E-F branch: 0.00848135858809
hap/RANBOW >