Ranbow (Version 2.0): a haplotype assembler for polyploid genomes

Ranbow is a haplotype assembler for polyploid genomes. It has been developed for the haplotype assembly of the hexaploid sweet potato genome, which is highly heterozygous. Ranbow can also be applied to other polyploid genomes. After a first phasing, Ranbow utilizes the assembled haplotypes to improve the accuracy of variant calling results and to infer the evolutionary history of the organism´s genome. Ranbow has three main modes of function:

ranbow hap: for haplotyping ranbow eval: for evaluating of the assemble haplotypes by gold standard (long) reads ranbow phylo: for the phylogenetic analysis

Availability

You can download the source code and the toy example here as well: https://www.molgen.mpg.de/4054787/RanbowHaplotper.zip

You can also see the Ranbow webpage: https://www.molgen.mpg.de/ranbow

Dependencies

The code is implemented in python 2.7.13 and can be run from linux shell. To run Ranbow the following libraries and tools should be available:

python libraries and tools

pysam 0.10.0
numpy 1.12.1
samtools 0.1.19-44428cd.

Data preparation

We generated a random list of scaffolds from sweet potato genome in order to make a data input as a toy example. Then, the fasta, bam and vcf files were extracted. Data files for selected scaffolds are named as ‘sel.bam’, ’sel.vcf’, and ‘sel.fasta’. These files are generated by “toy_data.sh” which is available in toy folder as well.

It worth to mention that Ranbow has the feature of being applied on the whole or selected region of the scaffolds. This feature helps for better parallelization as well as phasing targeted interval in the genome.

Ranbow hap

Ranbow hap has three modes of sub-function, for simplicity we refer to them as different modes, namely index, hap, and collect. The mode of choice is declared by -mode parameter in command line.

python2.7  ranbow.py hap -mode index
python2.7  ranbow.py hap -mode hap
python2.7  ranbow.py hap -mode collect
python2.7  ranbow.py hap -mode modVCF

The “index” mode generates the index files in order to have random access to the chromosomes and scaffolds in case of for parallelization. In “hap” mode, Ranbow construct haplotypes from mapped reads. The “collect” mode collects the haplotypes assembled with different processors and generates the final fasta files, bam files ( aligned haplotypes) and also hap formated file . The bam files are already sorted and indexed and are ready to be loaded in IGV browser for downstream analysis. The hap formatted files include haplotypes information such as name, start position, haplotype sequence, number of error correction, haplotype quality, and supporting fragments. These fields are explained in more details as follows.

Haplotype name: The name includes scaffold/chromosome name, block index and file open index Haplotype sequence: sequence of alleles in which the alleles are coded to numbers in [0, ploidy) interval. For sweet potato they are coded with “0” to “5”. Number of error correction: Each haplotype has a number of supporting fragments. This field represents the number of mismatch among these fragments. Quality of haplotype: The supporting fragment coverage of each position in the haplotype. Length of quality equals to the length of haplotype. Supporting fragments: A full list of supporting fragments.

To run Ranbow, the parameters in the following table need to be adjusted. Some of these parameters have to be passed from the command line (mentioned in following table, third column), while the others could also be passed via parameter file as well. The compulsory parameters are as follows:

Parameter	Compulsory on modes	Command line and/or parameter file	Description
`-mode`	Index, hap, collect, and modVCF	C	This parameter indicates the mode function
`-par`	Index, hap, collect, and modVCF	C	Most of the parameter can be adjusted in this file and be passed by this parameter
`-ploidy`	hap, modVCF	CF	The ploidy of the organism
`-noProcessor`	hap	CF	Number of available processors
`-bamFile`	Index, hap	CF	Sorted bam file of reads
`-refFile`	Index, hap	CF	Fasta file
`-vcfFile`	Index, hap, and modVCF	CF	Vcf file
`-selectedScf`	hap	CF	The haplotyping can be run on a group of scaffolds. This file indicates list of scaffolds and/or regions that are going to
`-outputFolderBase`	Index, hap, collect, modVCF	CF	The result of haplotyping will be collected in the following folder.
`-processorIndex`	hap	C	It is possible to run Ranbow on different cores in one machine and also in different machines. All scaffolds are distributed to different sets according to their sizes. This assignment is done to minimize the maximum running time of all individual processors. Each set is assigned to one processor. `-processorIndex` is an index referring to one processor and should be in the following range: [0, `-noProcessor`)
`-WinLen`	not compulsory, usage in hap mode	C	sliding window length, Default:8, suggested for Illumina: Default, suggested for long and linked read: 5 to speed up the process of haplotyping

*C: command line, F: parameter file

Here is the content of parameter file used in this toy example:

-ploidy 6
-noProcessor    4
-bamFile    path to folder~>RANBOW/toy/sel.bam
-refFile    path to folder~>RANBOW/toy/sel.fasta
-vcfFile    path to folder~>RANBOW/toy/sel.vcf
-selectedScf    path to folder~>RANBOW/toy/scaffolds.list
-outputFolderBase    path to folder~>/RANBOW/toy/result

For simplicity, data files are stored in one folder (named RANBOW/toy) and the all parameters are adjusted in hap.params file. The parameter file is passed to through the -par argument.

python2.7  ranbow.py hap -par RANBOW/toy/hap.params

So here is the list of files in the folder:

RANBOW/toy > ls -lh
total 30M
-rw-r----- 1 moeinzad moeinzad  426 Mar 31 10:33 hap.params
-rw-r----- 1 moeinzad moeinzad  641 Mar 30 09:51 scaffolds.list
-rw-r----- 1 moeinzad moeinzad  29M Mar 30 11:21 sel.bam
-rw-r----- 1 moeinzad moeinzad 133K Mar 30 11:21 sel.fasta
-rw-r----- 1 moeinzad moeinzad 1.1M Mar 30 11:21 sel.vcf
-rw-r----- 1 moeinzad moeinzad  405 Mar 30 11:18 toy_data.sh

Indexing inputs: bam, fasta, and vcf files (mode: index)

It generates index files for bam, vcf, and fasta if not exist:

python2.7  ranbow.py hap -mode index -par RANBOW/toy/hap.params

Then the index files are generated and listed as follows:

RANBOW/toy > ls -lht
total 30M
drwxr-x--- 2 moeinzad moeinzad   10 Mar 31 10:58 result
-rw-r----- 1 moeinzad moeinzad  844 Mar 31 10:58 sel.vcf.index
-rw-r----- 1 moeinzad moeinzad 225K Mar 31 10:58 sel.bam.bai
-rw-r----- 1 moeinzad moeinzad 1.2K Mar 31 10:58 sel.fasta.fai
-rw-r----- 1 moeinzad moeinzad  426 Mar 31 10:33 hap.params
-rw-r----- 1 moeinzad moeinzad  29M Mar 30 11:21 sel.bam
-rw-r----- 1 moeinzad moeinzad 1.1M Mar 30 11:21 sel.vcf
-rw-r----- 1 moeinzad moeinzad 133K Mar 30 11:21 sel.fasta
-rw-r----- 1 moeinzad moeinzad  405 Mar 30 11:18 toy_data.sh
-rw-r----- 1 moeinzad moeinzad  641 Mar 30 09:51 scaffolds.list

Running haplotyper (mode: hap)

The -noProcessor should be adjusted according to the number of available processors. For example if the code is running on a cluster with 200 cores, -noProcessor can be set to 200. Then, 200 independent jobs will be executed with different set of scaffolds. These 200 jobs are independent and can be run on different machines as well. For the sake of simplicity in the toy example, we run the code utilizing 3 cores.

The -processorIndex is a compulsory parameter in command line if the number of processors is more than 1 (-noProcessor > 1). Otherwise -noProcessor is set to 1 and -processorIndex is set to 0 by default , meaning that the output is generated with one processor and the result is going to be generated in a folder named 0. Here the standard output for the first processor (-processorIndex = 0 and -noProcessor = 3) is shown as follows.

py1thon2_7_13 ranbow.py hap -mode hap -par hap.params -processorIndex 0

('scaffold6228|size14266', 1, 14266, 'scaffold6228|size14266')
Run ranbow - single mode
scaffold6228|size14266  readBAM 0:00:04.010655  Ranbow_single   0:00:02.986332  single2file     0:00:00.060936
('scaffold12138|size4302', 1, 4302, 'scaffold12138|size4302')
Run ranbow - single mode
scaffold12138|size4302  readBAM 0:00:02.606135  Ranbow_single   0:00:00.253924  single2file     0:00:00.031988
('scaffold15599|size3549', 1, 3549, 'scaffold15599|size3549')
Run ranbow - single mode
scaffold15599|size3549  readBAM 0:00:00.834673  Ranbow_single   0:00:00.043864  single2file     0:00:00.006803
('scaffold18910|size3076', 1, 3076, 'scaffold18910|size3076')
Run ranbow - single mode
scaffold18910|size3076  readBAM 0:00:02.195878  Ranbow_single   0:00:00.220671  single2file     0:00:00.027786
('scaffold22064|size2651', 1, 2651, 'scaffold22064|size2651')
Run ranbow - single mode
scaffold22064|size2651  readBAM 0:00:01.230129  Ranbow_single   0:00:00.155012  single2file     0:00:00.017257
('scaffold25118|size2247', 1, 2247, 'scaffold25118|size2247')
Run ranbow - single mode
scaffold25118|size2247  readBAM 0:00:03.558237  Ranbow_single   0:00:00.663766  single2file     0:00:00.047705
('scaffold27129|size1948', 1, 1948, 'scaffold27129|size1948')
Run ranbow - single mode
scaffold27129|size1948  readBAM 0:00:01.748883  Ranbow_single   0:00:00.245888  single2file     0:00:00.018874
('scaffold29117|size1608', 1, 1608, 'scaffold29117|size1608')
Run ranbow - single mode
scaffold29117|size1608  readBAM 0:00:00.878690  Ranbow_single   0:00:00.033398  single2file     0:00:00.006976

The running time for each part of haplotyping is reported in the standard output. The haplotyping result will appear in the subfolder of -outputFolderBase named as -processorIndex. For example, since we generated haplotypes for -processorIndex 0 the only generated folder is called 0

RANBOW/toy/result > ls
0

This folder contains six files as follows:

RANBOW/toy/result/0 > ls -lh
total 964K
-rw-rw---- 1 moeinzad moeinzad 470526 Apr  8 15:35 ranbow.single.hap.sam
-rw-rw---- 1 moeinzad moeinzad 324221 Apr  8 15:35 ranbow.single.frag
-rw-rw---- 1 moeinzad moeinzad 234516 Apr  8 15:35 ranbow.single.hap.fasta
-rw-rw---- 1 moeinzad moeinzad  42188 Apr  8 15:35 ranbow.single.hap

The hap format shows the connectivity between alleles in segments, the quality of alleles in assembled haplotype (the number supporting reads for the allele), and in general all information regarding the assembled haplotype in coded-allele space. Coded-allele space means the shared sequence between alleles are ignored and the alleles are decoded to numbers. The fasta and the bam files for the aligned haplotypes are generated in this folder as well.

To run the code in parallel on one machine the following command can be executed:

for i in {0..3}
do
python2.7 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done

The standard outputs for each processor is collected in “0.log”, “1.log”, and “2.log” files.

It is also possible to run Ranbow partly in one machine and partly in the other. Following our toy example, set 0 (-processorIndex = 0) and set 1 are executed in machine A and set 2 is executed in machine B.

Machine A:

for i in {0..1}
do
python2.7 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done

Machine B:

for i in {2..2}
do
python2.7 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done

The three generated folder are listed as follows:

RANBOW/toy/result > ls -lh
total 12K
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:00 0
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 1
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 2

Suggested command for different technologies

(Illumina)

The suggested parameters in the previous section and the default paramters are adjusted for Illumina

(Pacbio, Nanopore) Long read technologies like Pacbio CCS, Pacbio CLR, or Nanopore

We suggest to use the same command suggested in previous section plus 5 for -WinLen to speed up the process

python ranbow.py hap -mode hap  -par params.text -WinLen 5

Collecting generated data from different processors (mode: collect)

When all jobs get finished, -mode collect can be executed to collect the haplotypes from different machines:

python2.7 ranbow.py hap -mode collect -par hap.params
number of folders: 3
[samopen] SAM header is present: 28 sequences.
[samopen] SAM header is present: 28 sequences.

Then the generated files are as follows:

RANBOW/toy/result > ls -lht
total 1.2M

-rw-rw---- 1 moeinzad moeinzad 172K Apr  8 15:30 ranbow.single.hap.bam
-rw-rw---- 1 moeinzad moeinzad 2.3K Apr  8 15:30 ranbow.single.hap.bam.bai
-rw-rw---- 1 moeinzad moeinzad 330K Apr  8 15:30 ranbow.single.hap
-rw-rw---- 1 moeinzad moeinzad 874K Apr  8 15:30 ranbow.single.hap.fasta
-rw-rw---- 1 moeinzad moeinzad 2.2M Apr  8 15:30 ranbow.single.frag
-rw-rw---- 1 moeinzad moeinzad  47K Apr  8 15:30 ranbow.single.hap.igv.sam
-rw-rw---- 1 moeinzad moeinzad 5.9M Apr  8 15:30 ranbow.single.frag.igv.sam
-rw-rw---- 1 moeinzad moeinzad 3.4K Apr  8 15:30 ranbow.single.hap.igv.fa
-rw-rw---- 1 moeinzad moeinzad  36K Apr  8 15:30 ranbow.single.hap.igvtype.sam

For this toy example the generated single.hap.bam files are ready to be loaded for IGV viewer for further analysis.

Revising sequence variants of vcf file (mode: modVCF)

The following command modifies and corrects the variants with the aid of assembled haplotypes.

hap/RANBOW > mypy ranbow.py hap -mode modVCF -par params.txt
modified VCF file:
/hap/RANBOW/toy/sel.mod.vcf
log of modified SNPs:
/hap/RANBOW/toy/result/ranbow.modVCF.log
Modified SNPs: 853     Deleted SNPs: 49

The detailed information of the modification can be found in:

/hap/RANBOW/toy/result/ranbow.modVCF.log

Ranbow eval

To evaluate the haplotyping accuracy, we recruit the Roche 454 trimmed reads mapped to the assembly. This file or other gold standard mapped read files has to be passed with -bamFileEval parameter. The Ranbow eval has also this option of being executed in different machine in parallel. For obtaining the evaluation result the following steps needs to be done.

python2.7 ranbow.py eval -par hap.params -mode index

mode run in parallel :

python2.7 ranbow.py eval -par hap.params -mode run -processorIndex i

mode collect:

python2.7 ranbow.py eval -par hap.params -mode collect

the parameters are:

Parameter	Compulsory on modes	Command line and/or parameter file	Description
`-mode`	Index, run, and collect	C	This parameter indicates the mode function
`-par`	Index, run, and collect	C	Most of the parameter can be adjusted in this file and be passed by this parameter
`-ploidy`	run	CF	The ploidy of the organism
`-noProcessor`	run	CF	Number of available processors
`-bamFileEval`	Index, hap	CF	Sorted bam file of gold standard reads
`-refFile`	Index, hap	CF	Fasta file
`-vcfFile`	Index, hap, and modVCF	CF	Vcf file
`-selectedScf`	hap	CF	The haplotyping can be run on a group of scaffolds. This file indicates list of scaffolds and/or regions that are going to be haplotyped.
`-outputFolderBase`	Index, hap, collect, modVCF	CF	The result of haplotyping will be collected in the following folder.
`-processorIndex`	hap	C	It is possible to run Ranbow on different cores in one machine and also in different machines. All scaffolds are distributed to different sets according to their sizes. This assignment is done to minimize the maximum running time of all individual processors. Each set is assigned to one processor. `-processorIndex` is an index referring to one processor and should be in the following range: [0, `-noProcessor`)

*C: command line, F: parameter file

After running the above commands the following files will be generated:

toy/result > ls -lt eval/
total 28
-rw-r----- 1 moeinzad moeinzad  4561 Apr 16 18:33 result.sing
drwxr-x--- 2 moeinzad moeinzad    66 Apr 16 18:33 0
drwxr-x--- 2 moeinzad moeinzad    66 Apr 16 18:33 1
drwxr-x--- 2 moeinzad moeinzad    66 Apr 16 18:33 2
drwxr-x--- 2 moeinzad moeinzad    66 Apr 16 18:33 3

The result.sing file is the evaluation of haplotypes which are assembled from reads. The two columns indicate the “Match”, “Mismatch” of the overlaps between phased haplotypes and 454 reads. In order to investigate more the overlaps, one can take a look at the file named /*/ranbow.single.eval. The following lines are selected from /0/ranbow.single.eval file as an example. These lines show the similarity and dissimilarity of the overlaps between 454 and assembled haplotypes in one block. The number in parenthesis show the similarity and dissimilarity in the overlap. The arrow indicates which haplotype is assumed to be sampled from the same chromosomes that the 454 read is also sampled. For instance (8, 1) indicate 8 allele matches and 1 allele mismatches between 454 reads and the second assembled haplotype.

454
   17   -------101010100-------
illumina
   10   00000000000000000000011 (5, 4)
   10   1000000101010110001110- (8, 1)     <---
   12   --0001000001010001101-- (7, 2)
   12   --01-110100111010010--- (4, 5)
   11   -1101100011101101010011 (6, 3)
   11   -0010100110001110010--- (4, 5)

Ranbow phylo

To infer the evolutionary event of the organism with the aid of haplotypes, Ranbow phylo can be employed to extract and tune the alignments from all phased regions and report the topology of the evolutionary phylogenetic tree. The Ranbow phylo can be executed in parallel mode as well. Therefore the files need to be indexed first.

python2.7 ranbow.py phylo -par hap.params -mode index

mode run in parallel :

python2.7 ranbow.py phylo -par hap.params -mode run -processorIndex i

mode collect:

python2.7 ranbow.py phylo -par hap.params -mode collect

then then collect mode is developed in order to report the final statistics for different phylogenetic tree topologies.

hap/RANBOW > mypy ranbow.py phylo -par params.txt -mode collect
7 ((((()))))
7 (((()())))
29 (((())()))
13 (((()))())
43 ((()())())
26 ((())(()))

Moreover Ranbow phylo -mode tree calculates the mutation rates in the branch trees reports the relative branch lengths. (only works for hexaploid)

hap/RANBOW > mypy ranbow.py phylo -par params.txt -mode tree
Tree topology:        (((A-B)I (C-D)J ) M (E-F) N)
Branch Name    Mean relative distance    SD relative distance
A:        0.674418604651         0.813288765118
B:        0.674418604651         0.813288765118
I:        0.639534883721         0.721961003548
C:        0.732558139535         1.05295176025
D:        0.732558139535         1.05295176025
J:        0.581395348837         0.604539345654
M:        0.886627906977         1.60414025402
E:        0.779069767442         0.891186337505
F:        0.779069767442         0.891186337505
N:        1.0523255814         1.2325032897

Mutation rate for A-B and C-D branches: 0.00701902094469
Mutation rate for E-F branch: 0.00848135858809
hap/RANBOW >

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
toy		toy
IGV.py		IGV.py
README.md		README.md
eval_collect.sh		eval_collect.sh
hap_collect.sh		hap_collect.sh
haplotyper.py		haplotyper.py
megacc		megacc
methods.py		methods.py
options.txt		options.txt
phylo.py		phylo.py
phylo.sh		phylo.sh
phylo_collect.sh		phylo_collect.sh
ranbow.py		ranbow.py
vcf_correction.py		vcf_correction.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ranbow (Version 2.0): a haplotype assembler for polyploid genomes

Availability

Dependencies

Data preparation

Ranbow hap

Indexing inputs: bam, fasta, and vcf files (mode: index)

Running haplotyper (mode: hap)

Suggested command for different technologies

(Illumina)

(Pacbio, Nanopore) Long read technologies like Pacbio CCS, Pacbio CLR, or Nanopore

Collecting generated data from different processors (mode: collect)

Revising sequence variants of vcf file (mode: modVCF)

Ranbow eval

Ranbow phylo

About

Releases

Packages

Languages

moeinzadeh/Ranbow

Folders and files

Latest commit

History

Repository files navigation

Ranbow (Version 2.0): a haplotype assembler for polyploid genomes

Availability

Dependencies

Data preparation

Ranbow hap

Indexing inputs: bam, fasta, and vcf files (mode: index)

Running haplotyper (mode: hap)

Suggested command for different technologies

(Illumina)

(Pacbio, Nanopore) Long read technologies like Pacbio CCS, Pacbio CLR, or Nanopore

Collecting generated data from different processors (mode: collect)

Revising sequence variants of vcf file (mode: modVCF)

Ranbow eval

Ranbow phylo

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages