-
Notifications
You must be signed in to change notification settings - Fork 1
Home
HapSeq2 is a program for haplotype phasing and genotype calling from next-generation sequencing (NGS) data. This page is to demonstrate a pipeline that can generate phased genotypes (VCF format) from read alignments (BAM format).
Warning: This tutorial assumes you are using Linux operating system.
Tool/Software Info
vcftools: v0.1.9.0
Perl: v5.8.8
hapseq2: use 64-bit precompiled binary packaged and attached to this page.
bam_parser: use precompiled binary packaged and attached to this page.
Overview bam_parser The starting point are three things:
-
A file containing a list of individuals.
-
A directory containing set of BAM files, sorted and indexed, each corresponding to one individual in the list above.
-
A VCF file containing a SNP site list. Currently HapSeq2 ignore non-SNP polymorphic sites. No genotype information is needed.
The output is a VCF file contained the phased genotypes called over the sites.
Tutorial files Go to here to download a toy example, which is a subset of the 1000 Genomes Project Phase 1 data.
To deploy the tutorial:
tar zxvf HapSeq2_pipeline.tgz
This will generate a directory named HapSeq2_pipeline. Go in and get started
cd HapSeq2_pipeline
Run bam_parser to prepare input for HapSeq2
bam_parser is a program that go over a list of BAM files and extract all relevant input files required by HapSeq2.
We have a wrapper run_bam_parser.pl to call bam_parser for each BAM file and put the resulting files into a temporary directory tmp_input. This can take some time.
mkdir tmp_input
perl run_bam_parser.pl CEU.sample.ID.txt BAMs sites.vcf tmp_input
cat tmp_input/*.count.txt > count.txt
cat tmp_input/*.jump1.txt > jump1.txt
cat tmp_input/*.jump2.txt > jump2.txt
cat tmp_input/*.read.txt > read.txt
cat sites.vcf | grep -v '#' | cut -f 2,4,5 > sites.txt
Now count.txt contains the allele count information of reads covering single sites (R1); jump1.txt contains double site allele count information for reads covering any consecutive sites (R2); and read.txt contains long range read-site coverage information (R3). sites.txt file contains sites information. jump2.txt contains similar information as jump1.txt, but currently is unused. For the detailed description of these files, please refer to HapSeq2 Manual here.
Now we are ready to run HapSeq2.
./hapseq2 --readCounts count.txt --polymorphicSites sites.txt --readHap jump1.txt --seqReadFile read.txt --seed 1 --readForwOpt 1 --mcmcHap 2 --mcmcScale 05 -o res-hapseq --seqError 0.01 --phase --geno --quality -r 20
This is the main run and could take some time. Typically this will be run as a cluster job. Be sure to prepare enough memory and CPU time.
The parameters are the standard HapSeq2 option. See here for a complete manual of HapSeq2 program.
The output files are the same as that of Mach/Thunder, with 'res-hapseq' as prefix.
This can be achieved by the following script:
perl mach_to_vcf_phased.pl -m res-hapseq -s sites.txt -i CEU.sample.ID.txt > res-hapseq.vcf
We also included a sub-section from the Omni genotyping results phased by SHAPEIT as part of 1000 Genomes Project phase 1 release.
VCFTools (not included) can be used to compare the results:
vcftools --vcf res-hapseq.vcf --diff Omni.recode.vcf --diff-discordance-matrix --diff-switch-error --out compare
You can check the results by the following:
cat compare.diff.discordance_matrix
wc compare.diff.switch
tar zxvf HapSeq2_pipeline.tgz
cd HapSeq2_pipeline
# Run bam_parser to prepare input for HapSeq2
mkdir tmp_input
perl run_bam_parser.pl CEU.sample.ID.txt BAMs sites.vcf tmp_input
cat tmp_input/*.count.txt > count.txt
cat tmp_input/*.jump1.txt > jump1.txt
cat tmp_input/*.jump2.txt > jump2.txt
cat tmp_input/*.read.txt > read.txt
cat sites.vcf | grep -v '#' | cut -f 2,4,5 > sites.txt
./hapseq2 --readCounts count.txt --polymorphicSites sites.txt --readHap jump1.txt --seqReadFile read.txt --seed 1 -o res-thunder --seqError 0.01 --phase --geno --quality -r 20
./hapseq2 --readCounts count.txt --polymorphicSites sites.txt --readHap jump1.txt --seqReadFile read.txt --seed 1 --readForwOpt 1 --mcmcHap 2 --mcmcScale 05 -o res-hapseq --seqError 0.01 --phase --geno --quality -r 20
perl mach_to_vcf_phased.pl -m res-hapseq -s sites.txt -i CEU.sample.ID.txt > res-hapseq.vcf
vcftools --vcf res-hapseq.vcf --diff Omni.recode.vcf --diff-discordance-matrix --diff-switch-error --out compare
cat compare.diff.discordance_matrix
wc compare.diff.switch