Home

HapSeq2 Tutorial

HapSeq2 is a program for haplotype phasing and genotype calling from next-generation sequencing (NGS) data. This page is to demonstrate a pipeline that can generate phased genotypes (VCF format) from read alignments (BAM format).

Warning: This tutorial assumes you are using Linux operating system.

Download HapSeq2

Tools/Softwares Required

Tool/Software Info

vcftools: v0.1.9.0

Perl: v5.8.8

hapseq2: use 64-bit precompiled binary packaged and attached to this page.

bam_parser: use precompiled binary packaged and attached to this page.

Overview bam_parser The starting point are three things:

A file containing a list of individuals.
A directory containing set of BAM files, sorted and indexed, each corresponding to one individual in the list above.
A VCF file containing a SNP site list. Currently HapSeq2 ignore non-SNP polymorphic sites. No genotype information is needed.

The output is a VCF file contained the phased genotypes called over the sites.

Tutorial files Go to here to download a toy example, which is a subset of the 1000 Genomes Project Phase 1 data.

To deploy the tutorial:

tar zxvf HapSeq2_pipeline.tgz

This will generate a directory named HapSeq2_pipeline. Go in and get started

cd HapSeq2_pipeline

Run bam_parser to prepare input for HapSeq2

bam_parser is a program that go over a list of BAM files and extract all relevant input files required by HapSeq2.

We have a wrapper run_bam_parser.pl to call bam_parser for each BAM file and put the resulting files into a temporary directory tmp_input. This can take some time.

mkdir tmp_input

perl run_bam_parser.pl CEU.sample.ID.txt BAMs sites.vcf tmp_input

cat tmp_input/*.count.txt > count.txt

cat tmp_input/*.jump1.txt > jump1.txt

cat tmp_input/*.jump2.txt > jump2.txt

cat tmp_input/*.read.txt > read.txt

cat sites.vcf | grep -v '#' | cut -f 2,4,5 > sites.txt

Now count.txt contains the allele count information of reads covering single sites (R1); jump1.txt contains double site allele count information for reads covering any consecutive sites (R2); and read.txt contains long range read-site coverage information (R3). sites.txt file contains sites information. jump2.txt contains similar information as jump1.txt, but currently is unused. For the detailed description of these files, please refer to HapSeq2 Manual here.

Now we are ready to run HapSeq2.

Run HapSeq2

./hapseq2 --readCounts count.txt --polymorphicSites sites.txt --readHap jump1.txt --seqReadFile read.txt --seed 1 --readForwOpt 1 --mcmcHap 2 --mcmcScale 05 -o res-hapseq --seqError 0.01 --phase --geno --quality -r 20

This is the main run and could take some time. Typically this will be run as a cluster job. Be sure to prepare enough memory and CPU time.

The parameters are the standard HapSeq2 option. See here for a complete manual of HapSeq2 program.

The output files are the same as that of Mach/Thunder, with 'res-hapseq' as prefix.

Converting MACH/Thunder/HapSeq2 results into VCF format

This can be achieved by the following script:

perl mach_to_vcf_phased.pl -m res-hapseq -s sites.txt -i CEU.sample.ID.txt > res-hapseq.vcf

Comparing results to 1000 Genomes Project Phase 1 release

We also included a sub-section from the Omni genotyping results phased by SHAPEIT as part of 1000 Genomes Project phase 1 release.

VCFTools (not included) can be used to compare the results:

vcftools --vcf res-hapseq.vcf --diff Omni.recode.vcf --diff-discordance-matrix --diff-switch-error --out compare

You can check the results by the following:

cat compare.diff.discordance_matrix

wc compare.diff.switch

All-in-one Script

# Setup toy-project environment