mjsull edited this page Aug 6, 2015 · 7 revisions

Simulated data from three strains of Chlamydia pecorum

This tutorial covers how to generate BAM and VCF files from read data containing three Chlamydia pecorum strains and how to visualize the haplotypes with HapFlow. The tutorial folder contains two FASTQ files of simulated paired-end Illumina reads based on the complete genomes of C. pecorum E58 at 20x coverage, C. pecorum PV3056 at 10x coverage and C. pecorum P787 at 5x coverage, mixed together to represent a mixed infection. The FASTQ files are called "mixed_3strains_R1.fastq" and "mixed_3strains_R2.fastq". Also included in the tutorial folder is a FASTA file of the complete genome of C. pecorum W73, "Cpecorum_W73.fasta", which is used as the reference genome.

  1. The BAM file can be created by aligning the simulated Illumina reads against the reference genome, C. pecorum W73, using BWA and SAMtools.
% bwa index Cpecorum_W73.fasta

% bwa aln Cpecorum_W73.fasta mixed_3strains_R1.fastq > read1.sai

% bwa aln Cpecorum_W73.fasta mixed_3strains_R2.fastq > read2.sai

% bwa sampe Cpecorum_W73.fasta read1.sai read2.sai mixed_3strains_R1.fastq mixed_3strains_R2.fastq > mixed_3strains_bwa.sam

% samtools faidx Cpecorum_W73.fasta

% samtools import Cpecorum_W73.fasta mixed_3strains_bwa.sam mixed_3strains_bwa.bam

% samtools sort mixed_3strains_bwa.bam mixed_3strains_bwa.sorted

% samtools index mixed_3strains_bwa.sorted.bam
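The samtools commands above use the older 0.1.x syntax (`samtools import`, and `samtools sort` with an output prefix). As a hedged sketch, the same steps might look as follows with samtools 1.3 or later, where the SAM file is converted with `samtools view -b` and the sorted file is named with `-o`. The names `run`, `build_bam` and `DRY_RUN` are hypothetical helpers introduced for this example, not part of HapFlow, BWA or SAMtools. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Print each command when DRY_RUN=1 (the default here); run it when DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# build_bam REF R1 R2 PREFIX  (hypothetical helper for this tutorial)
# Stops at the first failing step because the commands are chained with &&.
build_bam() {
    ref=$1 r1=$2 r2=$3 prefix=$4
    run bwa index "$ref" &&
    run bwa aln -f read1.sai "$ref" "$r1" &&
    run bwa aln -f read2.sai "$ref" "$r2" &&
    run bwa sampe -f "$prefix.sam" "$ref" read1.sai read2.sai "$r1" "$r2" &&
    run samtools faidx "$ref" &&
    run samtools view -b -o "$prefix.bam" "$prefix.sam" &&
    run samtools sort -o "$prefix.sorted.bam" "$prefix.bam" &&
    run samtools index "$prefix.sorted.bam"
}

build_bam Cpecorum_W73.fasta mixed_3strains_R1.fastq mixed_3strains_R2.fastq mixed_3strains_bwa
```

The output file names match those used in the rest of the tutorial, so the later steps apply unchanged.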
  1. The VCF file can be created using FreeBayes.
% freebayes -f Cpecorum_W73.fasta -p 3 -F 0.03 mixed_3strains_bwa.sorted.bam > mixed_3strains_bwa.vcf

-p 3 sets the ploidy to 3, matching the three strains in the mixture.

-F 0.03 considers only variants supported by at least a fraction of 0.03 (3%) of the reads at that position.
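As a rough check on the -F threshold, the expected share of reads from each strain follows from the simulated coverages (20x, 10x and 5x, 35x in total). The sketch below assumes uniform coverage and a variant that is unique to one strain; `strain_fractions` is a hypothetical helper name introduced for this example.

```shell
# strain_fractions: hypothetical helper; reads "name coverage" lines on stdin
# and prints each strain's expected share of the total coverage.
strain_fractions() {
    awk '{ name[NR] = $1; cov[NR] = $2; total += $2 }
         END { for (i = 1; i <= NR; i++) printf "%s %.3f\n", name[i], cov[i] / total }'
}

# Coverages taken from the tutorial's simulated data set.
printf '%s\n' 'E58 20' 'PV3056 10' 'P787 5' | strain_fractions
# prints:
#   E58 0.571
#   PV3056 0.286
#   P787 0.143
```

Under those assumptions, a variant unique to the least abundant strain would appear in about 14% of the reads covering it, comfortably above the 0.03 cutoff.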

  1. Launch HapFlow by double-clicking on the executable or by running it from the command line.
% python HapFlow.py
  1. In the top menu bar, select File > Create Flow File, then load "mixed_3strains_bwa.sorted.bam" in the BAM file box and "mixed_3strains_bwa.vcf" in the VCF file box. Save the output as "mixed_3strains_bwa.ftw". This step may take a few minutes.

Figure 1. The Create Flow menu, where the BAM and VCF files are combined to make the flow file.
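Because creating the flow file can take a few minutes, it may be worth confirming beforehand that the inputs from the earlier steps are in place. The sketch below assumes HapFlow needs the BAM index written by `samtools index` (the `.bai` file next to the BAM); `check_flow_inputs` is a hypothetical helper, not a HapFlow command.

```shell
# check_flow_inputs BAM VCF  (hypothetical helper for this tutorial)
# Fails if the BAM, its .bai index, or the VCF is missing or empty,
# then reports how many variant records the VCF contains.
check_flow_inputs() {
    bam=$1 vcf=$2
    for f in "$bam" "$bam.bai" "$vcf"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    # VCF header lines start with '#'; every other line is one variant record.
    n=$(grep -vc '^#' "$vcf" || true)
    echo "$vcf: $n variant records"
}

check_flow_inputs mixed_3strains_bwa.sorted.bam mixed_3strains_bwa.vcf ||
    echo "inputs are not ready for File > Create Flow File"
```

A count of zero variant records would suggest rerunning the FreeBayes step before opening HapFlow.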

  1. Select File > Load Flow File and load "mixed_3strains_bwa.ftw". The following screen will appear in a few seconds. The x-axis can be extended by selecting View > Stretch X or by hitting "D" on the keyboard.

Figure 2. HapFlow diagram of the simulated Illumina read data.

  1. The orange rectangle with vertical lines shows where the variants are located within the displayed section of the genome. These lines extend downward into the area where the flows are drawn, where they are spaced an equal distance apart. Each flow, consisting of one or more reads, is drawn as one or more arrows that overlap every variant line its reads align to. The width of an arrow represents the number of reads in that flow. A solid line joins variants on the same read of a pair.

  2. Click on the arrows to highlight individual flows. Figure 3 shows a section of the HapFlow profile where the three strains are visualised as flows. Figures 3A-C show flows that each contain a different combination of three single nucleotide polymorphisms (SNPs). Figure 3A represents C. pecorum E58, the most dominant strain (20x coverage) in the read data, which is why this flow has the thickest arrows of the three. Figure 3B shows a flow associated with C. pecorum PV3056, the second most prevalent strain (10x coverage). Figure 3C marks the flow for C. pecorum P787, the least prevalent strain (5x coverage). Right-clicking on a flow displays options to retrieve the read names for the flow or to extract its reads as a BAM file.

Figure 3. A section of the HapFlow profile showing three SNPs (the black vertical lines in the orange bar). Panels A-C show three separate haplotypes (black arrows) representing the three C. pecorum strains. A) C. pecorum E58 flow. B) C. pecorum PV3056 flow. C) C. pecorum P787 flow.