A Snakemake workflow for calling and annotating short variants.
The workflow takes paired-end Illumina short-read data (FASTQ files) as input and outputs annotated variant calls in a VCF file.
The input directory contains paired-end Illumina reads from a publicly available SARS-CoV-2 dataset (SRA accession SRR15660643), downsampled to 16000 paired reads (sample.R1.paired.fq.gz and sample.R2.paired.fq.gz).
A FASTA file with the Wuhan-Hu-1 reference genome (GenBank accession MN908947.3) is included in the reference directory (MN908947.3.fasta), along with the VEP cache needed to annotate genomic features.
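Before running the workflow, it can be useful to confirm that the two read files are intact and contain the same number of records. The following is a minimal sketch, not part of the repository: the function names are hypothetical, it assumes standard four-line FASTQ records, and the directory name in the usage comment is an assumption.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_pair(r1_path, r2_path):
    """Return the shared record count, or raise if R1 and R2 disagree."""
    n1 = count_fastq_records(r1_path)
    n2 = count_fastq_records(r2_path)
    if n1 != n2:
        raise ValueError(f"R1 has {n1} records but R2 has {n2}")
    return n1


# Example usage (the "input/" directory name is an assumption):
# print(check_pair("input/sample.R1.paired.fq.gz",
#                  "input/sample.R2.paired.fq.gz"))
```

The printed count can then be compared with the downsampled read number stated above.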
git clone https://github.com/LorenaDerezanin/pipeline_test
Step 1: Install Miniconda
Miniconda is a minimal conda installer. Running the pipeline in an isolated conda environment avoids dependency conflicts and helps ensure reproducibility.
conda install mamba -n base -c conda-forge
Installing Mamba is recommended to speed up environment setup. Mamba is a faster and more robust package manager that downloads packages in parallel and resolves dependencies more quickly than conda. If you continue with conda, replace the mamba command with conda in Step 3.
cd pipeline_test/
mamba env create -n snek -f envs/snek.yml
conda activate snek
snakemake --use-conda --cores 4 --verbose
The suggested --cores value is intended for running the pipeline locally; increase it when running on a cluster.
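The run command can also be assembled programmatically, which makes it easy to adjust the core count or to preview the planned jobs first. The sketch below is an illustration rather than part of the repository: the helper name is hypothetical, capping the requested cores at the local CPU count is an assumption about what is sensible on a workstation, and `--dry-run` is Snakemake's option for listing jobs without executing them.

```python
import os


def snakemake_command(cores=4, dry_run=False):
    """Build the snakemake argument list used in this README."""
    # Assumption: never request more cores than the local machine reports.
    available = os.cpu_count() or 1
    cmd = [
        "snakemake",
        "--use-conda",
        "--cores", str(min(cores, available)),
        "--verbose",
    ]
    if dry_run:
        # List the jobs that would run without executing them.
        cmd.append("--dry-run")
    return cmd


# Example usage, run inside the activated environment:
# import subprocess
# subprocess.run(snakemake_command(dry_run=True), check=True)
```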
If conda fails to install snakemake v.6.15, install snakemake with mamba: mamba install snakemake.
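Once a run has finished, the annotations in the final VCF can be inspected with a few lines of Python. This is a minimal sketch with hypothetical function names. It assumes the VEP annotations are stored in the CSQ INFO field, with the sub-field names listed after "Format: " in the CSQ header line, which is VEP's default VCF output layout. The output path in the usage comment is a placeholder, since the actual file name depends on the workflow configuration.

```python
def csq_field_names(header_line):
    """Extract CSQ sub-field names from VEP's ##INFO=<ID=CSQ,...> header line."""
    fmt = header_line.split("Format: ", 1)[1]
    # Drop the trailing newline, closing quote and closing angle bracket.
    return fmt.strip().rstrip('">').split("|")


def parse_annotations(lines):
    """Yield (chrom, pos, ref, alt, annotations) for each VCF record.

    `annotations` is a list of dicts, one per comma-separated CSQ entry.
    """
    names = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO=<ID=CSQ"):
            names = csq_field_names(line)
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        chrom, pos, ref, alt, info = cols[0], int(cols[1]), cols[3], cols[4], cols[7]
        annotations = []
        for entry in info.split(";"):
            if entry.startswith("CSQ="):
                for csq in entry[len("CSQ="):].split(","):
                    annotations.append(dict(zip(names, csq.split("|"))))
        yield chrom, pos, ref, alt, annotations


# Example usage (placeholder path; use the VCF the workflow actually wrote):
# with open("path/to/annotated.vcf") as handle:
#     for chrom, pos, ref, alt, annotations in parse_annotations(handle):
#         print(chrom, pos, ref, alt, [a.get("Consequence") for a in annotations])
```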
The workflow uses the following bioinformatics tools, run as Snakemake wrappers obtained from The Snakemake Wrappers Repository:
- FastQC
- MultiQC
- trim_galore
- bwa
- samtools
- picard
- freebayes
- bcftools
- vep
To do:
- Docker container + conda/mamba
- AWS/Google cloud deployment
- unit tests