Generic viral Illumina sequence analysis pipeline
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with a Singularity container making installation trivial and results highly reproducible.
nf-villumina
will
- remove low quality reads (fastp)
- filter for reads from a taxonomic group of interest (by default superkingdom Viruses (taxid=10239)) using Kraken2 and Centrifuge classification results
- perform de novo assembly with [Unicycler] and [Shovill] on the taxonomic classification filtered reads
- search all contig sequences using NCBI nucleotide BLAST against a database of your choice (we recommend the version 5 NCBI nt DB)
NOTE: You will need to create/download databases for Kraken2, Centrifuge and BLAST in order to get the most out of this workflow!
Taxonomic Classification for Kraken2 and Centrifuge
For taxonomic classification with Kraken2 and Centrifuge, you will need to download (or build) databases for these programs so that you may use them within the nf-villumina
workflow.
You can point to the Kraken2 and Centrifuge database with export KRAKEN2_DB=/path/to/kraken2/database
and export CENTRIFUGE_DB=/path/to/centrifuge/database/prefix
in your ~/.bashrc
so you don't need to specify it each time you run the workflow with --kraken2_db /path/to/kraken2/standard2 --centrifuge_db /path/to/centrifuge/nt-2018-03-03/nt
- MiniKraken2_v2_8GB: (5.5GB) 8GB Kraken 2 Database built from the Refseq bacteria, archaea, and viral libraries and the GRCh38 human genome
- GTDB_r89_54k Kraken2 DBs: There are multiple Kraken2 DBs of various sizes available for download. For more info, see https://github.com/rrwick/Metagenomics-Index-Correction and the manuscript: Méric, Wick et al. (2019) Correcting index databases improves metagenomic studies. doi: https://doi.org/10.1101/712166
- NCBI nucleotide non-redundant sequences (2018-03-03) (64 GB)
- GTDB_r89_54k Centrifuge DB (108 GB tar file): For more info, see https://github.com/rrwick/Metagenomics-Index-Correction and the manuscript: Méric, Wick et al. (2019) Correcting index databases improves metagenomic studies. doi: https://doi.org/10.1101/712166
BLAST DBs
For nf-villumina, you must have a version 5 BLAST DB with embedded taxonomic information installed, e.g. version 5 nt
DB (see https://ftp.ncbi.nlm.nih.gov/blast/db/v5/)
You can download pre-built BLAST DBs like nt
and nr
from the NCBI FTP site using the update_blastdb.pl
script included with your install of BLAST+ to download and/or update your local BLAST databases.
Show all available databases:
$ update_blastdb.pl --showall
Download the BLASTDB version 5 "nt" database to your current directory decompressing files and deleting original compressed archives:
update_blastdb.pl --blastdb_version 5 nt --decompress
NOTE: For ease of use, all databases should be downloaded to the same directory (e.g. /opt/DB/blast
set in $BLASTDB
environment variable in your ~/.bashrc
)
Check that your database has been downloaded properly and has taxids associated with the sequences contained within it:
$ blastdbcheck -db nt -must_have_taxids -verbosity 3
The peterk87/nf-villumina pipeline comes with documentation about the pipeline, found in the docs/
directory:
- Installation
- Pipeline configuration
- Running the pipeline
- Output and how to interpret the results
- Troubleshooting
peterk87/nf-villumina was originally written by Peter Kruczkiewicz.
Bootstrapped with nf-core/tools nf-core create
.
Thank you to the nf-core/tools team for a great tool for bootstrapping creation of a production ready Nextflow workflows.