This workflow ingests public data from NCBI and outputs curated metadata and sequences that can be used as input for the phylogenetic workflow.
If you have another data source or private data that needs to be formatted for the phylogenetic workflow, then you can use a similar workflow to curate your own data.
The workflow can be run from the top level pathogen repo directory:
nextstrain build ingest
Alternatively, the workflow can be run from within the ingest directory:
cd ingest
nextstrain build .
This produces the default outputs of the ingest workflow:
- metadata = results/metadata.tsv
- sequences = results/sequences.fasta
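As a quick sanity check on these outputs, the record counts of the two files can be compared. The following is a minimal sketch, not part of the workflow itself: the function names are hypothetical, and it assumes each metadata record occupies exactly one line after a single header row.

```shell
# Count data rows in a TSV file, skipping the header line.
# Assumes one record per line (no embedded newlines in fields).
count_metadata_rows() {
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Count FASTA records by counting header lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example usage from within the ingest directory:
#   count_metadata_rows results/metadata.tsv
#   count_fasta_records results/sequences.fasta
```

Whether the two counts should be equal depends on the curation steps configured for your pathogen, so treat a mismatch as a prompt to investigate rather than as an error.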
The workflow also has a target for dumping the full raw metadata from NCBI Datasets:
nextstrain build ingest dump_ncbi_dataset_report
This will produce the file ingest/data/ncbi_dataset_report_raw.tsv,
which you can inspect to determine what fields and data to use if you want to
configure the workflow for your pathogen.
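One way to inspect the available fields is to print the header row of the dumped report, one column name per line. This is a minimal sketch; the function name is hypothetical, and it assumes the first line of the file is a tab-separated header.

```shell
# Print the column names of a TSV file, one per line.
# Assumes the first line of the file is the tab-separated header row.
list_tsv_columns() {
    head -n 1 "$1" | tr '\t' '\n'
}

# Example usage from the top level pathogen repo directory:
#   list_tsv_columns ingest/data/ncbi_dataset_report_raw.tsv
```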
The defaults directory contains all of the default configurations for the ingest workflow.
defaults/config.yaml contains all of the default configuration parameters
used for the ingest workflow. Use Snakemake's --configfile/--config
options to override these default values.
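The two override options could be used as sketched below. This assumes that nextstrain build forwards trailing options to Snakemake; the wrapper function names are hypothetical, and any file path or KEY=VALUE pair passed in must correspond to parameters that actually exist in defaults/config.yaml.

```shell
# Run the ingest workflow with an extra config file.
# $1: path to a YAML file whose keys override the matching defaults.
# Assumes `nextstrain build` passes --configfile through to Snakemake.
run_ingest_with_configfile() {
    nextstrain build ingest --configfile "$1"
}

# Run the ingest workflow with a single inline override.
# $1: a KEY=VALUE pair, as accepted by Snakemake's --config option.
run_ingest_with_config_value() {
    nextstrain build ingest --config "$1"
}

# Example usage from the top level pathogen repo directory
# (the file name below is a placeholder):
#   run_ingest_with_configfile my-overrides.yaml
```

A config file is usually easier to review and reuse than inline values, so --config is best kept for one-off experiments.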
The rules directory contains separate Snakefiles (*.smk) as modules of the core ingest workflow.
The modules of the workflow are in separate files to keep the main ingest Snakefile succinct and organized.
The workdir is hardcoded to be the ingest directory so all filepaths for
inputs/outputs should be relative to the ingest directory.
Modules are all included in the main Snakefile in the order that they are expected to run.
The build-configs directory contains custom configs and rules that override and/or extend the default workflow.
- nextstrain-automation - automated internal Nextstrain builds.
This repository uses git subrepo to manage copies of ingest scripts in vendored, from nextstrain/ingest.
See vendored/README.md for instructions on how to update the vendored scripts.