This is a forked version of Viral-Track (https://github.com/PierreBSC/Viral-Track). Viral-Track is an R-based computational tool, built on STAR and samtools, that detects and identifies viruses in raw single-cell RNA-sequencing (scRNA-seq) data. It was tested on various scRNA-seq datasets derived from infected mouse and human samples, as described in the paper 'Detecting and studying viral infection at the single-cell resolution using Viral-Track'. It has also been used to detect SARS-CoV-2 in scRNA-seq data. This fork tracks the changes needed to run it on the Wayne State University High Performance Computing Grid.
The dependencies of Viral-Track are already installed on the grid, but the corresponding modules must be loaded before each run:
module load R
module load star
module load samtools
module load stringtie
module load anaconda3.python
conda activate umitools
1. R 4.0.0 with all the required packages. Load the module, then run the install command from within an R session.
module load R
BiocManager::install(c("Biostrings", "ShortRead","doParallel","GenomicAlignments","Gviz","GenomicFeatures","Rsubread"))
2. Spliced Transcripts Alignment to a Reference (STAR), available on GitHub.
module load star
3. The Samtools suite is also required; see the Samtools documentation for details.
module load samtools
4. The transcript assembler StringTie is needed.
module load stringtie
5. Single-cell demultiplexing uses UMI-tools and the R package Rsubread; see the UMI-tools documentation for details.
module load anaconda3python
conda activate umitools
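Once the modules are loaded, it can be worth confirming that every executable is actually on the PATH. The following is a minimal sketch, not part of Viral-Track: `check_tools` is a hypothetical helper name, and the list of executables (`STAR`, `samtools`, `stringtie`, `umi_tools`, `Rscript`) is an assumption about what the modules above provide.

```shell
# Hypothetical helper: report which of the given executables are on the PATH.
# Returns 0 only if all of them are found.
check_tools() {
  missing=""
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "MISSING: $tool"
      missing="$missing $tool"
    fi
  done
  [ -z "$missing" ]
}

# Executable names are assumptions about what the modules above put on the PATH.
check_tools STAR samtools stringtie umi_tools Rscript \
  || echo "Load the missing modules before running Viral-Track."
```

The check only verifies that the commands resolve; it does not verify versions or the R packages.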
The first step is to create a STAR index that includes both the host and the virus reference genomes.
To do so, first download the ViruSite genome reference database. The host genome also has to be downloaded from the Ensembl website. The ViruSite data has already been downloaded to: /wsu/home/groups/piquelab/data/viralGenomes/genomes.fasta
The Ensembl version we use should probably match the one used by cellranger: /wsu/home/groups/piquelab/data/refGenome10x/refdata-cellranger-hg19-3.0.0/fasta/genome.fa
Building the index can take some time and requires a large amount of memory and storage space: please check that you have at least 32 GB of RAM and more than 100 GB of available disk space.
The STAR index can now be built by typing (replace N with the number of threads to use):
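The storage requirement can be checked before starting the build. This is a minimal sketch that assumes the index will be written under the current directory and reads the 100 GB figure as free disk space; the RAM check is left out because it is platform-specific.

```shell
# Threshold taken from the text above (read here as free disk space, in GB).
required_gb=100

# df -P prints one POSIX-format line per filesystem; with -k, column 4 is
# the available space in kilobytes.
avail_gb=$(df -Pk . | awk 'NR == 2 { print int($4 / 1024 / 1024) }')

if [ "$avail_gb" -gt "$required_gb" ]; then
  echo "OK: ${avail_gb} GB available in $PWD"
else
  echo "WARNING: only ${avail_gb} GB available in $PWD (need more than ${required_gb} GB)"
fi
```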
mkdir index
cd index
STAR --runThreadN N --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles /wsu/home/groups/piquelab/data/viralGenomes/genomes.fasta /wsu/home/groups/piquelab/data/refGenome10x/refdata-cellranger-hg19-3.0.0/fasta/genome.fa
Lastly, you need to create a small file that lists all the viruses included in the index and their genome lengths. An example corresponding to the ViruSite dataset is provided in the GitHub repository.
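The genome lengths can be derived directly from the FASTA file used to build the index. The sketch below runs on a toy FASTA with made-up records and writes a two-column table (sequence ID, length). The column layout that Viral-Track expects is not described here, so treat the output format as an assumption and compare it with the example annotation file in the repository.

```shell
# Toy FASTA standing in for the ViruSite genomes.fasta (hypothetical records).
cat > toy_genomes.fasta <<'EOF'
>virusA example virus A
ACGTACGTAC
GTAC
>virusB example virus B
ACGT
EOF

# For each record, print "<sequence id><TAB><sequence length>".
# The sequence id is the first word of the header line, without the ">".
awk '
  /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (id != "") print id "\t" len }
' toy_genomes.fasta > toy_virus_lengths.tsv

cat toy_virus_lengths.tsv
```

For the real database, replace `toy_genomes.fasta` with the path to the ViruSite FASTA file.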
Before running any analysis, we advise performing the first steps of the demultiplexing on the raw fastq data. This can be done with UMI-tools and is easy for droplet-based techniques such as Drop-seq and 10X. An extensive tutorial is available in the UMI-tools documentation. Stop after step 3 (inclusive) to obtain filtered and annotated fastq files with the cell barcode and the UMI in the sequence header. An example is also included in the script extractUMIs.sh; adjust its variables accordingly:
bash extractUMIs.sh
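After extraction, each read name should end in `_<cell barcode>_<UMI>`, which is how UMI-tools annotates reads. A quick way to confirm this is to pull the last two underscore-separated fields out of the headers. The sketch below uses a toy fastq file with made-up reads; the 16-base barcode and 10-base UMI are assumptions that match 10X v2 chemistry and are not taken from extractUMIs.sh.

```shell
# Toy fastq mimicking the output of the extraction step (hypothetical reads).
cat > toy_extracted.fastq <<'EOF'
@read1_AAACCTGAGAAACCAT_ACGTACGTAC 2:N:0
ACGTACGT
+
IIIIIIII
@read2_AAACCTGAGAAACCGC_TTTTGGGGCC 2:N:0
TTGCAAGT
+
IIIIIIII
EOF

# Header lines are lines 1, 5, 9, ... of a fastq file. Split the read name
# (first word of the header) on "_" and keep the last two fields.
awk 'NR % 4 == 1 {
       n = split($1, parts, "_")
       print parts[n - 1] "\t" parts[n]
     }' toy_extracted.fastq > toy_barcodes.tsv

cat toy_barcodes.tsv
```

If the two columns do not look like a barcode and a UMI, the extraction step most likely did not run as intended.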
We can now start the real analysis. Viral-Track relies on two text files to run: a file containing the values of all parameters (parameter file) and one containing the paths to the sequencing files to analyze (target file). An example of each file is provided in the GitHub repository.
The parameter file consists of a list of rows, each giving the name of a variable followed by an equals sign and the value of the parameter.
The target file contains the list of file paths. The files can be either .fastq or .fastq.gz files.
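A target file can be generated from a directory of sequencing files rather than typed by hand. This is a minimal sketch on a toy directory; the directory and output file names are hypothetical, and writing absolute paths is an assumption made so that the list does not depend on the working directory.

```shell
# Toy directory with two sequencing files and one unrelated file.
mkdir -p toy_fastq
touch toy_fastq/sampleA.fastq.gz toy_fastq/sampleB.fastq toy_fastq/notes.txt

# List only .fastq and .fastq.gz files, one absolute path per line.
find "$PWD/toy_fastq" -type f \( -name '*.fastq' -o -name '*.fastq.gz' \) \
  | sort > Files_to_process_toy.txt

cat Files_to_process_toy.txt
```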
Before running any scanning analysis, check the parameter file and make sure that you have set the correct values for:
- The output directory (Output_directory variable): if the directory does not exist, it will be created.
- The path to the STAR index (Index_genome).
- The path to the virus annotation file (Viral_annotation_file).
- The number of cores to use (N_thread): please be careful, as STAR and samtools can consume a large amount of memory. Running Viral-Track with too many threads can trigger massive memory swapping.
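The four settings listed above can be checked mechanically before launching a run. The sketch below writes a toy parameter file and verifies that each variable is present with a non-empty value. The file name, the placeholder values, and the exact `name=value` spacing are assumptions; compare them with the example parameter file in the repository.

```shell
# Toy parameter file with placeholder values (not real paths).
cat > Parameters_toy.txt <<'EOF'
N_thread=4
Output_directory=/path/to/output
Index_genome=/path/to/index
Viral_annotation_file=/path/to/annotation.txt
EOF

missing=0
for key in Output_directory Index_genome Viral_annotation_file N_thread; do
  # Accept optional spaces around "=" and require a non-empty value.
  if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]" Parameters_toy.txt; then
    echo "missing or empty: ${key}"
    missing=$((missing + 1))
  fi
done
echo "missing parameters: ${missing}"
```

This only checks that the variables are set, not that the paths exist.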
Once this is done, you can launch the analysis using:
# General form: Rscript Viral_Track_scanning.R Path/to/Parameter_file.txt Path/to/Target_file.txt
Rscript Viral_Track_scanning.R Parameters.txt Files_to_process.txt
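If you want to launch it in the background, run Rscript under nohup and redirect its output to a log file (scanning.log is an arbitrary name). R CMD BATCH takes only an input file and an optional output file, so the two paths cannot be passed to it as positional arguments:
nohup Rscript Viral_Track_scanning.R Path/to/Parameter_file.txt Path/to/Target_file.txt > scanning.log 2>&1 &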
Once the analysis is over, the results can be checked by looking at the output directory. A PDF called QC_report.pdf is automatically generated and describes the most important results. The first three panels describe the general quality of the mapping (percentage of mapped reads, mean length of the mapped reads, and so on). The next panels describe the mapping quality for each individual virus: three scatter plots show the number of uniquely mapped reads, the complexity of the sequences, the percentage of the genome mapped and the length of the longest mapped contig. By default, only viruses with at least 50 uniquely mapped reads, a mean sequence complexity of at least 1.3 and at least 10% of the genome mapped are considered present in the sample. An example of a QC PDF file can be found in the GitHub repository.
If wanted, the viral transcriptome of the detected viruses can be assembled. While this works well in some cases (see the example of Influenza virus segment 7), the biased coverage of conventional scRNA-seq techniques makes transcript assembly challenging.
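The default detection rule amounts to three thresholds applied per virus. The sketch below applies them to a toy summary table. The table, its column names and its values are hypothetical and do not reproduce a Viral-Track output file; only the thresholds (50 reads, complexity 1.3, 10% of the genome) come from the text above.

```shell
# Toy per-virus summary (hypothetical columns and values, whitespace-separated).
cat > toy_qc.txt <<'EOF'
virus unique_reads mean_complexity pct_genome_mapped
virusA 120 1.62 35.0
virusB 49 1.80 60.0
virusC 300 1.10 45.0
virusD 75 1.45 8.5
EOF

# Keep viruses that pass all three default thresholds; skip the header line.
awk 'NR > 1 && $2 >= 50 && $3 >= 1.3 && $4 >= 10 { print $1 }' toy_qc.txt > toy_detected.txt

cat toy_detected.txt
```

In this toy table only virusA passes: virusB has too few reads, virusC has low sequence complexity, and virusD covers too little of its genome.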
Rscript Viral_Track_transcript_assembly.R Path/to/Parameter_file.txt Path/to/Target_file.txt
The annotated transcriptome is saved as a GTF file (called Merged_GTF.txt) inside a new directory named according to the 'Name_run' variable stored in the parameter file.
Single-cell demultiplexing is performed using a wrapper around UMI-tools and Rsubread: make sure you have processed the fastq data as described above!
Creating the count table is straightforward; type:
Rscript Viral_Track_cell_demultiplexing.R Path/to/Parameter_file.txt Path/to/Target_file.txt
For each sample, a TSV table called Expression_table.tsv is generated: it is a conventional UMI count table that describes viral gene expression in each cell and can therefore be used for analysis and data integration.
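As a first look at such a table, the total number of viral UMIs per cell can be computed with awk. The sketch below runs on a toy table and assumes one row per viral feature, one column per cell barcode, and a header line that labels every column. The real layout of Expression_table.tsv may differ (for example, the header may lack a label for the first column), so check it before adapting this.

```shell
# Toy UMI table: rows are viral features, columns are cells (hypothetical values).
printf 'feature\tAAAC\tAAAG\tAACT\n'  > toy_expression.tsv
printf 'virusA_seg1\t3\t0\t1\n'      >> toy_expression.tsv
printf 'virusA_seg2\t2\t0\t0\n'      >> toy_expression.tsv

# Sum each cell's column and print "<cell barcode><TAB><total viral UMIs>".
awk -F '\t' '
  NR == 1 { for (i = 2; i <= NF; i++) cell[i] = $i; n = NF; next }
          { for (i = 2; i <= NF; i++) total[i] += $i }
  END     { for (i = 2; i <= n; i++) print cell[i] "\t" (total[i] + 0) }
' toy_expression.tsv > toy_umis_per_cell.tsv

cat toy_umis_per_cell.tsv
```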