Viral-Track

This is a forked version of Viral-Track [https://github.com/PierreBSC/Viral-Track]. Viral-Track is an R-based computational tool, built on STAR and samtools, that detects and identifies viruses in raw single-cell RNA-sequencing (scRNA-seq) data. It was tested on various scRNA-seq datasets derived from infected mouse and human samples, as described in the paper 'Detecting and studying viral infection at the single-cell resolution using Viral-Track'. It has also been used to detect SARS-CoV-2 in scRNA-seq data here. This fork tracks the changes needed to run it on the Wayne State University High Performance Computing Grid.

Installation

The dependencies listed below are already installed on the grid. Before running Viral-Track, load the corresponding modules:

module load R
module load star
module load samtools
module load stringtie
module load anaconda3.python
conda activate umitools

1 . R 4.0.0 with the required packages, installed from within an R session.

module load R

BiocManager::install(c("Biostrings", "ShortRead","doParallel","GenomicAlignments","Gviz","GenomicFeatures","Rsubread"))

2 . Spliced Transcripts Alignment to a Reference (STAR), available on GitHub.

module load star

3 . Samtools suite is also required, more details here.

module load samtools

4 . The transcript assembler StringTie described here is needed.

module load stringtie

5 . For single-cell demultiplexing we use UMI-tools and the R package Rsubread. See UMI-tools details here.

module load anaconda3.python
conda activate umitools

Creation of the Index and of the annotation file

The first step is to create a STAR index that includes both the host and the virus reference genomes. To do so, first download the ViruSite genome reference database; the host genome must also be downloaded from the Ensembl website. On the grid, the ViruSite data have already been downloaded to /wsu/home/groups/piquelab/data/viralGenomes/genomes.fasta. The host genome should probably be the same as the one used by cellranger: /wsu/home/groups/piquelab/data/refGenome10x/refdata-cellranger-hg19-3.0.0/fasta/genome.fa. Building the index can take some time and requires a large amount of memory and storage space : please check that you have at least 32 GB of RAM and more than 100 GB of available disk space.
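As a rough pre-flight check, the two minimums quoted above can be compared against what the current machine reports. This is only a sketch for Linux: the helper name has_enough is hypothetical, the total RAM of a node is not necessarily what a scheduler grants to a job, and the disk figure refers to the directory the script is run from.

```shell
#!/usr/bin/env bash
# Pre-flight check (Linux only): compare total RAM and free disk space in the
# current directory against the minimums quoted above (32 GB RAM, 100 GB disk).

# Hypothetical helper; usage: has_enough <available_gb> <required_gb>
has_enough() {
  [ "$1" -ge "$2" ]
}

# Total RAM in whole GB, read from /proc/meminfo (value is reported in kB)
ram_gb=$(awk '/^MemTotal:/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)

# Free space in whole GB on the filesystem holding the current directory
disk_gb=$(df -Pk . | awk 'NR == 2 { printf "%d", $4 / 1024 / 1024 }')

has_enough "$ram_gb" 32   || echo "Warning: only ${ram_gb} GB of RAM (32 GB advised)" >&2
has_enough "$disk_gb" 100 || echo "Warning: only ${disk_gb} GB of free disk (100 GB advised)" >&2
```

Run it from the directory where the index will be written, so that the disk figure refers to the right filesystem.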

The STAR index can now be built by typing :

mkdir index
cd index
STAR --runThreadN N --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles /wsu/home/groups/piquelab/data/viralGenomes/genomes.fasta  /wsu/home/groups/piquelab/data/refGenome10x/refdata-cellranger-hg19-3.0.0/fasta/genome.fa

Lastly, you need to create a small file that lists all the viruses included in the index and their genome lengths. An example corresponding to the ViruSite dataset is provided in the repository (Virusite_annotation_file.txt).
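If a custom set of viral genomes is used, the sequence names and lengths can be derived from the FASTA file itself. The sketch below prints one tab-separated line per sequence (identifier, length). The function name fasta_lengths and the two-column layout are assumptions: compare the result against Virusite_annotation_file.txt, which may contain additional columns or a header.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<sequence id><TAB><length>" for each FASTA record.
fasta_lengths() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len   # flush the previous record
      id = substr($1, 2)                # header up to first whitespace, without ">"
      len = 0
      next
    }
    { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence line: count bases
    END { if (id != "") print id "\t" len }      # flush the last record
  ' "$1"
}

# Example (path taken from the text above; output file name is a placeholder):
# fasta_lengths /wsu/home/groups/piquelab/data/viralGenomes/genomes.fasta > my_annotation.txt
```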

Pre-processing of the single-cell data

Before running any analysis, we advise the user to perform the first steps of the demultiplexing on the raw fastq data. This can be done using UMI-tools and is easy for droplet-based techniques such as Drop-seq and 10X. An extensive tutorial can be found here. The user should stop after step 3 (inclusive) in order to get filtered and annotated fastq files with the cell and UMI barcodes in the sequence header. An example of this is also included in the script extractUMIs.sh; adjust the variables accordingly:

bash extractUMIs.sh
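For orientation, the two UMI-tools steps referred to above (whitelist, then extract) could look roughly like the sketch below for a single 10x sample. This is not the content of extractUMIs.sh: the function name, the file naming, the default cell number and the 10x v2 barcode pattern (16 bp cell barcode followed by a 10 bp UMI) are all assumptions to adapt. Older UMI-tools releases may also need --filter-cell-barcode in the extract step.

```shell
#!/usr/bin/env bash
# Sketch of the whitelist + extract steps for one 10x Chromium v2 sample.
# File names, the expected cell number and the barcode pattern are assumptions.
umi_preprocess() {
  local r1=$1 r2=$2 prefix=$3 ncells=${4:-3000}
  local pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN   # 16 bp cell barcode + 10 bp UMI (10x v2)
  local run=${DRY_RUN:+echo}                 # DRY_RUN=1 prints the commands instead of running them

  # Identify the likely real cell barcodes from read 1
  $run umi_tools whitelist --stdin "$r1" --bc-pattern="$pattern" \
       --set-cell-number="$ncells" --log2stderr --stdout "${prefix}_whitelist.txt"

  # Move cell barcode and UMI into the read name, keeping whitelisted cells only
  $run umi_tools extract --stdin "$r1" --bc-pattern="$pattern" \
       --read2-in "$r2" --read2-out "${prefix}_R2_extracted.fastq.gz" \
       --stdout "${prefix}_R1_extracted.fastq.gz" \
       --whitelist "${prefix}_whitelist.txt"
}

# Example: preview the commands without running them
# DRY_RUN=1 umi_preprocess sample_R1.fastq.gz sample_R2.fastq.gz sample 5000
```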

Detection of viruses in scRNA-seq data

We can now start the real analysis. Viral-Track relies on two text files to run : a file containing the values of all parameters (parameter file) and one containing the paths to the sequencing files to analyze (target file). An example of each file is provided in the repository (Parameters.txt and Files_to_process.txt).

The parameter file consists of a list of rows, each giving the name of a variable, followed by an equals sign and then the value of the parameter.

The target file contains the list of file paths. The files can be either .fastq or .fastq.gz files.

Before running any scanning analysis check the parameter file and make sure that you have set the correct values for :

The output directory (Output_directory variable) : If the directory does not exist it will be created.
The path to the STAR index (Index_genome).
The path to the virus annotation file (Viral_annotation_file).
The number of cores to use (N_thread) : please be careful, as STAR and samtools can consume large amounts of memory ! Running Viral-Track with too many threads can trigger massive memory swapping.
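The two input files can be written by hand or generated, as in the sketch below. Only the four variable names come from the list above; the function name, the unquoted value style and every path in the usage example are assumptions. The Parameters.txt shipped in the repository likely contains further settings (for example Name_run), so treat this as a starting point and not as a complete parameter file.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal parameter file and a target file in the current directory.
# Usage: write_viral_track_inputs <output_dir> <star_index> <annotation_file> <threads> <fastq>...
write_viral_track_inputs() {
  local outdir=$1 index=$2 annotation=$3 threads=$4
  shift 4

  # One "name=value" row per parameter
  cat > Parameters.txt <<EOF
Output_directory=${outdir}
Index_genome=${index}
Viral_annotation_file=${annotation}
N_thread=${threads}
EOF

  # One sequencing file path per line (.fastq or .fastq.gz)
  printf '%s\n' "$@" > Files_to_process.txt
}

# Example with placeholder paths:
# write_viral_track_inputs ./results ./index ./Virusite_annotation_file.txt 8 \
#     sample1_R2_extracted.fastq.gz sample2_R2_extracted.fastq.gz
```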

Once this is done you can launch the analysis using :

##Rscript Viral_Track_scanning.R Path/to/Parameter_file.txt Path/to/Target_file.txt
Rscript Viral_Track_scanning.R Parameters.txt Files_to_process.txt

If you want to launch it in the background use instead :

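## R CMD BATCH treats a second positional argument as the output file, so script arguments are passed with --args
R CMD BATCH '--args Path/to/Parameter_file.txt Path/to/Target_file.txt' Viral_Track_scanning.R &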

Once the analysis is over, the results can be checked by looking at the output directory. A PDF called QC_report.pdf is automatically generated and describes the most important results. The first three panels describe the general quality of the mapping (percentage of mapped reads, mean length of the mapped reads, etc.). The next panels describe the quality of the mapping for each individual virus : three scatter plots show the number of uniquely mapped reads, the complexity of the sequences, the percentage of the genome mapped and the length of the longest mapped contig. By default, only viruses with at least 50 uniquely mapped reads, a mean sequence complexity of at least 1.3 and at least 10% of the genome mapped are considered present in the sample. An example QC report (QC_report.pdf) can be found in the repository.
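The default detection rule can also be written as a simple filter, which helps when re-checking results with different cut-offs. The sketch below assumes a hypothetical tab-separated summary with a header line and four columns (virus id, uniquely mapped reads, mean sequence complexity, percentage of genome mapped). The files Viral-Track actually writes may be laid out differently, so the column positions would need adjusting.

```shell
#!/usr/bin/env bash
# Hypothetical input: TSV with a header line and the columns
#   1: virus id   2: uniquely mapped reads   3: mean sequence complexity   4: % genome mapped
# Usage: filter_detected_viruses <summary.tsv> [min_reads] [min_complexity] [min_percent]
filter_detected_viruses() {
  awk -F'\t' -v min_reads="${2:-50}" -v min_cplx="${3:-1.3}" -v min_pct="${4:-10}" '
    NR == 1 { next }                                              # skip the header
    $2 >= min_reads && $3 >= min_cplx && $4 >= min_pct { print $1 }   # all three criteria must hold
  ' "$1"
}

# Example with stricter, user-chosen cut-offs (file name is a placeholder):
# filter_detected_viruses virus_summary.tsv 100 1.3 20
```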

Transcriptome assembly

If desired, the transcriptome of the detected viruses can be assembled. While this works well in some cases (see the example of Influenza virus segment 7), the biased coverage of conventional scRNA-seq techniques makes transcript assembly challenging.

Rscript Viral_Track_transcript_assembly.R  Path/to/Parameter_file.txt Path/to/Target_file.txt

The annotated transcriptome is saved as a GTF file (called Merged_GTF.txt) inside a new directory named according to the 'Name_run' variable stored in the parameter file.

Single-cell demultiplexing

Single-cell demultiplexing is performed using a wrapper around UMI-tools and Rsubread : make sure you have processed the fastq data as described above !

The count table is created by typing :

Rscript Viral_Track_cell_demultiplexing.R  Path/to/Parameter_file.txt Path/to/Target_file.txt

For each sample, a TSV table called Expression_table.tsv is generated : it is a conventional UMI count table that describes viral gene expression in each cell and can therefore be used for downstream analysis and data integration.
