The cDNA and gDNA library processing are independent pipelines located in the folders rna_pipeline
and dna_pipeline
. The gDNA pipeline includes processing for both single-cell/single-nucleus NMT-seq and sciMET, as they share major processes.
- Ensure that
Nextflow
is installed and accessible in your system's PATH. - Place
main.nf
andnextflow.config
in your run folder e.g. a directory for processing data from a specific experiment batch. - To build Singularity image: use the
def
files in thesingularity
folder. - Modify
nextflow.config
: adjust the parameters listed to suit your own need; change the reference genome files to their location; setsingularity.cacheDir
to your own scratch folder with sufficient space; replaceaccessToken
with your personal Nextflow Tower token for workflow monitoring; change the Singularity image (sif file) locations correspondingly to where your containers have been built and stored.
For NMT-seq (use rna_pipeline
and dna_pipeline
for cDNA and gDNA libraries, respectively):
nohup nextflow run main.nf > pipeline.log 2>&1 &
For sciMET (dna_pipeline
):
nextflow run main.nf -entry pipeline_sciMET > pipeline.log 2>&1 &
-
Raw reads QC: generate
FastQC
&MultiQC
report. Setrun_fastqc=false
if you do not need this. -
Trimming: trim off adapters and low-quality bases from the 3' ends by
TrimGalore
, and filter by read length of 36 bp. -
Alignment: generate genome and transcriptome alignment BAM files by
STAR
. -
Filtering: filter aligned reads by
Samtools
: unmapped reads and secondary alignments are filtered out, and only uniquely mapped reads are kept. -
Gene expression quantification:
- for single-cell data, specify
rsem = true
innextflow.config
to use RSEM for quantification on the gene level based on the transcriptome alignment BAM files; this only counts exonic mapped reads, and accounts for bias in the read start position distribution. - for single-nucleus data, specify
featurecount_gene = true
innextflow.config
to usefeatureCounts
for quantification on the gene level.
- for single-cell data, specify
-
To generate a comprehensive summary file:
python rna_pipeline/bin/collect.RNAalign.nf.py
-
Raw reads QC: generate
FastQC
&MultiQC
report. Setrun_fastqc=false
if you do not need this. -
Trimming of scNMT-seq: use
TrimGalore
to perform adapter and base quality trimming, hard-clip 20 bp from the 5' end and 7 bp from the 3' end, and filter by read length of 36 bp. -
Trimming of sciMET: use
TrimGalore
to perform adapter and base quality trimming, and filter by read length. -
Alignment: alignment with
Bismark
that callsBowtie2
in--local
(allowing for soft-clipping)--non_directional
(essential for scNMT-seq PBAT libraries) mode. -
Deduplication:
deduplicate_bismark
is used to remove duplicates. -
Filtering: filter aligned reads by
Samtools
: only uniquely mapped reads are kept; reads with mapping quality (MAPQ) score below 30 are filtered out, and only uniquely mapped reads are kept. -
Downsampling reads from BAM files: randomly sample 10%, 20%, 40%, 60%, and 80% of reads from the Bismark output BAM files by
Samtools
. Afterwards, deduplication, filtering, and computing coverage are performed on the downsampled BAM files. This is for estimating a sequencing saturation curve for each single-cell library. -
Computing coverage: compute genome-wide coverage of individual BAM files using
Samtools
. -
Generating BED files: convert BAM files to BED files that indicate read coverage positions using
BEDTools
. The BED files can be used for copy-number calling on the single-cell level with the software of your choice. Setgenerate_bedfile=false
if this is not needed. -
Context-specific methylation calling:
- Option 1: Enable
bismark_extract=true
to usebismark_methylation_extractor
to assess the methylation status of each cytosine, andcoverage2cytosine
to create a genome-wide cytosine report. The--nome-seq
option is used for NMT-seq contexts. - Option 2: Enable
allc_extract=true
to useMethylpy
for methylation calling to convert BAM files to ALLC format. We include the upstream 1 base to take GpC context into account for NMT-seq. Extraction of different cytosine contexts was then carried out byextract-allc
module ofALLCools
. The single-cell ALLC files can be directly used as input for downsteam analysis.- for scNMT-seq: we use
--mc_contexts WCGN
to extract CpG sites (reflecting DNA methylation), and--mc_contexts GCHN
for GpC sites (reflecting chromatin accessibility due to GpC methyltransferase treatment); - for sciMET: we use
--mc_contexts NCGN
to extract CG sites, and--mc_contexts NCHN
to extract CH sites.
- for scNMT-seq: we use
- Option 1: Enable
-
To generate a comprehensive summary file:
- For NMT-seq:
python dna_pipeline/bin/generate_allc_summary.py --input_dir allc --out summary/allc_summary.tsv python dna_pipeline/bin/generate_mapping_summary.NMT.py
- For sciMET:
python dna_pipeline/bin/count_barcodes_from_fq.sciMET.py --input_dir trim --out summary/trimmed_reads_summary.all_barcodes.tsv & sh dna_pipeline/bin/concat_mapping_summary.sciMET.sh bam_sc summary/uniq_mapped_reads_summary.all_barcodes.tsv & python dna_pipeline/bin/generate_allc_summary.py --input_dir allc --out summary/allc_summary.tsv & wait python dna_pipeline/bin/generate_metadata.allCells.py --input_dir summary --out summary/metadata.allCells.csv
- For NMT-seq:
- Directly plot saturation curve per library (NMT)
- Add troubleshooting: contamination profile
- Add standalone pipeline for sample demultiplexing (after user-directed QC filtering of nuclei)
- Add standalone pipeline for MethSCAn processes and converting mC matrices to more storage-efficient h5 files