RNA-Bloom is a fast and memory-efficient de novo transcript sequence assembler. It is designed for the following sequencing data types:
- single-end/paired-end bulk RNA-seq (strand-specific/agnostic)
- paired-end single-cell RNA-seq (strand-specific/agnostic)
- long-read RNA-seq (ONT cDNA/direct RNA, PacBio cDNA)
Written by Ka Ming Nip 📧
©️ 2018-present Canada's Michael Smith Genome Sciences Centre, BC Cancer
- Java SE Development Kit (JDK) 11 (JDK 17 is slightly faster)
- External software used:
software | short reads | long reads |
---|---|---|
minimap2 >=2.22 | required | required |
Racon | not used | required |
ntCard >=1.2.1 | required | required |
The executables of the external software must be accessible from your PATH!
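A quick way to confirm that the external tools in the table above can be found is to probe each one with `command -v`. The following is only a sketch: the helper name `missing_tools` is hypothetical, and the executable names `minimap2`, `racon`, and `ntcard` are assumed to match how the tools were installed on your system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every given executable that is NOT on PATH,
# one per line, and return non-zero if at least one is missing.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Assumed executable names; Racon is only needed for long-read assembly.
if missing=$(missing_tools java minimap2 ntcard racon); then
    echo "all dependencies found"
else
    echo "not found on PATH:" $missing
fi
```

For short-read assembly only, `racon` can be dropped from the argument list.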
RNA-Bloom can be installed in two ways:
conda install -c bioconda rnabloom
mamba install -c bioconda rnabloom
All dependent software (listed above) will be installed. RNA-Bloom can be run as rnabloom ...
- Download the binary tarball rnabloom_vX.X.X.tar.gz from the releases section.
- Extract the downloaded tarball with the command:
tar -zxf rnabloom_vX.X.X.tar.gz
RNA-Bloom can be run as java -jar /path/to/RNA-Bloom.jar ...
ℹ️ Note that -left, -right, -sef, and -ser can accept multiple file paths separated by whitespace.
- paired-end reads only
  - when left reads are sense and right reads are antisense, use -revcomp-right to reverse-complement right reads
  - when left reads are antisense and right reads are sense, use -revcomp-left to reverse-complement left reads
  - for non-stranded data, use either -revcomp-right or -revcomp-left
java -jar RNA-Bloom.jar -left LEFT.fastq -right RIGHT.fastq -revcomp-right -t THREADS -outdir OUTDIR
- single-end reads only
  - use -sef for forward reads and -ser for reverse reads
java -jar RNA-Bloom.jar -sef SE.fastq -t THREADS -outdir OUTDIR
- paired-end and single-end reads
java -jar RNA-Bloom.jar -left LEFT.fastq -right RIGHT.fastq -revcomp-right -sef SE.fastq -t THREADS -outdir OUTDIR
file name | description |
---|---|
rnabloom.transcripts.fa | assembled transcripts longer than length threshold (default: 200) |
rnabloom.transcripts.short.fa | assembled transcripts shorter than length threshold |
rnabloom.transcripts.nr.fa | assembled transcripts with redundancy reduced |
java -jar RNA-Bloom.jar -pool READSLIST.txt -revcomp-right -t THREADS -outdir OUTDIR
Pooled assembly with -pool is especially useful for single-cell datasets. RNA-Bloom was tested on Smart-seq2 and SMARTer datasets. Pooled assembly is not supported for long-read data (-long) at this time.
The reads list is a tabular file that describes the read file paths for all cells/samples to be used in pooled assembly.
- Column header is on the first line, leading with #
- Columns are separated by space/tab characters
- Each sample can have more than one line; lines sharing the same name will be grouped together during assembly
column | description |
---|---|
name | sample name |
left | path to one left read file |
right | path to one right read file |
sef | path to one single-end forward read file |
ser | path to one single-end reverse read file |
Only name, left, and right columns are specified for a total of 3 columns. The legacy header-less tri-column format is still supported.
#name left right
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq
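For datasets with many cells, a three-column list like the one above can be generated rather than typed by hand. The sketch below assumes a hypothetical layout in which every cell has its own directory containing left.fastq and right.fastq; the function name `make_reads_list` is also made up for illustration. Adjust the file names to match your data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a pooled-assembly reads list for every sample
# directory under $1, assuming <root>/<sample>/left.fastq and right.fastq.
make_reads_list() {
    local root=$1 dir name
    printf '#name left right\n'
    for dir in "$root"/*/; do
        dir=${dir%/}        # strip the trailing slash left by the glob
        name=${dir##*/}     # the directory name becomes the sample name
        # Skip directories that do not contain a complete read pair
        if [ -f "$dir/left.fastq" ] && [ -f "$dir/right.fastq" ]; then
            printf '%s %s %s\n' "$name" "$dir/left.fastq" "$dir/right.fastq"
        fi
    done
}

# Usage: make_reads_list /path/to/cells > READSLIST.txt
```

Because columns are separated by space/tab characters, this sketch is only suitable for paths that contain no whitespace.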
In addition to name, left, and right columns, either sef, ser, or both are specified for a total of 4 to 5 columns.
#name left right sef ser
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq /path/to/cell1/sef.fastq /path/to/cell1/ser.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq /path/to/cell2/sef.fastq /path/to/cell2/ser.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq /path/to/cell3/sef.fastq /path/to/cell3/ser.fastq
file name | description |
---|---|
rnabloom.transcripts.fa | assembled transcripts longer than length threshold (default: 200) |
rnabloom.transcripts.short.fa | assembled transcripts shorter than length threshold |
rnabloom.transcripts.nr.fa | assembled transcripts with redundancy reduced |
java -jar RNA-Bloom.jar -stranded ...
The -stranded option indicates that input reads are strand-specific.
Strand-specific reads are typically in the F2R1 orientation, where /2 denotes left reads in forward orientation and /1 denotes right reads in reverse orientation.
Configure the read file paths accordingly for bulk RNA-seq data and indicate read orientation:
-stranded -left /path/to/reads_2.fastq -right /path/to/reads_1.fastq -revcomp-right
and for scRNA-seq data:
cell1 /path/to/cell1/reads_2.fastq /path/to/cell1/reads_1.fastq
java -jar RNA-Bloom.jar -ref TRANSCRIPTS.fasta ...
The -ref option specifies the reference transcriptome FASTA file for guiding short-read assembly. It is not supported for long-read data (-long) at this time.
Some read IDs (e.g. 1, 2, 3) could be in conflict with RNA-Bloom's sequence IDs. Please rename your read IDs (with seqtk rename) if necessary.
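Before deciding whether renaming is needed, it can help to count how many read IDs look like bare numbers. The sketch below rests on two assumptions that the text does not confirm: that the problematic IDs are the purely numeric ones, and that the input is an uncompressed FASTQ file with exactly four lines per record. The function name `count_numeric_ids` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTQ records whose read ID (the first word of
# the header line, without the leading '@') consists only of digits.
# Assumes an uncompressed FASTQ file with four lines per record.
count_numeric_ids() {
    awk 'NR % 4 == 1 {
             id = substr($1, 2)
             if (id ~ /^[0-9]+$/) n++
         }
         END { print n + 0 }' "$1"
}

# Usage:
#   count_numeric_ids reads.fastq
# If the count is non-zero, the reads can be renamed with a prefix, e.g.:
#   seqtk rename reads.fastq read_ > renamed.fastq
```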
ℹ️ Note that -long, -sef, and -ser can accept multiple file paths separated by whitespace.
Default presets for -long are intended for ONT data. Please add the -lrpb flag for PacBio data.
java -jar RNA-Bloom.jar -long LONG.fastq -t THREADS -outdir OUTDIR
Input reads are expected to be in a mix of both forward and reverse orientations.
Options -pool and -ref are not supported for long-read data at this time.
java -jar RNA-Bloom.jar -long LONG.fastq -stranded -t THREADS -outdir OUTDIR
Input reads are expected to be only in the forward orientation.
By default, uracil (U) is written as T. Use the -uracil option to write U instead of T in the output assembly.
ntCard v1.2.1 supports uracil in reads.
cDNA data:
java -jar RNA-Bloom.jar -long LONG.fastq -sef SHORT.fastq -t THREADS -outdir OUTDIR
direct RNA data:
java -jar RNA-Bloom.jar -stranded -long LONG.fastq -sef SHORT_FORWARD.fastq -ser SHORT_REVERSE.fastq -t THREADS -outdir OUTDIR
file name | description |
---|---|
rnabloom.transcripts.fa | assembled transcripts longer than min. length threshold (default: 200) |
rnabloom.transcripts.short.fa | assembled transcripts shorter than min. length threshold |
If ntcard is found in your PATH, then the -ntcard option is automatically turned on to count the number of unique k-mers in your reads.
java -jar RNA-Bloom.jar -fpr 0.01 ...
This sets the size of Bloom filters automatically to accommodate a false positive rate (FPR) of ~1%.
Alternatively, you can specify the exact number of unique k-mers:
java -jar RNA-Bloom.jar -fpr 0.01 -nk 28077715 ...
This sets the size of Bloom filters automatically to accommodate 28,077,715 unique k-mers for a FPR of ~1%.
As a rule of thumb, a lower FPR may result in a better assembly but requires more memory for a larger Bloom filter.
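To get a feel for this trade-off, the textbook formula for a single, optimally configured Bloom filter, bits = -n * ln(p) / (ln 2)^2, can be evaluated for different FPR values. This is only an illustration: RNA-Bloom's own sizing logic is not described here and may differ from this formula, and the function name `bloom_filter_mb` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: textbook size (in MB) of ONE Bloom filter holding
# n items at false positive rate p, assuming an optimal number of hash
# functions. This is NOT taken from RNA-Bloom's implementation.
#   $1 = number of unique k-mers (n)
#   $2 = target false positive rate (p)
bloom_filter_mb() {
    awk -v n="$1" -v p="$2" 'BEGIN {
        bits = -n * log(p) / (log(2) * log(2))
        printf "%.1f\n", bits / 8 / 1e6
    }'
}

# Same k-mer count as in the -nk example, at two different FPRs
bloom_filter_mb 28077715 0.01
bloom_filter_mb 28077715 0.001
```

Under this formula, each tenfold reduction in FPR adds a fixed number of bits per k-mer, so memory grows with the logarithm of 1/FPR rather than linearly.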
java -jar RNA-Bloom.jar -mem 10 ...
This sets the total size of Bloom filters to 10 GB. If none of -nk, -ntcard, or -mem is used, then the total size is configured based on the size of input read files.
java -jar RNA-Bloom.jar -stage N ...
N | short reads | long reads |
---|---|---|
1 | construct graph | construct graph |
2 | assemble fragments | correct reads |
3 | assemble transcripts | assemble transcripts |
This is a very useful option if you only want to assemble fragments or correct long reads (i.e. with -stage 2)!
java -jar RNA-Bloom.jar -help
java -Xmx2g -jar RNA-Bloom.jar ...
or if you installed with conda:
export JAVA_TOOL_OPTIONS="-Xmx2g"
rnabloom ...
This limits the maximum Java heap to 2 GB with the -Xmx option. Note that java options have no effect on Bloom filter sizes.
See documentation for other JVM options.
RNA-Bloom is written in Java with Apache NetBeans IDE. It uses the following libraries:
If you use RNA-Bloom in your work, please cite our manuscript(s).
Ka Ming Nip, Saber Hafezqorani, Kristina K. Gagalova, Readman Chiu, Chen Yang, René L. Warren, and Inanc Birol. Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2. Nature Communications. 2023 May 22;14(1):2940. doi: 10.1038/s41467-023-38553-y
Ka Ming Nip, Readman Chiu, Chen Yang, Justin Chu, Hamid Mohamadi, René L. Warren, and Inanc Birol. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes. Genome Research. 2020 Aug;30(8):1191-1200. doi: 10.1101/gr.260174.119. Epub 2020 Aug 17.