DRAGEN v4.4 introduces support for the Illumina Single Cell 3' RNA Prep library based on PIPseq technology.
The DRAGEN scRNA Pipeline can process multiplexed single-cell RNA-Seq data sets from reads to a cell-by-gene expression matrix. The pipeline is compatible with library designs that have one read in a fragment matched to a transcript and the other containing a cell barcode and a unique molecular identifier (UMI).
The pipeline includes the following functions:
Cell-barcode and UMI extraction and error correction for each read
RNA-Seq (splice-aware) alignment and matching to annotated genes
UMI counting per cell and gene to measure gene expression
Sparse matrix output and QC metrics
Feature counting, such as with cell-surface proteins
The functionality and options related to alignment and gene annotation are identical to the RNA-Seq pipeline. For information, see DRAGEN RNA Pipeline. Other RNA-Seq modules, such as gene fusion calling or transcript-level gene expression quantification are not supported for Single-Cell RNA.
Note: Illumina's PIPseq technology utilizes a new and novel way for molecular counting [1]. While file inputs and outputs are similar, the sequences described below as UMIs are used as binning indexes (BIs), and molecules are identified by a combination of BIs and intrinsic molecular identifiers (IMIs). The single-cell RNA pipeline is adapted to process these sequences when PIPseq mode is enabled with --scrna-enable-pipseq-mode=true
. Furthermore, some arguments such as barcode positions, barcode whitelist and trimming are set automatically in PIPseq mode as well. Please see DRAGEN Single-Cell RNA PIPseq Pipeline for more information.
Use a standard DRAGEN RNA reference genome or hashtable for the scRNA Pipeline. For example, build using --ht-build-rna-hashtable=true
. The pipeline also requires a gene annotation file in GTF format, provided with the --annotation
(-a
) option. It is recommended to exclude from the analysis pseudogenes and other entries that may overlap with protein-coding genes and reduce mapping accuracy. This can be done using the --annotation-file-ignore-biotypes argument, which is a comma-separated list of biotypes to exclude (allowed values are protein_coding, pseudogene, lncrna, rrna and shortrna).
The DRAGEN scRNA Pipeline requires both the transcript sequence and the barcode+UMI sequence for each fragment (read) as input. The transcript sequence is aligned to the genome to determine the expressed gene, the barcode+UMI sequence is split into the single-cell barcode to identify the unique cell, and a single-cell UMI for unique molecule quantification. When starting from FASTQ, you can either include the UMI in the read name or provide separate UMI FASTQ files.
Provide the transcriptome reads as a single-end FASTQ file with the Barcode+UMI sequence in the eighth field of the read-name line. Separate sequences using a colon. The following example uses read2 (sample.R2.fastq.gz
) as the transcript read.
In the example, the sequence starting with AGA is the barcode+UMI read, the sequences beginning with GTG and ACG are the i7 and i5 sequences, respectively, and the sequence starting with TGCA is the transcriptome read.
These FASTQ files can be generated by bclConvert and bcl2fastq using the UMI settings to define the single-cell barcode+UMI read. If using bclConvert, enter the barcode/UMI information using the OverrideCycles1
setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).
Note: bclConvert refers to the entire single-cell barcode+UMI sequence as UMI.
Enter the following command line option to use the generated FASTQ files from bclConvert:
The option is also compatible with the --fastq-list
input options and with read input from BAM files.
A single-cell RNA fastq-list file is a CSV file with the following mandatory columns:
Lane
Sequencing lane
RGID
Read group ID
RGSM
Read group sample
RGLB
Read group library
Read1File
Read 1 FASTQ file (usually FASTQ file with reads containing cell-barcodes and UMIs)
Read2File
Read 2 FASTQ file (usually transcriptomic reads)
In this case, DRAGEN needs to accept --umi-source=read1
(or --umi-source=read2
if swapped) command-line option. For example, a fastq-list file with the contents shown below must be used in combination with --umi-source=read1
option:
Alternatively, the FASTQ file with transcriptomic reads can be specified under Read1File
column and the FASTQ file with cell-barcodes and UMIs - under UmiFile
column. For example, a fastq-list file with the contents shown below must be used in combination with --umi-source=umifile
option:
UMI Fastq
FilesAn alternative option is to provide the transcript and barcode+UMI sequences as two separate FASTQ files. One file contains only the transcriptome reads and one contains the corresponding barcode-reads in the same order. This file is similar to how read-pairs are normally handled. If using separate UMI files, the sequencing system run setup and bclConvert are not aware of the UMI and treat it as normal read sequence by default.
Enter the following command line option to use the separate UMI FASTQ files:
To use this method with multiple FASTQ files:
Enter the barcode+UMI FASTQ files as read1 in the fastq-list
file, and then enter the transcriptome read FASTQ files, matching the default fastq_list.csv generated by bclConvert, as read2.
Use the following command:
The scRNA pipeline can process a single biological sample per DRAGEN run. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each. DRAGEN keeps the cells (barcodes and UMIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.
To use the scRNA workflow, use the options --enable-rna=true --enable-single-cell-rna=true
. This section includes information on additional scRNA settings.
By default the scRNA workflow assumes that the overall barcode/UMI sequence is made up of a single-cell barcode (possibly split into multiple blocks) and a single UMI. Enter the following command to define the location of the single-cell barcode and single-cell UMI in the barcode read:
blockPos
describes the offset of the first and last inclusive base of the block and is formatted as <startPos>_<endPos>
. For example, for a library with a 16 bp cell-barcode followed by a 10 bp UMI, use: --scrna-barcode-position 0_15 --scrna-umi-position 16_25
. For a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI, use: --scrna-barcode-position=0_8+21_29+43_51 --scrna-umi-position=52_59
.
By default, the barcode position is assumed to be indicated on the forward strand. To explicitly specify the forward strand, use: --scrna-barcode-position 0_15:+
or --scrna-barcode-position=0_8+21_29+43_51:+
. To explicitly specify the reverse strand, use: --scrna-barcode-position 0_15:-
or --scrna-barcode-position=0_8+21_29+43_51:-
.
UMI position can also be specified for feature reads, which are reads with a sequence tag specific to a feature (eg, cell-surface protein or antibody). When the feature-specific UMIs are located on Read 2, you can use --scrna-feature-barcode-r2umi=0_11
to specify a 12 bp feature UMI at the beginning of each feature read.
You can provide a list of cell barcode sequences to include using the following command:
In the case where the --scrna-barcode-position
parameter is not split into multiple blocks (see Barcode Position section) the file must contain one possible cell barcode sequence per line. Differently, when the barcode position is split into multiple blocks, the file must contain a list composed by multiple sections (one for each block): each section must indicate the possible cell barcode block sequences for the corresponding block. Each section should start with a line with prefix #-
, e.g.:
The input file can also be provided compressed with gzip (*.txt.gz
).
During cell-barcode error correction any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. See Barcode Error Correction for more information. If the barcodes cannot be corrected, they are filtered out.
DRAGEN uses a threshold on the total count of unique UMIs (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample.
--single-cell-number-cells
--- [Optional] Set the expected number of cells. The default is 3000. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.
--single-cell-threshold
--- Specify the method for determining the count threshold value. The available values are fixed
, ratio
, or inflection
.
If using ratio
, DRAGEN estimates the expected number of cells as max(T_e, T_m)
. T_m
is a threshold based on a fraction of the counts seen in most abundant cell-barcodes. T_e
is a threshold based on a fraction of the least abundant expected cell.
If using inflection
, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.
If using fixed
, the count threshold is set to force the expected number of cells (--single-cell-number-cells
option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.
--single-cell-threshold-filterby
--- [Optional] Set the count distribution to consider for cell filtering. Can be either "umi" (default) or "read".
To set a specific, fixed number of cells, rather than use the automatically determined threshold, use the following command:
The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.
The following are additional options you can use to configure the Single-Cell RNA Pipeline settings.
--rna-library-type
--- Set the orientation of transcript reads relative to the genomes. Enter SF
for forward, SR
for reverse, or U
for unstranded. The default is SF
.
--scrna-count-introns
--- Include intronic reads in gene expression estimation. The default is false
.
--qc-enable-depth-metrics
--- Set to false
to disable depth metrics for faster run time. The default is true
.
--bypass-anchor-mapping
--- Set to true
to disable RNA anchor (two-pass) mapping for increased performance. The default is false
.
The following is an example command line to run the DRAGEN Single Cell RNA Pipeline.
Single-cell RNA outputs are found in the standard DRAGEN output location. The files start with a <prefix>
which may correspond to <sample>.
in the case of a single library and <sample>.<libId>.
in the case of multiple libraries. All single-cell RNA output files contain word scRNA
in their names. Note that the term "molecular identifier (MI)" can refer to either UMI counts in a UMI workflow or molecule counts using the BI/IMI approach in PIPseq mode.
The following three files provide information per-cell gene expression level in matrix market (*.mtx
) format:
<prefix>.scRNA.matrix.mtx.gz
Count of unique MIs for each cell/gene pair in sparse matrix format.
<prefix>.scRNA.barcodes.tsv.gz
Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scRNA.features.tsv.gz
Gene name, ID and feature type for each gene and feature in the matrix.
NOTE: the legacy <prefix>.scRNA.genes.tsv.gz
is still generated but will be deprecated in the next release.
The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.scRNA.barcodeSummary.tsv
indicated by values PASS
and FAIL
.
The output includes filtered matrix, barcode and features files which only include the per-cell gene expression for the PASS
cells:
<prefix>.scRNA.filtered.matrix.mtx.gz
Count of unique MIs for each filtered cell/gene pair in sparse matrix format.
<prefix>.scRNA.filtered.barcodes.tsv.gz
Cell-barcode sequence for each filtered cell from the matrix.
<prefix>.scRNA.filtered.features.tsv.gz
Gene name, ID and feature type for each gene and feature in the matrix (identical to <prefix>.scRNA.features.tsv.gz
).
Some users might want to explore the output matrix in a human-readable format. To do so, a possible way would be to load the matrix in a "dense" dataframe in Python (similar methodologies can be used in alternative programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy
in Python).
The matrix can be converted into a "dense" representation through two Python modules: scanpy
and pandas
. This has been tested with Python 3.9.7, scanpy 1.10.3, pandas 2.2.3.
First, it is necessary to install the required libraries:
Within Python, the matrix can be loaded in "dense" representation using the following commands:
The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.
Alignments of the transcript reads are sorted by coordinate and output as a BAM file. Each alignment is annotated with an XB
tag containing the cell-barcode and an RX
tag containing the UMI. The alignments use the original sequences without any errors corrected. Fragments that do not have an associated barcode read, for example fragments trimmed on the input data, do not have XB
and RX
tags.
The <prefix>.scRNA_metrics.csv
file contains per sample scRNA metrics. In scRNA, mapped reads are currently reported by default under R1 metrics, irrespective of whether R1 is the read aligned to the transcriptome or the one containing the cell barcode and UMI.
Invalid barcode read: Overall barcode sequence (cell barcode + UMI) failed basic checks. For example, the barcode read was missing or too short.
Error free cell-barcode: Reads with cell-barcode sequences that were not altered during error correction. For example, if the read was an exact match to the allow list.
Error corrected cell-barcode: Reads with cell-barcode sequences successfully corrected to a valid sequence.
Filtered cell-barcode: Reads with cell-barcode sequences that could not be corrected to a valid sequence. For example, the sequence does not match allow list with at most one mismatch.
Unique exon match: Reads with valid cell-barcode and UMI that match a unique gene.
Unique intron match: Reads do not match exons, but introns of exactly one gene. For example, if using the command --scrna-count-introns=true
.
Ambiguous match: Reads match to multiple genes.
Wrong strand: Reads overlap a gene on the opposite strand defined by library type.
Mitochondrial reads: Reads map to the mitochondrial example, if there is a matching gene.
No gene match: Reads do not match to any gene. Includes intronic reads unless using --scrna-count-introns=true
.
Filtered multimapper: Reads excluded due to multiple alignment positions in the genome.
Feature reads: Reads matching to features, when using feature counting.
Total counted reads: Reads with valid cell-barcode and MIs, matching a unique gene
Reads with error-corrected molecular identifier sequences: Counted reads where the MI was error-corrected to match another similar sequence.
Reads with invalid molecular identifier sequences: Reads that were not counted due to an invalid MI sequence. For example, pure homopolymer reads or reads containing Ns.
Sequencing saturation: Fraction of reads with duplicate MIs. 1 - ( MIs / Reads).
Unique cell-barcodes: Overall number of unique cell-barcode sequences in counted reads only.
Total molecules: Overall number of unique cell-barcode and MI combinations counted.
Molecule count threshold for passing cells: Number of MIs required for a cell-barcode to pass filtering. NOTE: if --single-cell-threshold-filter-by
is set to read
then Read count threshold for passing cells
is reported here instead.
Passing cells: Number of cell-barcodes that passed the filters.
Fraction genic reads in cells: Counted reads assigned to cells that passed the filters.
Fraction reads in putative cells: All counted reads assigned to cells that passed the filters.
Median reads per cells: Total counted reads per cell that passed the filters.
Median molecules per cells: Total counted MIs per cell that passed the filters.
Median genes per cells: Genes with at least one MI per cell that passed the filters.
Total genes detected: Genes with at least one MI in at least one cell that passed the filters.
The <prefix>.scRNA.barcodeSummary.tsv
contains summary statistics for each unique cell-barcode per cell after error correction.
ID: Unique numeric ID for the cell-barcode. The ID corresponds to the line number of that barcode in the MI count matrix (*.mtx) output.
Barcode: The cell-barcode sequence.
TotalReads: Total reads with the cell-barcode sequence. This includes error corrected reads.
GeneReads: Reads (primary read alignments) counted towards a gene.
Molecules: Total number of MIs in counted reads.
Genes: Unique genes detected.
Mitochondrial Reads: Reads mapped to mitochondrial genome.
Filter: The following are the available filter values:
PASS
: Cell-barcode passes the filter.
LOW
: MI count is below threshold.
Cell-barcode sequences from the input reads are error corrected based on their frequency, and optionally through a list of expected cell-barcode sequences. A cell-barcode sequence is corrected into another cell-barcode sequence if they differ only by one base (Hamming distance 1) and:
Either the corrected cell-barcode is at least two times more frequent across all input reads
Or the corrected cell-barcode is on the list of expected cell-barcode sequences, but the original cell-barcode is not
When corrected, all the original cell-barcode reads are assigned to the corrected cell-barcode. The sequence error correction scheme is similar to the directional algorithm described in (Smith, Heger and Sudbery, 2020)¹.
¹Smith, T., Heger, A. and Sudbery, I., 2020. UMI-Tools: Modeling Sequencing Errors In Unique Molecular Identifiers To Improve Quantification Accuracy. [PDF] Cold Spring Harbor Laboratory Press. Available at: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976/> [Accessed 15 October 2020].
To avoid overcounting UMIs based on sequence errors, UMI error correction is performed among all reads with the same cell-barcode mapping to the same gene. UMI sequences that are likely errors of another UMI are not counted.
DRAGEN implements several strategies for demultiplexing data sets that represent mixtures of cells from different individuals, such as cells pooled in one library prep or microfluidic run. Two of these strategies include a genotype-based and genotype-free demultiplexing. In genotype demultiplexing methods, DRAGEN can assign sample identity to cells based on alleles observed in reads in each cell (only SNVs are considered). DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.
To use genotype-based sample demultiplexing, you must provide a VCF file with genotypes for each sample in the data set. To use genotype-free sample demultiplexing, you must provide a VCF file with a set of external samples preferably coming from a population with the same genetic background. The GT
field represents the sample genotypes.
For information on the cell-hashing demultiplexing method, see Cell-Hashing.
You can use the following command line options for scRNA demultiplexing.
One of the two:
--scrna-demux-sample-vcf
— If using genotype-based sample demultiplexing, specify the VCF file that contains the sample genotypes.
--scrna-demux-reference-vcf
— If using genotype-free sample demultiplexing, specify the VCF file that contains the genotypes of a population with a similar genetic background to the samples you are using.
--scrna-demux-detect-doublets
— Enable the doublet detection in genotype-based sample demultiplexing. The default value is false
.
--scrna-demux-number-samples
— The number of samples you are using. This option is only applicable when using an external VCF reference specified with the scrna-demux-reference-vcf
option, for genotype-free sample demultiplexing.
The following is an example command line to run the DRAGEN Single-Cell RNA Pipeline with genotype-based demultiplexing.
The following is an example command line to run the DRAGEN Single-Cell RNA Pipeline with genotype-free demultiplexing.
You can find information related to the output of genotype-based scRNA sample demultiplexing in the following three files.
The <prefix>.scRNA.barcodeSummary.tsv
file contains per-cell metrics, including cell barcodes. The following columns contain information on demultiplexing per-cell. See Outputs for more information on <prefix>.scRNA.barcodeSummary.tsv
metrics.
SampleIdentity
The SampleIdentity
column can contain the following values::
sampleX
—The particular cell (barcode) is uniquely assigned to a sample.
AMB(sampleX,sampleY)
—The algorithm cannot determine the sample to assign the barcode to.
MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)
—The cell barcode is classified as doublet. For example, MIX(50*sampleX+50*sampleY)
.
IdentityQscore
The IdentityQscore column contains the value used to estimate the confidence of the sample identity call. After DRAGEN determines the doublet status of the cell as singlet
, ambiguous
, or doublet
, the identity Q-score is defined as -10 * log10(Probability that the assigned identity is correct, given the second most likely identity and the doublet status). The higher values of identity Q-score correspond to more confident sample identity calls.
The <prefix>.scRNA.demux.tsv
file contains sample demultiplexing statistics that were used to infer sample identity of each cell.
Barcode
The cell barcode associated with the cell.
DemuxSNPCount
The number of SNPs that the reads of the cell barcode intersect.
DemuxReadCount
Number of MIs of the cell barcode that intersect at least one SNP.
Pure samples
Samples from the VCF file.
BestMixtureIdentity
Mixture sample with the highest log likelihood. Only available if --single-cell-demux-detect-doublets=true
.
BestMixtureLogLikelihood
The log likelihood of the best mixture sample. Only available if --single-cell-demux-detect-doublets=true
.
The <prefix>.scRNA.demuxSamples_metrics.csv
file contains per-cell metrics, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv
.
Passing cells
The number of cell barcodes that passed.
Fraction genic reads in cells
Counted reads assigned to the cells that passed.
Median reads per cell
Total counted reads per cell that passed the filters.
Median molecules per cell
Total counted MIs per cell that passed the filters.
Median genes per cell
Genes with at least one MI per cell that passed the filters.
In addition to cell demultiplexing based on sample or population genotypes, there is an additional option to calculate read likelihoods for all scRNA reads conditioned on a given list of SNVs. The SNVs are specified by providing a somatic VCF (typically from a DNA WES tumor-normal somatic VC output). Note that the input VCF is "trusted" completely (ignoring variant call quality). The read-likelihoods are combined at the cell level and used to classify cells as tumor or non-tumor.
--scrna-demux-tumor-normal-vcf
— A tumor-normal somatic variant call VCF that has tumor and normal sample columns. It is recommended for the VCF to have more than ~100 confident SNVs but that threshold may vary based on the coverage of the scRNA data. Having too few somatic variants runs the risk of not having enough scRNA reads intersecting the SNVs, leading to most cells being unclassified due to insufficient information.
--scrna-demux-tn-threshold
— The log-likelihood ratio threshold for classifying a cell as tumor or non-tumor. The default value is set to 2.0, but we recommend user to test control samples (e.g. high and low tumor-fraction samples) to find an optimal threshold tailored to their experimental design and sample type.
Example DRAGEN Single Cell RNA command line using genotype-based demultiplexing:
The output for tumor/non-tumor scRNA cell demultiplexing can be found in the following three files.
The <prefix>.scRNA.barcodeSummary.tsv
file contains per-cell metrics, including cell barcodes. The following columns contain information on demultiplexing per-cell. See Outputs for more information on <prefix>.scRNA.barcodeSummary.tsv
metrics.
SampleIdentity
The SampleIdentity
column can contain the following values::
SNG:TUMOR' or 'SNG:NORMAL
—The particular cell (barcode) is uniquely classified as a tumor or non-tumor cell, respectively.
AMB(NORMAL,TUMOR)
—The algorithm has found variant-supporting reads, but the likelihoods of the cell being tumor or non-tumor are the same, and therefore an ambiguous call is made.
NA
—The cell barcode has no variant-supporting reads, so there is not enough information to make a classification.
IdentityQscore
The IdentityQscore column contains the value used to estimate the confidence of the sample identity call. After DRAGEN determines the identity of the cell, the identity Q-score is defined as -10 * log10(probability of the most likely identity / (probability of the most likely identity + probability of the secound most likely identity)). Higher values of identity Q-score correspond to more confident sample identity calls.
The <prefix>.scRNA.demux.tsv
file contains demultiplexing statistics that are used to infer the identity of each cell. Only cells in which variant-supporting reads are found are recorded in this file.
Barcode
The cell barcode associated with the cell.
DemuxSNPCount
The number of variants that the reads of the cell barcode intersect.
DemuxReadCount
The number of MIs of the cell barcode that intersect at least one variant.
NORMAL
The log-likelihood of the cell being non-tumor
TUMOR
The log-likelihood of a cell being tumor.
The <prefix>.scRNA.demuxSamples_metrics.csv
file contains per-cell metrics, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv
. Ambiguous calls, i.e. 'AMB(NORMAL,TUMOR)', are excluded from the calculation.
Passing cells
The number of cell barcodes that are PASS.
Fraction genic reads in cells
Fraction of gene-coding reads assigned to PASS cells.
Median reads per cell
Median count of reads per cell that passed the filters.
Median molecules per cell
Median count of MIs per cell that passed the filters.
Median genes per cell
Median count of genes with at least one MI per cell that passed the filters.
DRAGEN implements several strategies for demultiplexing of data sets that represent mixtures of cells from different individuals, such as cells from different individuals pooled in one library prep or microfluidic run. One of these methods is a sample oligo-tag based method, referred to as cell-hashing.
To use cell-hashing, you must provide a cell-hashing CSV or FASTA reference file. In CSV format, the feature barcode reference file uses the following header: id,name,read,position,sequence,feature_type
.
id
— Identifier of the feature. For example, ADT_A1018.
name
— Name of the feature. For example, ADT_Hu.HLA.DR.DP.DQ_A1018.
read
— Read 1 (R1) or Read 2 (R2).
position
— Position on the specified read, including starting position and the length of the feature barcode. For example, a position of 0_15 represents a feature barcode that starts at position 0 and has a length of 15.
sequence
— DNA sequence of the feature barcode. For example, CAGCCCGATTAAGGT.
feature_type
— Type of the feature. For example, Antibody Capture.
To enable cell-hashing sample demultiplexing, specify the following command line options.
--scrna-cell-hashing-reference
--- Specify a CSV or FASTA cell-hashing reference file that contains sample-specific oligo-tags.
--scrna-demux-detect-doublets
--- Enable doublet detection in cell-hashing sample demultiplexing. The default value is false
.
--scrna-demux-sample-fastq
--- Output sample-specific FASTQ files. See Sample-Specific FASTQ Output Files for more information.
The <prefix>.scRNA.barcodeSummary.tsv
file contains per-cell metrics, including cell barcodes. The following column in the <prefix>.scRNA.barcodeSummary.tsv
contains cell-hashing per-cell information. For more information on the <prefix>.scRNA.barcodeSummary.tsv
file.
SampleIdentity
The SampleIdentity
column can contain the following values::
sampleX
—The particular cell (barcode) is uniquely assigned to a sample.
AMB(sampleX,sampleY)
—The algorithm cannot determine the sample to assign the barcode to.
MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)
—The cell barcode is classified as doublet. For example, MIX(50*sampleX+50*sampleY)
.
The <prefix>.scRNA.demux.tsv
file contains sample demultiplexing statistics that were used to infer sample identity of each cell.
Barcode
The cell barcode associated with the cell.
Pure samples
Cell-hashing read count for each sample.
If you have enabled either of the sample demultiplexing algorithms, you can output sample-specific FASTQ files after the sample identities for each cell is available using the command line option --scrna-demux-sample-fastq
.
If gzip
is specified, then the sample-specific output FASTQ files are compressed in gzip format. If fastq
is specified, then the sample-specific FASTQ files are not compressed. The default option is none
, which indicates that no sample-specific FASTQ files are produced.
Feature counting is a technique to profile the expression of proteins (e.g., cell surface proteins or antibodies). Feature reads differ from RNA reads in that their read R2 is not a transcriptomic sequence, but rather an oligo tag corresponding to a particular type of protein.
To enable feature counting, specify the following command-line options:
--scrna-feature-barcode-reference
— Specify a CSV or FASTA feature reference file that contains feature barcodes. The CSV file format is similar to the format used for Cell-Hashing.
--scrna-feature-barcode-groups
— If feature reads are specified as separate FASTQ files, specify a comma-separated list of read groups that correspond to feature FASTQ files.
The output of feature counting is appended to the output expression matrix. The extended expression matrix contains additional rows corresponding to the features.
Fontanez et al., 2024. Intrinsic molecular identifiers enable robust molecular counting in single-cell sequencing. bioRxiv 2024.10.04.616561.
The DRAGEN PIPseq scRNA Pipeline is designed to process data obtained with Illumina’s proprietary Illumina Single Cell 3' RNA Prep library based on PIPseq technology [1]. The command line option --scrna-enable-pipseq-mode=true
runs the scRNA pipeline with methods that are optimized to process the unique structure of the Illumina Single Cell 3' RNA Prep library. All scRNA pipelines take in the same input files and produce the same output files.
See for more information on the library prep.
The purpose of this step is to:
Assign barcodes to reads
Remove unwanted technical sequences from the data before mapping
Assign binning indices (BIs) and intrinsic molecular identifiers (IMIs) to reads
The figure below describes the fragmentation process taking place in PIPseq chemistry. Each read in R1 includes a barcode sequence, which identifies the individual PIP, followed by a 3-base binning index (BI) sequence. R2 includes the sequenced cDNA constructs created from the captured mRNA, which contain random cut sites that serve as intrinsic molecular identifiers (IMIs), and are used for molecular counting. Up to fifteen different cut sites can be created from a single captured molecule using five cycles of whole transcriptome amplification. The genome alignment position, which will be different for each cut site, serves as the IMI for each read.
The PIPseq platform uses a tiered barcode system, where every one of four tiers includes one of a specified list of possible barcodes. Barcode matching can therefore be done highly efficiently by matching each tier in isolation. The pipeline allows a Hamming distance of 1 per tier, i.e., the bases in the R1 FASTQ corresponding to each tier’s position can differ from a barcode by one base and still be matched to that barcode. The pipeline identifies the barcode combination and the binning index for each read. The barcode positions, BI position and the expected barcode list are set automatically with the --scrna-enable-pipseq-mode
argument, so they do not need to be specified by the user.
Custom barcode positions, BI positions and expected barcode list can be set with the following arguments:
The R2 file, which includes the cDNA construct, also includes technical sequences related to the PIPseq chemistry, namely the template switch oligo (TSO) and poly-A sequences. The prevalence of those sequences is inversely correlated with the fragment size. Because a read is less likely to be mapped accurately if it contains extraneous sequences, TSO sequences are removed from the 5’ end and poly-A sequences are removed from the 3’ end in the R2 FASTQ file. In addition, the first base, a constant T, is removed from the 5’ end of R2. These are done using the DRAGEN trimmers, with the settings automatically included under the --scrna-enable-pipseq-mode
flag. Stats on the number of reads and bases trimmed can be found in the <prefix>.trimmer_metrics.csv
file.
In PIPseq mode, the pipeline automatically ignores pseudogene, shortRNA and rRNA biotypes during mapping. to override this behavior, use --annotation-file-ignore-biotypes=none.
PIPSeq chemistry generates fragments with unique starting positions, which serve as intrinsic identifiers that can be used for molecular counting. Reads are grouped together based on the cell barcode and the assigned gene. Within each barcode and gene combination, IMIs are grouped in one of 64 bins, based on the 3-base binning index. For each bin, all identical IMIs are collapsed into a single count, since they are likely PCR duplicates of the same fragment generated during library prep.
At this point, any barcode and gene combination which has ten or fewer unique binning indexes is assigned the number of unique binning indexes as its final count estimate. The pipeline then totals the number of IMIs associated with each remaining barcode and gene combination, and divides that number by a correction factor estimated from the data, which accounts for the additional copies generated from a single captured molecule during five amplification cycles. The final count estimate is the maximum between the floor of this value and the number of unique binning indexes for this barcode and gene.
Estimating the correction factor is described in detail in the next section. The example below shows molecular counting with a correction factor of 1.35. The barcode and gene combination in this example has 12 unique binning indexes, so the correction factor is applied to produce the final count. The correction factor of 1.35 indicates that, on average, a single molecule produced 1.35 fragmented copies in this sample. Note that the corrected count will always be at least equal to the number of unique bins, and will not exceed the number of unique IMI and bin combinations. The result of this correction process, 14, is the number that is recorded in the count matrix for this barcode and gene.
A single molecule can produce anywhere between one and fifteen fragmented copies during five amplification cycles. The true distribution of the number of fragmented copies varies between datasets, and is influenced by factors such as sample type and sequencing depth. For a given dataset, the correction factor is an estimate of the average of this distribution, the average IMIs per molecule (IPM) in the sample.
Because all IMIs from the same parent molecule share a binning index, the number of unique binning indexes observed within a specific barcode and gene is determined by the number of molecules, and is not impacted by the number of IMIs that were produced by the molecules. This means the probabilistic relationship between the number of unique bins and the true number of molecules in a barcode and gene combination is constant, and is the result of random sampling from the 64 possible bin indexes when each molecule is captured. We can represent this process by the Coupon Collector Problem (CCP) [2], which defines this distribution of the number of samples (molecules) required to a certain number of distinct types (unique bins) among the total collection of unique types (64 binning indexes). The combination of constant molecular probabilities based on the number of unique bins and observed unique IMIs in barcode and gene combinations in the sample enable the IPM estimation. We select the subset of barcode and gene combinations with between 5 and 32 unique bin indexes. For each of these barcode and genes, we divide the total number of IMIs by the average number of molecules expected based on the number of unique bin indexes. The estimated average IMIs per molecule (IPM) for the sample is the average of these values across this subset of barcode and genes. The subset of barcode and gene combinations is used rather than all available combinations because we expect the CCP model to be most representative of the true molecular count and noise from sources such as sequencing errors or reads outside the cell fraction to be minimized within this subset.
After calculating the average IMIs per molecule in the sample, we apply the IPM to estimate the molecular count from the IMI count in individual barcode and gene combinations. The estimated molecular count for a barcode and gene is the total number of IMIs divided by the IPM, rounded down. The more true molecules a barcode and gene combination has, the closer the estimated molecular count is expected to be with this approach (as the number of molecules increases, the true average IMIs per molecule for a barcode and gene should approach the average IPM of the sample). For barcode and gene combinations with very few molecules, the number of unique bins is expected to be a better predictor of the molecular count than the number of IMIs because the variance in the true IMIs per molecule among this group is high since the number of samples (molecules) in each individual barcode and gene combination is low. For this reason, we apply IPM correction for barcode and gene combinations with more than 10 unique bin indexes, and otherwise the corrected count is equal to the number of unique bin indexes.
When the --scrna-enable-pipseq-mode
batch option is set to true, the following options are automatically included:
--enable-rna=true
--enable-single-cell-rna=true
--bypass-anchor-mapping=true
--annotation-file-ignore-biotypes=pseudogene,shortrna,rrna
--scrna-barcode-position=0_7+11_16+20_25+31_38
--scrna-umi-position=39_41
--scrna-count-introns=true
--umi-source=read1
--scrna-barcode-sequence-list=<PIPSEQ_BARCODE_SEQ_LIST>
--read-trimmers=fixed-len,adapter,polya
--trim-r1-5prime=1
--trim-polya-min-trim=12
--trim-adapter-r1-5prime=<PIPSEQ_ADAPTER_LIST>
--trim-adapter-stringency=12
The <PIPSEQ_BARCODE_SEQ_LIST>
and <PIPSEQ_ADAPTER_LIST>
are packaged with DRAGEN and the path to the files will be automatically detected. If any of these options are also included in the DRAGEN command line, then it will override the value set by the batch option. NOTE: The binning index position is set using the --scrna-umi-position
option.
The following is an example command line to run the DRAGEN Single Cell RNA PIPseq Pipeline.
The primary outputs are the count matrix files and the metrics files which are identical to the single cell RNA pipeline.
There are two sets of matrix files: the raw matrix and the filtered matrix files. The raw matrix files are:
<prefix>.scRNA.matrix.mtx.gz
Count of unique MIs for each cell/gene pair in sparse matrix format.
<prefix>.scRNA.barcodes.tsv.gz
Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scRNA.features.tsv.gz
Gene name, ID and feature type for each gene and feature in the matrix.
A threshold is applied to segregate the cells into PASS cells and LOW non-cell barcodes. The filtered matrix files are:
<prefix>.scRNA.filtered.matrix.mtx.gz
Count of unique MIs for each PASS cell/gene pair in sparse matrix format.
<prefix>.scRNA.filtered.barcodes.tsv.gz
Cell-barcode sequence for each PASS cell from the matrix.
<prefix>.scRNA.filtered.features.tsv.gz
Gene name, ID and feature type for each gene and feature in the matrix (identical to <prefix>.scRNA.features.tsv.gz
).
The barcodeSummary file <prefix>.scRNA.barcodeSummary.txt
includes a full list of the PASS and LOW cells as well as the number of reads, molecules, and genes for each barcode.
The single-cell RNA metrics file <prefix>.scRNA_metrics.csv
includes summary metrics on the read-barcodes, the transcripts and features, the molecule counts, and the cell counts. It also includes the threshold used for calling PASS cells.
Note that additional output files and figures for PIPseq mode may be found in the DRAGEN Reports output.
DRAGEN also supports processing samples from Illumina's CRISPR Single Cell kits using PIPseq technology. Setting --scrna-enable-pipseq-crispr-mode
to true activates this mode.
Activating PIPseq CRISPR mode automatically configures DRAGEN for processing feature reads containing the CRISPR guide RNA (gRNA) sequences. This includes handling offsets in the cell-barcode position for the gRNA reads, transforming the gRNA cell-barcodes to match the gene expression ones, utilizing the "hook and grab" approach for identifying the gRNA reads, and counting the gRNA reads (disregarding BIs and IMIs). Both gene expression and gRNA reads are processed in the same single cell workflow, so extra steps are added to identify the hook sequence of gRNA reads. Note: unmapped reads do not contribute to gene expression read counts but are still included in gRNA counts if they match the hook sequence.
The “hook and grab” method is a targeted approach for identifying CRISPR perturbation reads. It leverages a conserved sequence within the guide RNA structural region as a “hook” to locate the guide RNA and then “grabs” the specific guide by mapping it to a database of known sequences based on their displacement from the hook.
When PIPseq CRISPR mode is activated, --scrna-enable-pipseq-mode
must also be set to true. For a standard PIPseq CRISPR workflow, the CRISPR input is usually provided as an additional pair of FASTQ files and passed to DRAGEN with a FASTQ list input. Furthermore, a CRISPR barcode reference file must be provided using the --scrna-feature-barcode-reference
argument.
The CRISPR barcode reference file should be in CSV format with the following header fields:
Examples of target, position, and sequence values for common use cases are provided below:
The features that were found are appended to the end of <prefix>.scRNA.features.tsv.gz
. The feature counts for all the barcodes are found in the <prefix>.scRNA.matrix.mtx.gz
raw matrix file and the feature counts for the filtered barcodes are found in the <prefix>.scRNA.filtered.matrix.mtx.gz
matrix file.
The following additional metrics are output in <prefix>.scRNA_metrics.csv
when PIPseq CRISPR mode is enabled:
Total CRISPR reads matching known barcodes
Total number of guide RNA FASTQ barcodes that match the known barcode list after error-correction
CRISPR tag mapping rate
Total count of guide RNAs in the raw matrix divided by "Total CRISPR reads matching known barcodes"
CRISPR fraction valid reads in cells
Total count of guide RNAs in the filtered matrix divided by "Total CRISPR reads matching known barcodes"
Fraction passing cells with CRISPR reads
Number of filtered cell barcodes with at least 1 guide RNA divided by the total number of filtered (or PASS) cell barcodes
Fontanez et al., 2024. Intrinsic molecular identifiers enable robust molecular counting in single-cell sequencing. bioRxiv 2024.10.04.616561.
“The Coupon Collector Problem.” 2022. University of Alabama in Huntsville. April 24, 2022. https://stats.libretexts.org/@go/page/10250.
See the general page for information about additional options.
A full description of the output files can be found in the user guide.
Barcode positions
--scrna-barcode-position
0_7+11_16+20_25+31_38
BI position
--scrna-umi-position
39_41
Expected barcode list
--scrna-barcode-sequence-list
<PIPSEQ_BARCODE_SEQ_LIST_PATH>
id
Feature/gene ID
name
Feature/gene name
read
Read that the guide RNA is on (R1 or R2)
target
Target or "hook" sequence. If the value is set to "5p", then the target sequence is based on the beginning of the read (default value: "5p")
position
Displacement of the start of the identifying guide RNA sequence relative to the start of the target sequence (can be positive or negative)
sequence
Guide RNA sequence
feature_type
Name of feature (e.g. CRISPR Direct Capture)