1 of 4

DRAGEN Single-Cell Pipeline

scRNA

The DRAGEN Single-Cell RNA (scRNA) Pipeline can process multiplexed single-cell RNA-Seq data sets from reads to a cell-by-gene UMI count gene expression matrix. The pipeline is compatible with library designs that have one read in a fragment matched to a transcript and the other containing a cell-barcode and UMI. The pipeline includes the following functions:

RNA-Seq (splice-aware) alignment and matching to annotated genes for the transcript reads.
Cell-barcode and UMI error correction for the barcode read.
UMI counting per cell and gene to measure gene expression.
Sparse matrix output and QC metrics.
Feature counting, such as with cell-surface proteins.

The functionality and options related to alignment and gene annotation are identical to the RNA-Seq pipeline. For information, see DRAGEN RNA Pipeline. Other RNA-Seq modules, such as gene fusion calling or transcript-level gene expression quantification are not supported for Single-Cell RNA.

Input Files

Alignment Reference

Use a standard DRAGEN RNA reference genome or hashtable for the scRNA Pipeline. For example, build using --ht-build-rna-hashtable=true. The pipeline also requires a gene annotation file in GTF format, provided with the --annotation (-a) option.

Read Input

The DRAGEN scRNA Pipeline requires both the transcript sequence and the barcode+UMI sequence for each fragment (read) as input. The transcript sequence is aligned to the genome to determine the expressed gene, the barcode+UMI sequence is split into the single-cell barcode to identify the unique cell, and a single-cell UMI for unique molecule quantification. When starting from FASTQ, you can either include the UMI in the read name or provide separate UMI FASTQ files.

UMI in Read Name

Provide the transcriptome reads as a single-end FASTQ file with the Barcode+UMI sequence in the eighth field of the read-name line. Separate sequences using a colon. The following example uses read2 (sample.R2.fastq.gz) as the transcript read.

@D00626:253:CAT5WANXX:4:1302:8433:85343:GAAACTCGTTCAGCGCTCATCTCGGC  
ACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGAAGTTTGCATCCTGCACAGCTAGAGATC  
+  
CCCCBFGGGGGGGGGGGGGGGGGGBDCGGGE>1BGDFGG0F@FBGGGBCDGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGFGGGGFGEGGGGGE

In the example, the GAA sequence is the barcode+UMI read and the ACAG sequence is the transcriptome read.

These FASTQ files can be generated by bclConvert and bcl2fastq using the UMI settings to define the single-cell barcode+UMI read. If using bclConvert, enter the barcode/UMI information using the OverrideCycles1 setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).

Note: bclConvert refers to the entire single-cell barcode+UMI sequence as UMI.

Enter the following command line option to use the generated FASTQ files from bclConvert:

dragen -1 <file name> --umi-source=qname

The option is also compatible with the --fastq-list input options and with read input from BAM files.

Single-cell RNA Fastq-list Files

A single-cell RNA fastq-list file is a CSV file with the following mandatory columns:

In this case, DRAGEN needs to accept --umi-source=read1 (or --umi-source=read2 if swapped) command-line option. For example, a fastq-list file with the contents shown below must be used in combination with --umi-source=read1 option:

Lane,RGID,RGSM,RGLB,Read1File,Read2File
1,rna,sample1,illumina1,scRNA1.R1.fastq.gz,scRNA1.R2.fastq.gz
2,rna,sample1,illumina2,scRNA2.R1.fastq.gz,scRNA2.R2.fastq.gz

Alternatively, the FASTQ file with transcriptomic reads can be specified under Read1File column and the FASTQ file with cell-barcodes and/or UMIs - under UmiFile column. For example, a fastq-list file with the contents shown below must be used in combination with --umi-source=umifile option:

Lane,RGID,RGSM,RGLB,Read1File,UmiFile
1,rna,sample1,illumina1,scRNA1.R2.fastq.gz,scRNA1.R1.fastq.gz
2,rna,sample1,illumina2,scRNA2.R2.fastq.gz,scRNA2.R1.fastq.gz

Separate `UMI Fastq` Files

An alternative option is to provide the transcript and barcode+UMI sequences as two separate FASTQ files. One file contains only the transcriptome reads and one contains the corresponding barcode-reads in the same order. This file is similar to how read-pairs are normally handled. If using separate UMI files, the sequencing system run setup and bclConvert are not aware of the UMI and treat it as normal read sequence by default.

Enter the following command line option to use the separate UMI FASTQ files:

dragen -1 <file name> --umi-fastq=<file name> --umi-source=fastq

To use this method with multiple FASTQ files:

Enter the barcode+UMI FASTQ files as read1 in the fastq-list file, and then enter the transcriptome read FASTQ files, matching the default fastq_list.csv generated by bclConvert, as read2.
Use the following command:

dragen --fastq-list fastq_list.csv --umi-source=read1

Using Multiple Libraries

The scRNA pipeline can process a single biological sample per DRAGEN run. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each. DRAGEN keeps the cells (barcodes and UMIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.

DRAGEN Single-cell Settings

To use the scRNA workflow, use the options --enable-rna=true --enable-single-cell-rna=true. This section includes information on additional scRNA settings.

Barcode Position

By default the scRNA workflow assumes that the overall barcode/UMI sequence is made up of a single-cell barcode (possibly split into multiple blocks) and a single UMI. Enter the following command to define the location of the single-cell barcode and single-cell UMI in the barcode read:

--scrna-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)] --scrna-umi-position <blockPos>

blockPos describes the offset of the first and last inclusive base of the block and is formatted as <startPos>_<endPos>. For example, for a library with a 16 bp cell-barcode followed by a 10 bp UMI, use: --scrna-barcode-position 0_15 --scrna-umi-position 16_25. For a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI, use: --scrna-barcode-position=0_8+21_29+43_51 --scrna-umi-position=52_59.

By default, the barcode position is assumed to be indicated on the forward strand. To explicitly specify the forward strand, use: --scrna-barcode-position 0_15:+ or --scrna-barcode-position=0_8+21_29+43_51:+. To explicitly specify the reverse strand, use: --scrna-barcode-position 0_15:- or --scrna-barcode-position=0_8+21_29+43_51:-.

UMI position can also be specified for feature reads, which are reads with a sequence tag specific to a feature (eg, cell-surface protein or antibody). When the feature-specific UMIs are located on Read 2, you can use --scrna-feature-barcode-r2umi=0_11 to specify a 12 bp feature UMI at the beginning of each feature read.

Known Barcode Lists

You can provide a list of cell barcode sequences to include using the following command:

--scrna-barcode-sequence-list </path/to/barcodeAllowlist.txt>

In the case where the --scrna-barcode-position parameter is not split into multiple blocks (see Barcode Position section) the file must contain one possible cell barcode sequence per line. Differently, when the barcode position is split into multiple blocks, the file must contain a list composed by multiple sections (one for each block): each section must indicate the possible cell barcode block sequences for the corresponding block. Each section should start with a line with prefix #-, e.g.:

#-Block1
<barcode_block_1_sequence_1>
<barcode_block_1_sequence_2>
...
#-Block2
<barcode_block_2_sequence_1>
<barcode_block_2_sequence_2>
...

The input file can also be provided compressed with gzip (*.txt.gz).

During cell-barcode error correction any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. See Barcode Error Correction for more information. If the barcodes cannot be corrected, they are filtered out.

Cell Filtering

DRAGEN uses a threshold on the total count of unique UMIs (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample.

--single-cell-number-cells --- [Optional] Set the expected number of cells. The default is 3000. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.
--single-cell-threshold --- Specify the method for determining the count threshold value. The available values are fixed, ratio, or inflection.
- If using ratio, DRAGEN estimates the expected number of cells as max(T_e, T_m). T_m is a threshold based on a fraction of the counts seen in most abundant cell-barcodes. T_e is a threshold based on a fraction of the least abundant expected cell.
- If using inflection, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.
- If using fixed, the count threshold is set to force the expected number of cells (--single-cell-number-cells option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.
--single-cell-threshold-filterby --- [Optional] Set the count distribution to consider for cell filtering. Can be either "umi" (default) or "read".

To set a specific, fixed number of cells, rather than use the automatically determined threshold, use the following command:

--single-cell-threshold=fixed --single-cell-number-cells=X

The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.

Additional Options

The following are additional options you can use to configure the Single-Cell RNA Pipeline settings.

--rna-library-type --- Set the orientation of transcript reads relative to the genomes. Enter SF for forward, SR for reverse, or U for unstranded. The default is SF.
--scrna-count-introns --- Include intronic reads in gene expression estimation. The default is false.
--qc-enable-depth-metrics --- Set to false to disable depth metrics for faster run time. The default is true.
--bypass-anchor-mapping --- Set to true to disable RNA anchor (two-pass) mapping for increased performance. The default is false.

Command-line Example

The following is an example command line to run the DRAGEN Single Cell RNA Pipeline.

dragen \
--enable-rna=true \
--enable-single-cell-rna=true \
--umi-source=fastq \
--scrna-barcode-position 0_15 \
--scrna-umi-position 16_25 \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/8 \
-a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz \
-1 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R1_001.fastq.gz \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1

Outputs

Single-cell RNA outputs are found in the standard DRAGEN output location using the prefix <sample>. in case of a single library and the prefix <sample>.<libId>. in case of multiple libraries. All single-cell RNA output files contain word scRNA in their names.

Counts

The following three files provide information per-cell gene expression level in matrix market (*.mtx) format:

<prefix>.scRNA.matrix.mtx.gz
- Count of unique UMIs for each cell/gene pair in sparse matrix format.
<prefix>.scRNA.barcodes.tsv.gz
- Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scRNA.genes.tsv.gz
- Gene name and ID for each gene in the matrix.

The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.scRNA.barcodeSummary.tsv indicated by values PASS and FAIL.

The output includes filtered matrix files which only include the per-cell gene expression for the PASS cells in matrix market (*.mtx) format. The scRNA.genes.tsv.gz files is common for the unfiltered and filtered matrices:

<prefix>.scRNA.filtered.matrix.mtx.gz
- Count of unique UMIs for each filtered cell/gene pair in sparse matrix format.
<prefix>.scRNA.filtered.barcodes.tsv.gz
- Cell-barcode sequence for each filtered cell from the matrix.

Loading output in a dense matrix

Some users might want to explore the output matrix in a human-readable format. To do so, a possible way would be to load the matrix in a "dense" dataframe in python (similar methodologies can be used in alternative programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in python).

The matrix can be converted into a "dense" representation through two python modules: scanpy and pandas. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.

First, it is necessary to install the required libraries:

> pip install -U scanpy pandas

Within python, the matrix can be loaded in "dense" representation using the following commands:

# import libraries
import pandas as pd
import scanpy as sc

# define path to input files
matrix_path = "path/to/matrix.mtx.gz"
genes_path = "path/to/genes.tsv.gz"
barcodes_path = "path/to/barcodes.tsv.gz"

# load matrix through scanpy
adata = sc.read_mtx(matrix_path).T
adata.var_names = pd.read_csv(genes_path, sep="\t", header=None, compression="gzip")[1]
adata.obs_names = pd.read_csv(barcodes_path, sep="\t", header=None, compression="gzip")[0]

# convert scanpy internal format (AnnData) to dense pandas DataFrame
df = pd.DataFrame(adata.X.todense(), index=adata.obs_names, columns=adata.var_names)

# save it as CSV file
df.to_csv("output_matrix.csv")

The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.

Alignments

Alignments of the transcript reads are sorted by coordinate and output as a BAM file. Each alignment is annotated with an XB tag containing the cell-barcode and an RX tag containing the UMI. The alignments use the original sequences without any errors corrected. Fragments that do not have an associated barcode read, for example fragments trimmed on the input data, do not have XB and RX tags.

Overall Metrics

The <prefix>.scRNA_metrics.csv file contains per sample scRNA metrics. In scRNA, mapped reads are currently reported by default under R1 metrics, irrespective of whether R1 is the read aligned to the transcriptome or the one containing cell barcode and UMI.

Barcode Read Metrics

Invalid barcode read: Overall barcode sequence (cell barcode + UMI) failed basic checks. For example, the barcode read was missing or too short.
Error free cell-barcode: Reads with cell-barcode sequences that were not altered during error correction. For example, if the read was an exact match to the allow list.
Error corrected cell-barcode: Reads with cell-barcode sequences successfully corrected to a valid sequence.
Filtered cell-barcode: Reads with cell-barcode sequences that could not be corrected to a valid sequence. For example, the sequence does not match allow list with at most one mismatch.

Transcript/Feature Read Metrics

Unique exon match: Reads with valid cell-barcode and UMI that match a unique gene.
Unique intron match: Reads do not match exons, but introns of exactly one gene. For example, if using the command --scrna-count-introns=true.
Ambiguous match: Reads match to multiple genes.
Wrong strand: Reads overlap a gene on the opposite strand defined by library type.
Mitochondrial reads: Reads map to the mitochondrial example, if there is a matching gene.
No gene match: Reads do not match to any gene. Includes intronic reads unless using --scrna-count-introns=true.
Filtered multimapper: Reads excluded due to multiple alignment positions in the genome.
Feature reads: Reads matching to features, when using feature counting.

UMI count metrics

Total counted reads: Reads with valid cell-barcode and UMI, matching a unique gene
Reads with error-corrected UMI: Counted reads where the UMI was error-corrected to match another similar UMI sequence.
Reads with invalid UMI: Reads that were not counted due to invalid UMI sequence. For example, pure homopolymer reads or reads containing Ns.
Sequencing saturation: Fraction of reads with duplicate UMIs. 1 - ( UMIs / Reads).
Unique cell-barcodes: Overall number of unique cell-barcode sequences in counted reads only.
Total UMIs: Overall number of unique cell-barcode and UMI combinations counted.

Cell Metrics

UMI threshold for passing cells: Number of UMIs required for a cell-barcode to pass filtering.
Passing cells: Number of cell-barcodes that passed the filters.
Fraction genic reads in cells: Counted reads assigned to cells that passed the filters.
Fraction reads in putative cells: All counted reads assigned to cells that passed the filters.
Median reads per cells: Total counted reads per cell that passed the filters.
Median UMIs per cells: Total counted UMIs per cell that passed the filters.
Median genes per cells: Genes with at least one UMI per cell that passed the filters.
Total genes detected: Genes with at least one UMI in at least one cell that passed the filters.

Per-cell metrics

The <prefix>.scRNA.barcodeSummary.tsv contains summary statistics for each unique cell-barcode per cell after error correction.

ID: Unique numeric ID for the cell-barcode. The ID corresponds to the line number of that barcode in the UMI count matrix (*.mtx) output.
Barcode: The cell-barcode sequence.
TotalReads: Total reads with the cell-barcode sequence. This includes error corrected reads.
GeneReads: Reads (primary read alignments) counted towards a gene.
UMIs: Total number of UMIs in counted reads.
Genes: Unique genes detected.
Mitochondrial Reads: Reads mapped to mitochondrial genome.
Filter: The following are the available filter values:
- PASS: Cell-barcode passes the filter.
- LOW: UMI count is below threshold.

Barcode error correction

Cell-barcode sequences from the input reads are error corrected based on their frequency, and optionally through a list of expected cell-barcode sequences. A cell-barcode sequence is corrected into another cell-barcode sequence if they differ only by one base (Hamming distance 1) and:

Either the corrected cell-barcode is at least two times more frequent across all input reads
Or the corrected cell-barcode is on the list of expected cell-barcode sequences, but the original cell-barcode is not

When corrected, all the original cell-barcode reads are assigned to the corrected cell-barcode. The sequence error correction scheme is similar to the directional algorithm described in (Smith, Heger and Sudbery, 2020)¹.

¹Smith, T., Heger, A. and Sudbery, I., 2020. UMI-Tools: Modeling Sequencing Errors In Unique Molecular Identifiers To Improve Quantification Accuracy. [PDF] Cold Spring Harbor Laboratory Press. Available at: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976/> [Accessed 15 October 2020].

To avoid overcounting UMIs based on sequence errors, UMI error correction is performed among all reads with the same cell-barcode mapping to the same gene. UMI sequences that are likely errors of another UMI are not counted.

Genotype Sample Demultiplexing

DRAGEN implements several strategies for demultiplexing data sets that represent mixtures of cells from different individuals, such as cells pooled in one library prep or microfluidic run. Two of these strategies include a genotype-based and genotype-free demultiplexing. In genotype demultiplexing methods, DRAGEN can assign sample identity to cells based on alleles observed in reads in each cell (only SNVs are considered). DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.

To use genotype-based sample demultiplexing, you must provide a VCF file with genotypes for each sample in the data set. To use genotype-free sample demultiplexing, you must provide a VCF file with a set of external samples preferably coming from a population with the same genetic background. The GT field represents the sample genotypes.

For information on the cell-hashing demultiplexing method, see Cell-Hashing.

Command-Line Options

You can use the following command line options for scRNA demultiplexing.

One of the two:
- --scrna-demux-sample-vcf — If using genotype-based sample demultiplexing, specify the VCF file that contains the sample genotypes.
- --scrna-demux-reference-vcf — If using genotype-free sample demultiplexing, specify the VCF file that contains the genotypes of a population with a similar genetic background to the samples you are using.
--scrna-demux-detect-doublets — Enable the doublet detection in genotype-based sample demultiplexing. The default value is false.
--scrna-demux-number-samples — The number of samples you are using. This option is only applicable when using an external VCF reference specified with the scrna-demux-reference-vcf option, for genotype-free sample demultiplexing.

The following is an example command line to run the DRAGEN Single Cell RNA Pipeline with genotype-based demultiplexing.

dragen \
--enable-rna=true \
--enable-single-cell-rna=true \
--umi-source=fastq \
--scrna-barcode-position 0_15 \
--scrna-umi-position 16_25 \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/8 \
-a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz \
-1 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R1_001.fastq.gz \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1 \
--scrna-demux-detect-doublet=true \
--scrna-demux-sample-vcf=sample.vcf

The following is an example command line to run the DRAGEN Single Cell RNA Pipeline with genotype-free demultiplexing.

dragen \
--enable-rna=true \
--enable-single-cell-rna=true \
--umi-source=fastq \
--scrna-barcode-position 0_15 \
--scrna-umi-position 16_25 \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/8 \
-a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz \
-1 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R1_001.fastq.gz \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1 \
--scrna-demux-detect-doublet=true \
--scrna-demux-reference-vcf=sample.vcf \
--scrna-demux-number-samples=4

Outputs

You can find information related to the output of genotype-based scRNA sample demultiplexing in the following three files.

The <prefix>.scRNA.barcodeSummary.tsv file contains per-cell metrics, including cell barcodes. The following columns contain information on demultiplexing per-cell. See Outputs for more information on <prefix>.scRNA.barcodeSummary.tsv metrics.

The <prefix>.scRNA.demux.tsv file contains sample demultiplexing statistics that were used to infer sample identity of each cell.

The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-cell metrics, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv.

Tumor/non-tumor Cell Demultiplexing

In addition to cell demultiplexing based on sample or population genotypes, there is an additional option to calculate read likelihoods for all scRNA reads conditioned on a given list of SNVs. The SNVs are specified by providing a somatic VCF (typically from a DNA WES tumor-normal somatic VC output). Note that the input VCF is "trusted" completely (ignoring variant call quality). The read-likelihoods are combined at the cell level and used to classify cells as tumor or non-tumor.

Command-Line Options

--scrna-demux-tumor-normal-vcf — A tumor-normal somatic variant call VCF that has tumor and normal sample columns. It is recommended for the VCF to have more than ~100 confident SNVs but that threshold may vary based on the coverage of the scRNA data. Having too few somatic variants runs the risk of not having enough scRNA reads intersecting the SNVs, leading to most cells being unclassified due to insufficient information.
--scrna-demux-tn-threshold — The log-likelihood ratio threshold for classifying a cell as tumor or non-tumor. The default value is set to 2.0, but we recommend user to test control samples (e.g. high and low tumor-fraction samples) to find an optimal threshold tailored to their experimental design and sample type.

Example DRAGEN Single Cell RNA command line using genotype-based demultiplexing:

dragen \
--enable-rna=true \
--enable-single-cell-rna=true \
--umi-source=fastq \
--scrna-barcode-position="0_15" \
--scrna-umi-position="16_25" \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/9 \
-a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz \
-1 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R1_001.fastq.gz \
--scrna-barcode-sequence-list=737K-august-2016.txt \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1 \
--scrna-demux-detect-doublets=false \
--scrna-demux-tumor-normal-vcf=TN_somatic.vcf.gz \
--scrna-demux-tn-threshold=2

Outputs

The output for tumor/non-tumor scRNA cell demultiplexing can be found in the following three files.

The <prefix>.scRNA.demux.tsv file contains demultiplexing statistics that are used to infer the identity of each cell. Only cells in which variant-supporting reads are found are recorded in this file.

The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-cell metrics, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv. Ambiguous calls, i.e. 'AMB(NORMAL,TUMOR)', are excluded from the calculation.

Cell-Hashing

DRAGEN implements several strategies for demultiplexing of data sets that represent mixtures of cells from different individuals, such as cells from different individuals pooled in one library prep or microfluidic run. One of these methods is a sample oligo-tag based method, referred to as cell-hashing.

To use cell-hashing, you must provide a cell-hashing CSV or FASTA reference file. In CSV format, the feature barcode reference file uses the following header: id,name,read,position,sequence,feature_type.

id — Identifier of the feature. For example, ADT_A1018.
name — Name of the feature. For example, ADT_Hu.HLA.DR.DP.DQ_A1018.
read — Read 1 (R1) or Read 2 (R2).
position — Position on the specified read, including starting position and the length of the feature barcode. For example, a position of 0_15 represents a feature barcode that starts at position 0 and has a length of 15.
sequence — DNA sequence of the feature barcode. For example, CAGCCCGATTAAGGT.
feature_type — Type of the feature. For example, Antibody Capture.

Command-Line Options

To enable cell-hashing sample demultiplexing, specify the following command line options.

--scrna-cell-hashing-reference --- Specify a CSV or FASTA cell-hashing reference file that contains sample-specific oligo-tags.
--scrna-demux-detect-doublets --- Enable doublet detection in cell-hashing sample demultiplexing. The default value is false.
--scrna-demux-sample-fastq --- Output sample-specific FASTQ files. See Sample-Specific FASTQ Output Files for more information.

Outputs

The <prefix>.scRNA.barcodeSummary.tsv file contains per-cell metrics, including cell barcodes. The following column in the <prefix>.scRNA.barcodeSummary.tsv contains cell-hashing per-cell information. For more information on the <prefix>.scRNA.barcodeSummary.tsv file, see Single Cell RNA Outputs.

The <prefix>.scRNA.demux.tsv file contains sample demultiplexing statistics that were used to infer sample identity of each cell.

Sample-Specific FASTQ Output Files

If you have enabled either of the sample demultiplexing algorithms, you can output sample-specific FASTQ files after the sample identities for each cell is available using the command line option --scrna-demux-sample-fastq.

If gzip is specified, then the sample-specific output FASTQ files are compressed in gzip format. If fastq is specified, then the sample-specific FASTQ files are not compressed. The default option is none, which indicates that no sample-specific FASTQ files are produced.

Feature Counting

Feature counting is a technique to profile the expression of proteins (e.g., cell surface proteins or antibodies). Feature reads differ from RNA reads in that their read R2 is not a transcriptomic sequence, but rather an oligo tag corresponding to a particular type of protein.

To enable feature counting, specify the following command-line options:

--scrna-feature-barcode-reference — Specify a CSV or FASTA feature reference file that contains feature barcodes. The CSV file format is similar to the format used for Cell-Hashing.
--scrna-feature-barcode-groups — If feature reads are specified as separate FASTQ files, specify a comma-separated list of read groups that correspond to feature FASTQ files.

The output of feature counting is appended to the output expression matrix. The extended expression matrix contains additional rows corresponding to the features.

scATAC

The DRAGEN Single-Cell ATAC (scATAC) Pipeline can process single-cell ATAC-Seq data sets from reads to a cell-by-peak read count matrix. The pipeline includes the following functions:

ATAC-Seq alignment.
Cell-barcode error correction for the barcode read.
Chromatin accessibility peak calling.
Fragment counting per cell and peak to measure chromatin accessibility.
Sparse matrix output and QC metrics.

The functionality and options related to alignment and gene annotation are identical to DNA pipelines. For information, see DRAGEN DNA Pipeline.

Input Files

Alignment Reference

Use a standard DRAGEN DNA reference genome or hashtable for the scATAC Pipeline.

Read Input

The DRAGEN scATAC Pipeline requires both the genomic sequence and the barcode sequence for each fragment (read) as input. The genomic sequence is aligned to the reference genome to determine the expressed gene, the single-cell barcode sequence is used to identify the unique cell. When starting from FASTQ, you can either include the UMI in the read name or provide separate cell-barcode FASTQ files.

UMI in Read Name

Provide the genomic reads as a paired-end FASTQ files with the Barcode sequence in the eighth field of the read-name line. Separate sequences using a colon. The following example uses read2 (sample.R2.fastq.gz) as the genomic read.

@D00626:253:CAT5WANXX:4:1302:8433:85343:GAAACTCGTTCAGCGC  
ACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGAAGTTTGCATCCTGCACAGCTAGAGATC  
+  
CCCCBFGGGGGGGGGGGGGGGGGGBDCGGGE>1BGDFGG0F@FBGGGBCDGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGFGGGGFGEGGGGGE

In the example, the GAAACTCGTTCAGCGC sequence is the barcode read and the ACAG... sequence is the genomic read.

These FASTQ files can be generated by bclConvert and bcl2fastq using the UMI settings to define the single-cell barcode read. If using bclConvert, enter the barcode information using the OverrideCycles1 setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).

Note: bclConvert refers to the entire single-cell barcode sequence as UMI.

Enter the following command line option to use the generated FASTQ files from bclConvert: dragen -1 <file name 1> -2 <file name 2> --umi-source=qname

The option is also compatible with the --fastq-list input options and with read input from BAM files.

Single-cell ATAC Fastq-list Files

A single-cell ATAC fastq-list file is a CSV file with the following mandatory columns:

An example is shown below:

Lane,RGID,RGSM,RGLB,Read1File,Read2File,UmiFile
1,atac,sample1,illumina1,scATAC1.R1.fastq.gz,scATAC1.R2.fastq.gz,scATAC1.R3.fastq.gz
2,atac,sample1,illumina2,scATAC2.R1.fastq.gz,scATAC2.R2.fastq.gz,scATAC2.R3.fastq.gz

Separate `UMI Fastq` Files

An alternative option is to provide the genomic and barcode sequences as three separate FASTQ files. Two files contain only the genomic reads and one contains the corresponding barcode-reads in the same order. This file is similar to how read-pairs are normally handled. If using separate UMI files, the sequencing system run setup and bclConvert are not aware of the UMI and treat it as normal read sequence by default.

Enter the following command line option to use the separate UMI FASTQ files: dragen -1 <file name 1> -2 <file name 2> --umi-fastq=<file name 3> --umi-source=fastq To use this method with multiple FASTQ files, follow these steps:

Enter the barcode FASTQ files as read1 in the fastq-list file, and then enter the genomic read FASTQ files matching the default fastq_list.csv generated by bclConvert as read2 and umifile.
Enter the following command: dragen --fastq-list fastq_list.csv --umi-source=umifile

Using Multiple Libraries

The scATAC pipeline can process a single biological sample per DRAGEN run. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each DRAGEN keeps the cells (barcodes and UMIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.

DRAGEN Single-cell Settings

To use the scATAC workflow, enter --enable-single-cell-atac=true. This section includes information on additional scATAC settings.

Barcode Position

By default the scATAC workflow assumes that the overall barcode sequence is made up of a single-cell barcode (possibly split into multiple blocks). Enter the following command to identify the location of the single-cell barcode:

--scatac-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)]

blockPos describes the offset of the first and last inclusive base of the block and is formatted as <startPos>_<endPos>. For example, for a library with a 16 bp cell-barcode, enter: --scatac-barcode-position 0_15. For a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI, enter: --scatac-barcode-position=0_8+21_29+43_51.

By default, the barcode position is assumed to be indicated on the forward strand. To explicitly specify the forward strand, enter: --scatac-barcode-position 0_15:+ or --scatac-barcode-position=0_8+21_29+43_51:+. Conversely, to specify the reverse strand, enter: --scatac-barcode-position 0_15:- or --scatac-barcode-position=0_8+21_29+43_51:-.

Known Barcode Lists

You can provide a list of cell barcode sequences to include using the following command:

--scatac-barcode-sequence-list </path/to/barcodeAllowlist.txt>

In the case where the --scatac-barcode-position parameter is not split into multiple blocks (see Barcode Position section) the file must contain one possible cell barcode sequence per line. Differently, when the barcode position is split into multiple blocks, the file must contain a list composed by multiple sections (one for each block): each section must indicate the possible cell barcode block sequences for the corresponding block. Each section should start with a line with prefix #-, e.g.:

#-Block1
<barcode_block_1_sequence_1>
<barcode_block_1_sequence_2>
...
#-Block2
<barcode_block_2_sequence_1>
<barcode_block_2_sequence_2>
...

The input file might be compressed with gzip (*.txt.gz).

Cell Filtering

DRAGEN uses a threshold on the total count of unique reads per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample.

--single-cell-number-cells --- [Optional] Set the expected number of cells. The default is 3000. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.
--single-cell-threshold --- Specify the method for determining the count threshold value. The available values are fixed, ratio, or inflection.
- If using ratio, DRAGEN estimates the expected number of cells as max(T_e, T_m). T_m is a threshold based on a fraction of the counts seen in most abundant cell-barcodes. T_e is a threshold based on a fraction of the least abundant expected cell.
- If using inflection, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.
- If using fixed, the count threshold is set to force the expected number of cells (--single-cell-number-cells option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.

For example, to set a fixed number of cells rather than use the automatically determined threshold, use the following command:

--single-cell-threshold=fixed --single-cell-number-cells=X

The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.

Additional Options

The following are additional options you can use to configure the Single-Cell ATAC Pipeline settings.

--qc-enable-depth-metrics --- Set to false to disable depth metrics for faster run time. The default is true.
--scatac-write-fragments --- Set to true to write counted fragments to the disk (in both tsv (<prefix>.scATAC.fragments.tsv) and BigWig (<prefix>.scATAC.fragments.bigwig) format). The default is false.

Command-line Example

The following is an example command line to run the DRAGEN Single Cell ATAC Pipeline.

dragen \
--enable-single-cell-atac=true \
--umi-source=fastq \
--scatac-barcode-position 0_15 \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/8 \
-1 lib1_S7_L001_R1_001.fastq.gz \
-2 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R3_001.fastq.gz \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1

Outputs

Single-cell ATAC outputs are found in the standard DRAGEN output location using the prefix <sample>. in case of a single library and the prefix <sample>.<libId>. in case of multiple libraries. All single-cell ATAC output files contain word scATAC in their names.

Counts

The following three files provide information about per-cell chromatin accessibility level in matrix market (*.mtx) format:

<prefix>.scATAC.matrix.mtx.gz
- Count of unique fragments for each cell/peak pair in sparse matrix format.
<prefix>.scATAC.barcodes.tsv.gz
- Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scATAC.peaks.tsv.gz
- Peak name and ID for each peak in the matrix.

The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.scATAC.barcodeSummary.tsv indicated by values PASS and FAIL.

The output includes filtered matrix files which only include the per-cell chromatin accessibility level for the PASS cells in matrix market (*.mtx) format. The scATAC.peaks.tsv.gz file is common for the unfiltered and filtered matrices:

<prefix>.scATAC.filtered.matrix.mtx.gz
- Count of unique UMIs for each filtered cell/peak pair in sparse matrix format.
<prefix>.scATAC.filtered.barcodes.tsv.gz
- Cell-barcode sequence for each filtered cell from the matrix.

Loading output in a dense matrix

The matrix can be converted into a "dense" representation through two python modules: scanpy and pandas. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.

First, it is necessary to install the required libraries:

> pip install -U scanpy pandas

Within python, the matrix can be loaded in "dense" representation using the following commands:

# import libraries
import pandas as pd
import scanpy as sc

# define path to input files
matrix_path = "path/to/matrix.mtx.gz"
peaks_path = "path/to/peaks.tsv.gz"
barcodes_path = "path/to/barcodes.tsv.gz"

# load matrix through scanpy
adata = sc.read_mtx(matrix_path).T
adata.var_names = pd.read_csv(peaks_path, sep="\t", header=None, compression="gzip")[1]
adata.obs_names = pd.read_csv(barcodes_path, sep="\t", header=None, compression="gzip")[0]

# convert scanpy internal format (AnnData) to dense pandas DataFrame
df = pd.DataFrame(adata.X.todense(), index=adata.obs_names, columns=adata.var_names)

# save it as CSV file
df.to_csv("output_matrix.csv")

The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.

Alignments

Alignments of the genomic reads are sorted by coordinate and output as a BAM file. Each alignment is annotated with an XB tag containing the cell-barcode. The alignments use the original sequences without any errors corrected. Fragments that did not have an associated barcode read, for example fragments trimmed on the input data, do not have XB tag.

Overall Metrics

The <prefix>.scATAC_metrics.csv file contains per sample scATAC metrics.

Barcode Read Metrics

Invalid barcode read: Overall barcode sequence failed basic checks. For example, the barcode read was missing or too short.
Error free cell-barcode: Reads with cell-barcode sequences that were not altered during error correction. For example, if the read was an exact match to the allow list.
Error corrected cell-barcode: Reads with cell-barcode sequences successfully corrected to a valid sequence.
Filtered cell-barcode: Reads with cell-barcode sequences that could not be corrected to a valid sequence. For example, the sequence does not match allow list with at most one mismatch.

Genomic Fragment Metrics

Fragments passing filters: Non-chimeric non-mitochondrial fragments that align to primary contigs with a high mapping quality (greater than 30 by default).
Non-primary contig fragments: Fragments that align to non-primary contigs (any contigs that are not autosome, X and Y).
Chimeric fragments: Fragments with the two reads aligning to different contigs.
Mitochondrial fragments: Fragments aligning to the mitochondrial contigs.
Low mapping quality fragments: Fragments with the two reads aligning with a mapping quality set to some specific value (default is 30).
Improperly mapped fragments: The two reads in the fragment are not mapped in proper pair (SAM flag "read mapped in proper pair" is set to 0).

Cell Metrics

Fragment threshold for passing cells: Number of fragments required for a cell-barcode to pass filtering.
Passing cells: Number of cell-barcodes that passed the filters.
Fraction peak fragments in passing cells: Percentage of counted fragments intersecting peaks assigned to cells that passed the filters.
Fraction fragments in passing cells: Percentage of all counted fragments assigned to cells that passed the filters.
Median fragments per cells: Total counted fragments per cell that passed the filters.
Median peaks per cells: Peaks with at least one fragment per cell that passed the filters.
Total peaks detected: Peaks with at least one fragment in at least one cell that passed the filters.

Per-cell metrics

The <prefix>.scATAC.barcodeSummary.tsv contains summary statistics for each unique cell-barcode per cell after error correction.

ID: Unique numeric ID for the cell-barcode.
Barcode: The cell-barcode sequence.
TotalFragments: Total fragments with the cell-barcode sequence.
UniqueFragments: Unique fragments counted towards a peak.
NonPrimaryContigFragments: Unique non-primary contig framgnets.
ChimericFragments: Unique chimeric fragments.
LowMapqFragments: Unique low mapping quality fragments.
MitochondrialFragments: Unique fragments mapped to mitochondrial genome.
Peaks: Unique peaks detected.
Filter: The following are the available filter values:
- PASS: Cell-barcode passes the filter.
- LOW: UMI count is below threshold.

Barcode error correction

Cell-barcode sequences from the input reads are error corrected based on the frequency with which each one is seen and an optional allow list of expected cell-barcode sequences. A cell-barcode sequence is considered a neighbor of another cell-barcode if there is at most one mismatch. A cell-barcode sequence is corrected to its neighbor in the following circumstances. When corrected, all reads with the cell-barcode are assigned instead to the neighboring cell-barcode. The sequence error correction scheme is similar to the directional algorithm described in (Smith, Heger and Sudbery, 2020).

The neighboring cell-barcode is at least two times more frequent across all input reads.
The neighboring cell-barcode is on the cell-barcode allow list, but the original cell-barcode is not.

To avoid overcounting cell-barcodes based on sequence errors, cell-barcode error correction is performed among all reads with the same cell-barcode mapping to the same peak region. Cell-barcode sequences that are likely errors of another cell-barcodes are not counted.

Ref: Smith, T., Heger, A. and Sudbery, I., 2020. UMI-Tools: Modeling Sequencing Errors In Unique Molecular Identifiers To Improve Quantification Accuracy. [PDF] Cold Spring Harbor Laboratory Press. Available at: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976/> [Accessed 1 March 2022].

Peak Calling

DRAGEN calls peaks using an algorithm based on MACS, version 3 (Zhang et al., 2008). To customize the behavior of the peak calling algorithm, modify any of the command line parameters specified in the table below.

(*), (**) - default value is computed automatically as the mean fragment length.

Alternatively, if fragment counting needs to be performed on a pre-specfified set of peaks, provide a peak BED file using command line parameter atac-peak-bed-file.

Ref: Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown M., Li. W. and Liu, X.S., 2008. Model-based Analysis of ChIP-Seq (MACS). [PDF] Available at: <https://pubmed.ncbi.nlm.nih.gov/18798982/> [Accessed 1 March 2022].

Peak Annotation

DRAGEN annotates each peak with respect to a gene symbol as promoter, distal, or intergenic depending on the genomic position of both the peak and the gene. The following rules are used to determine the annotation of a peak:

If a peak overlaps with promoter region (-1000bp, +100bp) of any transcription start site (TSS), it is annotated as a promoter peak of the gene.
If a peak is within 200kb of the closest TSS, and if it is not a promoter peak of the gene of the closest TSS, it will be annotated as a distal peak of that gene.
If a peak overlaps the body of a transcript, and it is not a promoter nor a distal peak of the gene, it will be annotated as a distal peak of that gene with distance set as zero.
If a peak has not been mapped to any gene at the step, it will be annotated as an intergenic peak without a gene symbol assigned.

To enable peak annotation in DRAGEN scATAC-Seq workflow, specify a gene annotation file (GTF) using the option -a. Peak annotations are written to a file with name <prefix>.scATAC.peaks.tsv and each annotation is represented as a row with the following 6 columns:

Chromosome number
Start position
End position
Gene symbol
Distance from peak to gene
Peak annotation (i.e, promoter, distal, or intergenic).

Transcription Factor Motif Analysis

In this analysis, peaks are matched to a set of known transcription factor (TF) binding sites and for each cell barcode the fragment counts are grouped based on transcription factors their peaks are assigned to. This results in a more compact representation of chromatin accessibility patterns. To enable TF motif analysis, specify a database of position-weight matrices (PWM) corresponding to transcription factor motifs in JASPAR format:

--atac-jaspar-database=JASPAR2022_CORE_non-redundant_pfms_jaspar.txt

DRAGEN will produce two files <prefix>.scATAC.tf.matrix.mtx.gz and <prefix>.scATAC.tf.motifs.tsv.gz which combined with file <prefix>.tf.barcodes.tsv.gz represent the cell-by-TF count matrix in matrix market format.

Single-Cell Multiomics

The DRAGEN Single-Cell Multiomics (scRNA + scATAC) Pipeline can process data sets from single-cell RNA-Seq and ATAC-Seq reads to a cell-by-feature count matrix.

The pipeline is compatible with library designs that have:

For single-cell RNA: one read in a fragment match to a transcript and the other containing a cell-barcode and UMI.
For single-cell ATAC: two reads matching to the genome and the third one containing cell-barcode.

The pipeline includes the following functions:

Alignment
- RNA-Seq (splice-aware) alignment and matching to annotated genes for the transcript reads.
- ATAC-Seq alignment
Cell-barcode (both RNA- and ATAC-Seq) and UMI (RNA-Seq only) error correction for the barcode read.
Read counting
- UMI counting per cell and gene to measure gene expression.
- Fragment counting per cell and peak to measure chromatin accessibility.
Sparse matrix output and QC metrics.

Input Files

Alignment Reference

A standard DRAGEN reference hashtable with both DNA and RNA capability is required for the Single-Cell Multiomics Pipeline. Building a reference hashtable using --ht-build-rna-hashtable=true should satisfy this requirement. See the section Prepare a Reference Genome for more details.

The pipeline also requires a gene annotation file in GTF format, provided with the with the --annotation (-a) option.

Read Input

Since the multiomics workflow assumes at least one pair of single-cell RNA FASTQ files and one triple of single-cell ATAC FASTQ files, the only method to provide read input to DRAGEN is through the fastq-list file mechanism. A multiomics fastq-list file is a CSV file with the following mandatory columns:

Entries in a fastq-list file corresponding to single-cell RNA analysis must have read 2 FASTQ files in the UmiFile column and the Read2File column entries must be left empty. An example is shown below:

Lane,RGID,RGSM,RGLB,Read1File,Read2File,UmiFile,InputGroup
1,rna,sample1,illumina,scRNA.R1.fastq.gz,,scRNA.R2.fastq.gz,SCRNA
2,atac,sample1,illumina,scATAC.R1.fastq.gz,scATAC.R2.fastq.gz,scATAC.R3.fastq.gz,SCATAC

DRAGEN Single-cell Settings

To use the multiomics workflow, enter --enable-rna=true --enable-single-cell-rna=true --enable-single-cell-atac=true.

Barcode Position

By default the multiomics workflow assumes that the overall barcode/UMI sequence is made up of a single-cell barcode (possibly split into multiple blocks) and a single UMI. Enter the following command to identify the location of the single-cell barcode and single-cell UMI in the barcode read:

--scrna-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)] --scrna-umi-position <blockPos> --scatac-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)]

For more details, please consult with the corresponding section of scRNA and scATAC workflows' documentation.

Known Barcode Lists

Barcode lists are mandatory for both scRNA and scATAC reads. You need to provide the list of cell barcode sequences using the following command:

--scrna-barcode-sequence-list </path/to/scRNAbarcodeAllowlist.txt> --scatac-barcode-sequence-list </path/to/scATACbarcodeAllowlist.txt>

The files must contain one possible cell barcode sequence per line. You can compress the file with gzip (*.txt.gz). During cell-barcode error correction any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. If the barcodes cannot be corrected, they are filtered out.

Cell Filtering

For each individual modality (either scRNA or scATAC), DRAGEN uses a threshold on the total count of unique UMIs (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample. For more information, see the corresponding section in scATAC/scRNA documentation.

After count thresholds in each individual modality is computed, DRAGEN performs a joint cell filtering step. Each cell barcode is represented in a 2-D space with coordinates computed as the total UMI count across genes and the total fragment count across peaks. Initially, a cell-barcode is considered as passing the joint filter if it is passing the filter in each individual modality. DRAGEN then groups all cell barcodes in two categories: those passing both individual modality filters and the rest of cell barcodes. A k-means algorithm with 2 clusters is run and the filtering status of each cell barcode is refined.

Command-line Example

The following is an example command line to run the DRAGEN Single Cell Multiomics Pipeline.


dragen \
--enable-rna=true \
--enable-single-cell-atac=true \
-r hg19.fa.default \
--ht-alt-aware-validate=false \
--fastq-list=multiomics_fastq_list.csv \
--umi-source=umifile \
--output-dir=multiomics_output \
--output-file-prefix=sample1 \
--scatac-barcode-position=0_15 \
--scrna-barcode-position=0_15 \
--scrna-umi-position=16_25 \
--single-cell-threshold=ratio \
--enable-single-cell-rna=true \
-a gencode.v32.primary_assembly.annotation.filtered.gtf \
--scrna-barcode-sequence-whitelist=737K-arc-v1-gex.gz \
--scatac-barcode-sequence-whitelist=737K-arc-v1-atac.gz

Outputs

Single-cell Multiomics outputs are found in the standard DRAGEN output location using the prefix <sample>. in case of a single library and the prefix <sample>.<libId>. in case of multiple libraries. All single-cell Multiomics output files contain word multiomics in their names.

Counts

The following three files provide information per-cell feature count level in matrix market (*.mtx) format:

<prefix>.multiomics.matrix.mtx.gz
- Count of unique UMIs or fragments for each cell/feature pair in sparse matrix format.
<prefix>.multiomics.barcodes.tsv.gz
- Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.multiomics.features.tsv.gz
- Feature name and ID for each feature in the matrix.

The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.multiomics.barcodeSummary.tsv indicated by values PASS and FAIL.

The output includes filtered matrix files which only include the per-cell feature count level for the filtered cells in matrix market (*.mtx) format. The multiomics.features.tsv.gz file is common for the unfiltered and filtered matrices:

<prefix>.multiomics.filtered.matrix.mtx.gz
- Count of unique UMIs for each filtered cell/feature pair in sparse matrix format.
<prefix>.multiomics.filtered.barcodes.tsv.gz
- Cell-barcode sequence for each filtered cell from the matrix.

Loading output in a dense matrix

The matrix can be converted into a "dense" representation through two python modules: scanpy and pandas. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.

First, it is necessary to install the required libraries:

> pip install -U scanpy pandas

Within python, the matrix can be loaded in "dense" representation using the following commands:

# import libraries
import pandas as pd
import scanpy as sc

# define path to input files
matrix_path = "path/to/matrix.mtx.gz"
features_path = "path/to/features.tsv.gz"
barcodes_path = "path/to/barcodes.tsv.gz"

# load matrix through scanpy
adata = sc.read_mtx(matrix_path).T
adata.var_names = pd.read_csv(features_path, sep="\t", header=None, compression="gzip")[1]
adata.obs_names = pd.read_csv(barcodes_path, sep="\t", header=None, compression="gzip")[0]

# convert scanpy internal format (AnnData) to dense pandas DataFrame
df = pd.DataFrame(adata.X.todense(), index=adata.obs_names, columns=adata.var_names)

# save it as CSV file
df.to_csv("output_matrix.csv")

The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.

Alignments

DRAGEN Single-Cell Multiomics outputs two BAM files sorted by coordinate - one with suffix scRNA.bam and one with suffix scATAC.bam. For more details, please consult the corresponding section from scATAC/scRNA user guide.

When running in multiomics mode, since the DRAGEN aligner processes both RNA and ATAC reads, there will be separate mapper metrics summary for each modality. This is identified via the RGID field as rna or atac. There is no common map/align metrics summary for all input reads since it would merge the two modalities. The <prefix>.mapping_metrics.csv file will also reflect this.

Below is an example snippet of the alignment metrics printed at the end of a DRAGEN Single-Cell Multiomics run.

MAPPING/ALIGNING PER RG   rna   Total reads in RG                                        52319680       100.00
MAPPING/ALIGNING PER RG   rna   Number of duplicate marked reads                         0              0.00
MAPPING/ALIGNING PER RG   rna   Number of duplicate marked and mate reads removed        NA
MAPPING/ALIGNING PER RG   rna   Number of unique reads (excl. duplicate marked reads)    52319680       100.00
...

MAPPING/ALIGNING PER RG   atac  Total reads in RG                                        203484254      100.00
MAPPING/ALIGNING PER RG   atac  Number of duplicate marked reads                         0              0.00
MAPPING/ALIGNING PER RG   atac  Number of duplicate marked and mate reads removed        NA
MAPPING/ALIGNING PER RG   atac  Number of unique reads (excl. duplicate marked reads)    203484254      100.00
...

Overall Metrics

The <prefix>.multiomics_metrics.csv file contains per sample scATAC and scRNA metrics. For more details, please consult with the corresponding section from scATAC/scRNA user guide.

Here is an example of how a <prefix>.multiomics_metrics.csv file can look like:

SINGLE-CELL ATAC METRICS,lib1,Invalid barcode fragments,0
SINGLE-CELL ATAC METRICS,lib1,Error free cell-barcode,308656
SINGLE-CELL ATAC METRICS,lib1,Error corrected cell-barcode,251364
...
SINGLE-CELL RNA METRICS,lib1,Median genes per cell,1456
SINGLE-CELL RNA METRICS,lib1,Total genes detected,14128
SINGLE-CELL METRICS,lib1,Passing cells,1920

Per-cell Metrics

The <prefix>.multiomics.barcodeSummary.tsv contains summary statistics for each unique cell-barcode per cell after error correction. Here is an example of how a <prefix>.multiomics.barcodeSummary.tsv file can look like:

ID  Barcode           BarcodeScAtac     BarcodeScRna      UniqueFragments  TotalFragments  PeakFragments  NonPrimaryContigFragments  ChimericFragments  LowMapqFragments  MitochondrialFragments  Peaks  TotalReads  GeneReads  UMIs  Genes  MitochondrialReads  Filter
1   AAACAAGCAAACAAAG  AAACAAGCAAACAAAG  GGCTAGTGTTCGGTAA  22               48              13             17                         0                  2                 7                       9      59          0          0     0      0                   LOW
2   AAACAAGCAAACCAGC  AAACAAGCAAACCAGC  GGCTAGTGTACTGAAT  7                7               3              0                          0                  0                 0                       3      8           0          0     0      0                   LOW

scATAC

The DRAGEN Single-Cell ATAC (scATAC) Pipeline can process single-cell ATAC-Seq data sets from reads to a cell-by-peak read count matrix. The pipeline includes the following functions:

ATAC-Seq alignment.
Cell-barcode error correction for the barcode read.
Chromatin accessibility peak calling.
Fragment counting per cell and peak to measure chromatin accessibility.
Sparse matrix output and QC metrics.

The functionality and options related to alignment and gene annotation are identical to DNA pipelines. For information, see DRAGEN DNA Pipeline.

Input Files

Alignment Reference

Use a standard DRAGEN DNA reference genome or hashtable for the scATAC Pipeline.

Read Input

UMI in Read Name

@D00626:253:CAT5WANXX:4:1302:8433:85343:GAAACTCGTTCAGCGC  
ACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGAAGTTTGCATCCTGCACAGCTAGAGATC  
+  
CCCCBFGGGGGGGGGGGGGGGGGGBDCGGGE>1BGDFGG0F@FBGGGBCDGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGFGGGGFGEGGGGGE

In the example, the GAAACTCGTTCAGCGC sequence is the barcode read and the ACAG... sequence is the genomic read.

Note: bclConvert refers to the entire single-cell barcode sequence as UMI.

Enter the following command line option to use the generated FASTQ files from bclConvert: dragen -1 <file name 1> -2 <file name 2> --umi-source=qname

The option is also compatible with the --fastq-list input options and with read input from BAM files.

Single-cell ATAC Fastq-list Files

A single-cell ATAC fastq-list file is a CSV file with the following mandatory columns:

Column

Description

An example is shown below:

Lane,RGID,RGSM,RGLB,Read1File,Read2File,UmiFile
1,atac,sample1,illumina1,scATAC1.R1.fastq.gz,scATAC1.R2.fastq.gz,scATAC1.R3.fastq.gz
2,atac,sample1,illumina2,scATAC2.R1.fastq.gz,scATAC2.R2.fastq.gz,scATAC2.R3.fastq.gz

Separate `UMI Fastq` Files

Enter the barcode FASTQ files as read1 in the fastq-list file, and then enter the genomic read FASTQ files matching the default fastq_list.csv generated by bclConvert as read2 and umifile.
Enter the following command: dragen --fastq-list fastq_list.csv --umi-source=umifile

Using Multiple Libraries

DRAGEN Single-cell Settings

To use the scATAC workflow, enter --enable-single-cell-atac=true. This section includes information on additional scATAC settings.

Barcode Position

--scatac-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)]

Known Barcode Lists

You can provide a list of cell barcode sequences to include using the following command:

--scatac-barcode-sequence-list </path/to/barcodeAllowlist.txt>

#-Block1
<barcode_block_1_sequence_1>
<barcode_block_1_sequence_2>
...
#-Block2
<barcode_block_2_sequence_1>
<barcode_block_2_sequence_2>
...

The input file might be compressed with gzip (*.txt.gz).

Cell Filtering

--single-cell-number-cells --- [Optional] Set the expected number of cells. The default is 3000. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.
--single-cell-threshold --- Specify the method for determining the count threshold value. The available values are fixed, ratio, or inflection.
- If using ratio, DRAGEN estimates the expected number of cells as max(T_e, T_m). T_m is a threshold based on a fraction of the counts seen in most abundant cell-barcodes. T_e is a threshold based on a fraction of the least abundant expected cell.
- If using inflection, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.
- If using fixed, the count threshold is set to force the expected number of cells (--single-cell-number-cells option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.

For example, to set a fixed number of cells rather than use the automatically determined threshold, use the following command:

--single-cell-threshold=fixed --single-cell-number-cells=X

The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.

Additional Options

The following are additional options you can use to configure the Single-Cell ATAC Pipeline settings.

--qc-enable-depth-metrics --- Set to false to disable depth metrics for faster run time. The default is true.
--scatac-write-fragments --- Set to true to write counted fragments to the disk (in both tsv (<prefix>.scATAC.fragments.tsv) and BigWig (<prefix>.scATAC.fragments.bigwig) format). The default is false.

Command-line Example

The following is an example command line to run the DRAGEN Single Cell ATAC Pipeline.

dragen \
--enable-single-cell-atac=true \
--umi-source=fastq \
--scatac-barcode-position 0_15 \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/8 \
-1 lib1_S7_L001_R1_001.fastq.gz \
-2 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R3_001.fastq.gz \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1

Outputs

Counts

The following three files provide information about per-cell chromatin accessibility level in matrix market (*.mtx) format:

<prefix>.scATAC.matrix.mtx.gz
- Count of unique fragments for each cell/peak pair in sparse matrix format.
<prefix>.scATAC.barcodes.tsv.gz
- Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scATAC.peaks.tsv.gz
- Peak name and ID for each peak in the matrix.

The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.scATAC.barcodeSummary.tsv indicated by values PASS and FAIL.

<prefix>.scATAC.filtered.matrix.mtx.gz
- Count of unique UMIs for each filtered cell/peak pair in sparse matrix format.
<prefix>.scATAC.filtered.barcodes.tsv.gz
- Cell-barcode sequence for each filtered cell from the matrix.

Loading output in a dense matrix

The matrix can be converted into a "dense" representation through two python modules: scanpy and pandas. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.

First, it is necessary to install the required libraries:

> pip install -U scanpy pandas

Within python, the matrix can be loaded in "dense" representation using the following commands:

# import libraries
import pandas as pd
import scanpy as sc

# define path to input files
matrix_path = "path/to/matrix.mtx.gz"
peaks_path = "path/to/peaks.tsv.gz"
barcodes_path = "path/to/barcodes.tsv.gz"

# load matrix through scanpy
adata = sc.read_mtx(matrix_path).T
adata.var_names = pd.read_csv(peaks_path, sep="\t", header=None, compression="gzip")[1]
adata.obs_names = pd.read_csv(barcodes_path, sep="\t", header=None, compression="gzip")[0]

# convert scanpy internal format (AnnData) to dense pandas DataFrame
df = pd.DataFrame(adata.X.todense(), index=adata.obs_names, columns=adata.var_names)

# save it as CSV file
df.to_csv("output_matrix.csv")

The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.

Alignments

Overall Metrics

The <prefix>.scATAC_metrics.csv file contains per sample scATAC metrics.

Barcode Read Metrics

Invalid barcode read: Overall barcode sequence failed basic checks. For example, the barcode read was missing or too short.
Error free cell-barcode: Reads with cell-barcode sequences that were not altered during error correction. For example, if the read was an exact match to the allow list.
Error corrected cell-barcode: Reads with cell-barcode sequences successfully corrected to a valid sequence.
Filtered cell-barcode: Reads with cell-barcode sequences that could not be corrected to a valid sequence. For example, the sequence does not match allow list with at most one mismatch.

Genomic Fragment Metrics

Fragments passing filters: Non-chimeric non-mitochondrial fragments that align to primary contigs with a high mapping quality (greater than 30 by default).
Non-primary contig fragments: Fragments that align to non-primary contigs (any contigs that are not autosome, X and Y).
Chimeric fragments: Fragments with the two reads aligning to different contigs.
Mitochondrial fragments: Fragments aligning to the mitochondrial contigs.
Low mapping quality fragments: Fragments with the two reads aligning with a mapping quality set to some specific value (default is 30).
Improperly mapped fragments: The two reads in the fragment are not mapped in proper pair (SAM flag "read mapped in proper pair" is set to 0).

Cell Metrics

Fragment threshold for passing cells: Number of fragments required for a cell-barcode to pass filtering.
Passing cells: Number of cell-barcodes that passed the filters.
Fraction peak fragments in passing cells: Percentage of counted fragments intersecting peaks assigned to cells that passed the filters.
Fraction fragments in passing cells: Percentage of all counted fragments assigned to cells that passed the filters.
Median fragments per cells: Total counted fragments per cell that passed the filters.
Median peaks per cells: Peaks with at least one fragment per cell that passed the filters.
Total peaks detected: Peaks with at least one fragment in at least one cell that passed the filters.

Per-cell metrics

The <prefix>.scATAC.barcodeSummary.tsv contains summary statistics for each unique cell-barcode per cell after error correction.

ID: Unique numeric ID for the cell-barcode.
Barcode: The cell-barcode sequence.
TotalFragments: Total fragments with the cell-barcode sequence.
UniqueFragments: Unique fragments counted towards a peak.
NonPrimaryContigFragments: Unique non-primary contig framgnets.
ChimericFragments: Unique chimeric fragments.
LowMapqFragments: Unique low mapping quality fragments.
MitochondrialFragments: Unique fragments mapped to mitochondrial genome.
Peaks: Unique peaks detected.
Filter: The following are the available filter values:
- PASS: Cell-barcode passes the filter.
- LOW: UMI count is below threshold.

Barcode error correction

The neighboring cell-barcode is at least two times more frequent across all input reads.
The neighboring cell-barcode is on the cell-barcode allow list, but the original cell-barcode is not.

Peak Calling

Command line parameter

Meaning

Default value

(*), (**) - default value is computed automatically as the mean fragment length.

Alternatively, if fragment counting needs to be performed on a pre-specfified set of peaks, provide a peak BED file using command line parameter atac-peak-bed-file.

Peak Annotation

If a peak overlaps with promoter region (-1000bp, +100bp) of any transcription start site (TSS), it is annotated as a promoter peak of the gene.
If a peak is within 200kb of the closest TSS, and if it is not a promoter peak of the gene of the closest TSS, it will be annotated as a distal peak of that gene.
If a peak overlaps the body of a transcript, and it is not a promoter nor a distal peak of the gene, it will be annotated as a distal peak of that gene with distance set as zero.
If a peak has not been mapped to any gene at the step, it will be annotated as an intergenic peak without a gene symbol assigned.

Chromosome number
Start position
End position
Gene symbol
Distance from peak to gene
Peak annotation (i.e, promoter, distal, or intergenic).

Transcription Factor Motif Analysis

--atac-jaspar-database=JASPAR2022_CORE_non-redundant_pfms_jaspar.txt

scRNA

RNA-Seq (splice-aware) alignment and matching to annotated genes for the transcript reads.
Cell-barcode and UMI error correction for the barcode read.
UMI counting per cell and gene to measure gene expression.
Sparse matrix output and QC metrics.
Feature counting, such as with cell-surface proteins.

Input Files

Alignment Reference

Read Input

UMI in Read Name

@D00626:253:CAT5WANXX:4:1302:8433:85343:GAAACTCGTTCAGCGCTCATCTCGGC  
ACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGAAGTTTGCATCCTGCACAGCTAGAGATC  
+  
CCCCBFGGGGGGGGGGGGGGGGGGBDCGGGE>1BGDFGG0F@FBGGGBCDGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGFGGGGFGEGGGGGE

In the example, the GAA sequence is the barcode+UMI read and the ACAG sequence is the transcriptome read.

Note: bclConvert refers to the entire single-cell barcode+UMI sequence as UMI.

Enter the following command line option to use the generated FASTQ files from bclConvert:

dragen -1 <file name> --umi-source=qname

The option is also compatible with the --fastq-list input options and with read input from BAM files.

Single-cell RNA Fastq-list Files

A single-cell RNA fastq-list file is a CSV file with the following mandatory columns:

Column

Description

Lane,RGID,RGSM,RGLB,Read1File,Read2File
1,rna,sample1,illumina1,scRNA1.R1.fastq.gz,scRNA1.R2.fastq.gz
2,rna,sample1,illumina2,scRNA2.R1.fastq.gz,scRNA2.R2.fastq.gz

Lane,RGID,RGSM,RGLB,Read1File,UmiFile
1,rna,sample1,illumina1,scRNA1.R2.fastq.gz,scRNA1.R1.fastq.gz
2,rna,sample1,illumina2,scRNA2.R2.fastq.gz,scRNA2.R1.fastq.gz

Separate `UMI Fastq` Files

Enter the following command line option to use the separate UMI FASTQ files:

dragen -1 <file name> --umi-fastq=<file name> --umi-source=fastq

To use this method with multiple FASTQ files:

Enter the barcode+UMI FASTQ files as read1 in the fastq-list file, and then enter the transcriptome read FASTQ files, matching the default fastq_list.csv generated by bclConvert, as read2.
Use the following command:

dragen --fastq-list fastq_list.csv --umi-source=read1

Using Multiple Libraries

DRAGEN Single-cell Settings

To use the scRNA workflow, use the options --enable-rna=true --enable-single-cell-rna=true. This section includes information on additional scRNA settings.

Barcode Position

--scrna-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)] --scrna-umi-position <blockPos>

Known Barcode Lists

You can provide a list of cell barcode sequences to include using the following command:

--scrna-barcode-sequence-list </path/to/barcodeAllowlist.txt>

#-Block1
<barcode_block_1_sequence_1>
<barcode_block_1_sequence_2>
...
#-Block2
<barcode_block_2_sequence_1>
<barcode_block_2_sequence_2>
...

The input file can also be provided compressed with gzip (*.txt.gz).

Cell Filtering

--single-cell-number-cells --- [Optional] Set the expected number of cells. The default is 3000. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.
--single-cell-threshold --- Specify the method for determining the count threshold value. The available values are fixed, ratio, or inflection.
- If using ratio, DRAGEN estimates the expected number of cells as max(T_e, T_m). T_m is a threshold based on a fraction of the counts seen in most abundant cell-barcodes. T_e is a threshold based on a fraction of the least abundant expected cell.
- If using inflection, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.
- If using fixed, the count threshold is set to force the expected number of cells (--single-cell-number-cells option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.
--single-cell-threshold-filterby --- [Optional] Set the count distribution to consider for cell filtering. Can be either "umi" (default) or "read".

To set a specific, fixed number of cells, rather than use the automatically determined threshold, use the following command:

--single-cell-threshold=fixed --single-cell-number-cells=X

The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.

Additional Options

The following are additional options you can use to configure the Single-Cell RNA Pipeline settings.

--rna-library-type --- Set the orientation of transcript reads relative to the genomes. Enter SF for forward, SR for reverse, or U for unstranded. The default is SF.
--scrna-count-introns --- Include intronic reads in gene expression estimation. The default is false.
--qc-enable-depth-metrics --- Set to false to disable depth metrics for faster run time. The default is true.
--bypass-anchor-mapping --- Set to true to disable RNA anchor (two-pass) mapping for increased performance. The default is false.

Command-line Example

The following is an example command line to run the DRAGEN Single Cell RNA Pipeline.

dragen \
--enable-rna=true \
--enable-single-cell-rna=true \
--umi-source=fastq \
--scrna-barcode-position 0_15 \
--scrna-umi-position 16_25 \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/8 \
-a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz \
-1 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R1_001.fastq.gz \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1

Outputs

Counts

The following three files provide information per-cell gene expression level in matrix market (*.mtx) format:

<prefix>.scRNA.matrix.mtx.gz
- Count of unique UMIs for each cell/gene pair in sparse matrix format.
<prefix>.scRNA.barcodes.tsv.gz
- Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scRNA.genes.tsv.gz
- Gene name and ID for each gene in the matrix.

The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.scRNA.barcodeSummary.tsv indicated by values PASS and FAIL.

<prefix>.scRNA.filtered.matrix.mtx.gz
- Count of unique UMIs for each filtered cell/gene pair in sparse matrix format.
<prefix>.scRNA.filtered.barcodes.tsv.gz
- Cell-barcode sequence for each filtered cell from the matrix.

Loading output in a dense matrix

The matrix can be converted into a "dense" representation through two python modules: scanpy and pandas. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.

First, it is necessary to install the required libraries:

> pip install -U scanpy pandas

Within python, the matrix can be loaded in "dense" representation using the following commands:

# import libraries
import pandas as pd
import scanpy as sc

# define path to input files
matrix_path = "path/to/matrix.mtx.gz"
genes_path = "path/to/genes.tsv.gz"
barcodes_path = "path/to/barcodes.tsv.gz"

# load matrix through scanpy
adata = sc.read_mtx(matrix_path).T
adata.var_names = pd.read_csv(genes_path, sep="\t", header=None, compression="gzip")[1]
adata.obs_names = pd.read_csv(barcodes_path, sep="\t", header=None, compression="gzip")[0]

# convert scanpy internal format (AnnData) to dense pandas DataFrame
df = pd.DataFrame(adata.X.todense(), index=adata.obs_names, columns=adata.var_names)

# save it as CSV file
df.to_csv("output_matrix.csv")

The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.

Alignments

Overall Metrics

Barcode Read Metrics

Invalid barcode read: Overall barcode sequence (cell barcode + UMI) failed basic checks. For example, the barcode read was missing or too short.
Error free cell-barcode: Reads with cell-barcode sequences that were not altered during error correction. For example, if the read was an exact match to the allow list.
Error corrected cell-barcode: Reads with cell-barcode sequences successfully corrected to a valid sequence.
Filtered cell-barcode: Reads with cell-barcode sequences that could not be corrected to a valid sequence. For example, the sequence does not match allow list with at most one mismatch.

Transcript/Feature Read Metrics

Unique exon match: Reads with valid cell-barcode and UMI that match a unique gene.
Unique intron match: Reads do not match exons, but introns of exactly one gene. For example, if using the command --scrna-count-introns=true.
Ambiguous match: Reads match to multiple genes.
Wrong strand: Reads overlap a gene on the opposite strand defined by library type.
Mitochondrial reads: Reads map to the mitochondrial example, if there is a matching gene.
No gene match: Reads do not match to any gene. Includes intronic reads unless using --scrna-count-introns=true.
Filtered multimapper: Reads excluded due to multiple alignment positions in the genome.
Feature reads: Reads matching to features, when using feature counting.

UMI count metrics

Total counted reads: Reads with valid cell-barcode and UMI, matching a unique gene
Reads with error-corrected UMI: Counted reads where the UMI was error-corrected to match another similar UMI sequence.
Reads with invalid UMI: Reads that were not counted due to invalid UMI sequence. For example, pure homopolymer reads or reads containing Ns.
Sequencing saturation: Fraction of reads with duplicate UMIs. 1 - ( UMIs / Reads).
Unique cell-barcodes: Overall number of unique cell-barcode sequences in counted reads only.
Total UMIs: Overall number of unique cell-barcode and UMI combinations counted.

Cell Metrics

UMI threshold for passing cells: Number of UMIs required for a cell-barcode to pass filtering.
Passing cells: Number of cell-barcodes that passed the filters.
Fraction genic reads in cells: Counted reads assigned to cells that passed the filters.
Fraction reads in putative cells: All counted reads assigned to cells that passed the filters.
Median reads per cells: Total counted reads per cell that passed the filters.
Median UMIs per cells: Total counted UMIs per cell that passed the filters.
Median genes per cells: Genes with at least one UMI per cell that passed the filters.
Total genes detected: Genes with at least one UMI in at least one cell that passed the filters.

Per-cell metrics

The <prefix>.scRNA.barcodeSummary.tsv contains summary statistics for each unique cell-barcode per cell after error correction.

ID: Unique numeric ID for the cell-barcode. The ID corresponds to the line number of that barcode in the UMI count matrix (*.mtx) output.
Barcode: The cell-barcode sequence.
TotalReads: Total reads with the cell-barcode sequence. This includes error corrected reads.
GeneReads: Reads (primary read alignments) counted towards a gene.
UMIs: Total number of UMIs in counted reads.
Genes: Unique genes detected.
Mitochondrial Reads: Reads mapped to mitochondrial genome.
Filter: The following are the available filter values:
- PASS: Cell-barcode passes the filter.
- LOW: UMI count is below threshold.

Barcode error correction

Either the corrected cell-barcode is at least two times more frequent across all input reads
Or the corrected cell-barcode is on the list of expected cell-barcode sequences, but the original cell-barcode is not

Genotype Sample Demultiplexing

For information on the cell-hashing demultiplexing method, see Cell-Hashing.

Command-Line Options

You can use the following command line options for scRNA demultiplexing.

One of the two:
- --scrna-demux-sample-vcf — If using genotype-based sample demultiplexing, specify the VCF file that contains the sample genotypes.
- --scrna-demux-reference-vcf — If using genotype-free sample demultiplexing, specify the VCF file that contains the genotypes of a population with a similar genetic background to the samples you are using.
--scrna-demux-detect-doublets — Enable the doublet detection in genotype-based sample demultiplexing. The default value is false.
--scrna-demux-number-samples — The number of samples you are using. This option is only applicable when using an external VCF reference specified with the scrna-demux-reference-vcf option, for genotype-free sample demultiplexing.

The following is an example command line to run the DRAGEN Single Cell RNA Pipeline with genotype-based demultiplexing.

dragen \
--enable-rna=true \
--enable-single-cell-rna=true \
--umi-source=fastq \
--scrna-barcode-position 0_15 \
--scrna-umi-position 16_25 \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/8 \
-a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz \
-1 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R1_001.fastq.gz \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1 \
--scrna-demux-detect-doublet=true \
--scrna-demux-sample-vcf=sample.vcf

The following is an example command line to run the DRAGEN Single Cell RNA Pipeline with genotype-free demultiplexing.

dragen \
--enable-rna=true \
--enable-single-cell-rna=true \
--umi-source=fastq \
--scrna-barcode-position 0_15 \
--scrna-umi-position 16_25 \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/8 \
-a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz \
-1 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R1_001.fastq.gz \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1 \
--scrna-demux-detect-doublet=true \
--scrna-demux-reference-vcf=sample.vcf \
--scrna-demux-number-samples=4

Outputs

You can find information related to the output of genotype-based scRNA sample demultiplexing in the following three files.

Column

Description

The <prefix>.scRNA.demux.tsv file contains sample demultiplexing statistics that were used to infer sample identity of each cell.

Column

Description

The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-cell metrics, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv.

Column

Description

Tumor/non-tumor Cell Demultiplexing

Command-Line Options

--scrna-demux-tumor-normal-vcf — A tumor-normal somatic variant call VCF that has tumor and normal sample columns. It is recommended for the VCF to have more than ~100 confident SNVs but that threshold may vary based on the coverage of the scRNA data. Having too few somatic variants runs the risk of not having enough scRNA reads intersecting the SNVs, leading to most cells being unclassified due to insufficient information.
--scrna-demux-tn-threshold — The log-likelihood ratio threshold for classifying a cell as tumor or non-tumor. The default value is set to 2.0, but we recommend user to test control samples (e.g. high and low tumor-fraction samples) to find an optimal threshold tailored to their experimental design and sample type.

Example DRAGEN Single Cell RNA command line using genotype-based demultiplexing:

dragen \
--enable-rna=true \
--enable-single-cell-rna=true \
--umi-source=fastq \
--scrna-barcode-position="0_15" \
--scrna-umi-position="16_25" \
-r reference_genomes/Mus_musculus/mm10/DRAGEN/9 \
-a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz \
-1 lib1_S7_L001_R2_001.fastq.gz \
--umi-fastq lib1_S7_L001_R1_001.fastq.gz \
--scrna-barcode-sequence-list=737K-august-2016.txt \
--RGID=1 \
--RGSM=sample1 \
--output-dir=/staging/out \
--output-file-prefix=sample1 \
--scrna-demux-detect-doublets=false \
--scrna-demux-tumor-normal-vcf=TN_somatic.vcf.gz \
--scrna-demux-tn-threshold=2

Outputs

The output for tumor/non-tumor scRNA cell demultiplexing can be found in the following three files.

Column

Description

Column

Description

Column

Description

Cell-Hashing

id — Identifier of the feature. For example, ADT_A1018.
name — Name of the feature. For example, ADT_Hu.HLA.DR.DP.DQ_A1018.
read — Read 1 (R1) or Read 2 (R2).
position — Position on the specified read, including starting position and the length of the feature barcode. For example, a position of 0_15 represents a feature barcode that starts at position 0 and has a length of 15.
sequence — DNA sequence of the feature barcode. For example, CAGCCCGATTAAGGT.
feature_type — Type of the feature. For example, Antibody Capture.

Command-Line Options

To enable cell-hashing sample demultiplexing, specify the following command line options.

--scrna-cell-hashing-reference --- Specify a CSV or FASTA cell-hashing reference file that contains sample-specific oligo-tags.
--scrna-demux-detect-doublets --- Enable doublet detection in cell-hashing sample demultiplexing. The default value is false.
--scrna-demux-sample-fastq --- Output sample-specific FASTQ files. See Sample-Specific FASTQ Output Files for more information.

Outputs

Column

Description

The <prefix>.scRNA.demux.tsv file contains sample demultiplexing statistics that were used to infer sample identity of each cell.

Column

Description

Sample-Specific FASTQ Output Files

Feature Counting

To enable feature counting, specify the following command-line options:

--scrna-feature-barcode-reference — Specify a CSV or FASTA feature reference file that contains feature barcodes. The CSV file format is similar to the format used for Cell-Hashing.
--scrna-feature-barcode-groups — If feature reads are specified as separate FASTQ files, specify a comma-separated list of read groups that correspond to feature FASTQ files.

The output of feature counting is appended to the output expression matrix. The extended expression matrix contains additional rows corresponding to the features.

DRAGEN Single-Cell Pipeline

scRNA

Input Files

Alignment Reference

Read Input

UMI in Read Name

Single-cell RNA Fastq-list Files

Separate UMI Fastq Files

Using Multiple Libraries

DRAGEN Single-cell Settings

Barcode Position

Known Barcode Lists

Cell Filtering

Additional Options

Command-line Example

Outputs

Counts

Loading output in a dense matrix

Alignments

Overall Metrics

Barcode Read Metrics

Transcript/Feature Read Metrics

UMI count metrics

Cell Metrics

Per-cell metrics

Barcode error correction

Genotype Sample Demultiplexing

Command-Line Options

Outputs

Tumor/non-tumor Cell Demultiplexing

Command-Line Options

Outputs

Cell-Hashing

Command-Line Options

Outputs

Sample-Specific FASTQ Output Files

Feature Counting

scATAC

Input Files

Alignment Reference

Read Input

UMI in Read Name

Single-cell ATAC Fastq-list Files

Separate UMI Fastq Files

Using Multiple Libraries

DRAGEN Single-cell Settings

Barcode Position

Known Barcode Lists

Cell Filtering

Additional Options

Command-line Example

Outputs

Counts

Loading output in a dense matrix

Alignments

Overall Metrics

Barcode Read Metrics

Genomic Fragment Metrics

Cell Metrics

Per-cell metrics

Barcode error correction

Peak Calling

Peak Annotation

Transcription Factor Motif Analysis

Single-Cell Multiomics

Input Files

Alignment Reference

Read Input

DRAGEN Single-cell Settings

Barcode Position

Known Barcode Lists

Cell Filtering

Command-line Example

Outputs

Counts

Loading output in a dense matrix

Alignments

Overall Metrics

Per-cell Metrics

Single-Cell Multiomics

Separate `UMI Fastq` Files

Separate `UMI Fastq` Files

Separate `UMI Fastq` Files

Separate `UMI Fastq` Files