Illumina PIPseq scRNA
Last updated
Last updated
The DRAGEN PIPseq scRNA Pipeline is designed to process data obtained with Illumina’s proprietary Illumina Single Cell 3' RNA Prep library based on PIPseq technology [1]. The command line option --scrna-enable-pipseq-mode=true
runs the scRNA pipeline with methods that are optimized to process the unique structure of the Illumina Single Cell 3' RNA Prep library. All scRNA pipelines take in the same input files and produce the same output files.
See Illumina Single Cell 3` RNA Prep Kit for more information on the library prep.
The purpose of this step is to:
Assign barcodes to reads
Remove unwanted technical sequences from the data before mapping
Assign binning indices (BIs) and intrinsic molecular identifiers (IMIs) to reads
The figure below describes the fragmentation process taking place in PIPseq chemistry. Each read in R1 includes a barcode sequence, which identifies the individual PIP, followed by a 3-base binning index (BI) sequence. R2 includes the sequenced cDNA constructs created from the captured mRNA, which contain random cut sites that serve as intrinsic molecular identifiers (IMIs), and are used for molecular counting. Up to fifteen different cut sites can be created from a single captured molecule using five cycles of whole transcriptome amplification. The genome alignment position, which will be different for each cut site, serves as the IMI for each read.
The PIPseq platform uses a tiered barcode system, where every one of four tiers includes one of a specified list of possible barcodes. Barcode matching can therefore be done highly efficiently by matching each tier in isolation. The pipeline allows a Hamming distance of 1 per tier, i.e., the bases in the R1 FASTQ corresponding to each tier’s position can differ from a barcode by one base and still be matched to that barcode. The pipeline identifies the barcode combination and the binning index for each read. The barcode positions, BI position and the expected barcode list are set automatically with the --scrna-enable-pipseq-mode
argument, so they do not need to be specified by the user.
Custom barcode positions, BI positions and expected barcode list can be set with the following arguments:
Barcode positions
--scrna-barcode-position
0_7+11_16+20_25+31_38
BI position
--scrna-umi-position
39_41
Expected barcode list
--scrna-barcode-sequence-list
<PIPSEQ_BARCODE_SEQ_LIST_PATH>
The R2 file, which includes the cDNA construct, also includes technical sequences related to the PIPseq chemistry, namely the template switch oligo (TSO) and poly-A sequences. The prevalence of those sequences is inversely correlated with the fragment size. Because a read is less likely to be mapped accurately if it contains extraneous sequences, TSO sequences are removed from the 5’ end and poly-A sequences are removed from the 3’ end in the R2 FASTQ file. In addition, the first base, a constant T, is removed from the 5’ end of R2. These are done using the DRAGEN trimmers, with the settings automatically included under the --scrna-enable-pipseq-mode
flag. Stats on the number of reads and bases trimmed can be found in the <prefix>.trimmer_metrics.csv
file.
In PIPseq mode, the pipeline automatically ignores pseudogene, shortRNA and rRNA biotypes during mapping. to override this behavior, use --annotation-file-ignore-biotypes=none.
PIPSeq chemistry generates fragments with unique starting positions, which serve as intrinsic identifiers that can be used for molecular counting. Reads are grouped together based on the cell barcode and the assigned gene. Within each barcode and gene combination, IMIs are grouped in one of 64 bins, based on the 3-base binning index. For each bin, all identical IMIs are collapsed into a single count, since they are likely PCR duplicates of the same fragment generated during library prep.
At this point, any barcode and gene combination which has ten or fewer unique binning indexes is assigned the number of unique binning indexes as its final count estimate. The pipeline then totals the number of IMIs associated with each remaining barcode and gene combination, and divides that number by a correction factor estimated from the data, which accounts for the additional copies generated from a single captured molecule during five amplification cycles. The final count estimate is the maximum between the floor of this value and the number of unique binning indexes for this barcode and gene.
Estimating the correction factor is described in detail in the next section. The example below shows molecular counting with a correction factor of 1.35. The barcode and gene combination in this example has 12 unique binning indexes, so the correction factor is applied to produce the final count. The correction factor of 1.35 indicates that, on average, a single molecule produced 1.35 fragmented copies in this sample. Note that the corrected count will always be at least equal to the number of unique bins, and will not exceed the number of unique IMI and bin combinations. The result of this correction process, 14, is the number that is recorded in the count matrix for this barcode and gene.
A single molecule can produce anywhere between one and fifteen fragmented copies during five amplification cycles. The true distribution of the number of fragmented copies varies between datasets, and is influenced by factors such as sample type and sequencing depth. For a given dataset, the correction factor is an estimate of the average of this distribution, the average IMIs per molecule (IPM) in the sample.
Because all IMIs from the same parent molecule share a binning index, the number of unique binning indexes observed within a specific barcode and gene is determined by the number of molecules, and is not impacted by the number of IMIs that were produced by the molecules. This means the probabilistic relationship between the number of unique bins and the true number of molecules in a barcode and gene combination is constant, and is the result of random sampling from the 64 possible bin indexes when each molecule is captured. We can represent this process by the Coupon Collector Problem (CCP) [2], which defines this distribution of the number of samples (molecules) required to a certain number of distinct types (unique bins) among the total collection of unique types (64 binning indexes). The combination of constant molecular probabilities based on the number of unique bins and observed unique IMIs in barcode and gene combinations in the sample enable the IPM estimation. We select the subset of barcode and gene combinations with between 5 and 32 unique bin indexes. For each of these barcode and genes, we divide the total number of IMIs by the average number of molecules expected based on the number of unique bin indexes. The estimated average IMIs per molecule (IPM) for the sample is the average of these values across this subset of barcode and genes. The subset of barcode and gene combinations is used rather than all available combinations because we expect the CCP model to be most representative of the true molecular count and noise from sources such as sequencing errors or reads outside the cell fraction to be minimized within this subset.
After calculating the average IMIs per molecule in the sample, we apply the IPM to estimate the molecular count from the IMI count in individual barcode and gene combinations. The estimated molecular count for a barcode and gene is the total number of IMIs divided by the IPM, rounded down. The more true molecules a barcode and gene combination has, the closer the estimated molecular count is expected to be with this approach (as the number of molecules increases, the true average IMIs per molecule for a barcode and gene should approach the average IPM of the sample). For barcode and gene combinations with very few molecules, the number of unique bins is expected to be a better predictor of the molecular count than the number of IMIs because the variance in the true IMIs per molecule among this group is high since the number of samples (molecules) in each individual barcode and gene combination is low. For this reason, we apply IPM correction for barcode and gene combinations with more than 10 unique bin indexes, and otherwise the corrected count is equal to the number of unique bin indexes.
When the --scrna-enable-pipseq-mode
batch option is set to true, the following options are automatically included:
--enable-rna=true
--enable-single-cell-rna=true
--bypass-anchor-mapping=true
--annotation-file-ignore-biotypes=pseudogene,shortrna,rrna
--scrna-barcode-position=0_7+11_16+20_25+31_38
--scrna-umi-position=39_41
--scrna-count-introns=true
--umi-source=read1
--scrna-barcode-sequence-list=<PIPSEQ_BARCODE_SEQ_LIST>
--read-trimmers=fixed-len,adapter,polya
--trim-r1-5prime=1
--trim-polya-min-trim=12
--trim-adapter-r1-5prime=<PIPSEQ_ADAPTER_LIST>
--trim-adapter-stringency=12
The <PIPSEQ_BARCODE_SEQ_LIST>
and <PIPSEQ_ADAPTER_LIST>
are packaged with DRAGEN and the path to the files will be automatically detected. If any of these options are also included in the DRAGEN command line, then it will override the value set by the batch option. NOTE: The binning index position is set using the --scrna-umi-position
option.
The following is an example command line to run the DRAGEN Single Cell RNA PIPseq Pipeline.
See the general DRAGEN Single-Cell RNA pipeline page for information about additional options.
The primary outputs are the count matrix files and the metrics files which are identical to the single cell RNA pipeline.
There are two sets of matrix files: the raw matrix and the filtered matrix files. The raw matrix files are:
<prefix>.scRNA.matrix.mtx.gz
Count of unique MIs for each cell/gene pair in sparse matrix format.
<prefix>.scRNA.barcodes.tsv.gz
Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scRNA.features.tsv.gz
Gene name, ID and feature type for each gene and feature in the matrix.
A threshold is applied to segregate the cells into PASS cells and LOW non-cell barcodes. The filtered matrix files are:
<prefix>.scRNA.filtered.matrix.mtx.gz
Count of unique MIs for each PASS cell/gene pair in sparse matrix format.
<prefix>.scRNA.filtered.barcodes.tsv.gz
Cell-barcode sequence for each PASS cell from the matrix.
<prefix>.scRNA.filtered.features.tsv.gz
Gene name, ID and feature type for each gene and feature in the matrix (identical to <prefix>.scRNA.features.tsv.gz
).
The barcodeSummary file <prefix>.scRNA.barcodeSummary.txt
includes a full list of the PASS and LOW cells as well as the number of reads, molecules, and genes for each barcode.
The single-cell RNA metrics file <prefix>.scRNA_metrics.csv
includes summary metrics on the read-barcodes, the transcripts and features, the molecule counts, and the cell counts. It also includes the threshold used for calling PASS cells.
A full description of the output files can be found in the DRAGEN Single-Cell RNA pipeline user guide.
Note that additional output files and figures for PIPseq mode may be found in the DRAGEN Reports output.
DRAGEN also supports processing samples from Illumina's CRISPR Single Cell kits using PIPseq technology. Setting --scrna-enable-pipseq-crispr-mode
to true activates this mode.
Activating PIPseq CRISPR mode automatically configures DRAGEN for processing feature reads containing the CRISPR guide RNA (gRNA) sequences. This includes handling offsets in the cell-barcode position for the gRNA reads, transforming the gRNA cell-barcodes to match the gene expression ones, utilizing the "hook and grab" approach for identifying the gRNA reads, and counting the gRNA reads (disregarding BIs and IMIs). Both gene expression and gRNA reads are processed in the same single cell workflow, so extra steps are added to identify the hook sequence of gRNA reads. Note: unmapped reads do not contribute to gene expression read counts but are still included in gRNA counts if they match the hook sequence.
The “hook and grab” method is a targeted approach for identifying CRISPR perturbation reads. It leverages a conserved sequence within the guide RNA structural region as a “hook” to locate the guide RNA and then “grabs” the specific guide by mapping it to a database of known sequences based on their displacement from the hook.
When PIPseq CRISPR mode is activated, --scrna-enable-pipseq-mode
must also be set to true. For a standard PIPseq CRISPR workflow, the CRISPR input is usually provided as an additional pair of FASTQ files and passed to DRAGEN with a FASTQ list input. Furthermore, a CRISPR barcode reference file must be provided using the --scrna-feature-barcode-reference
argument.
The CRISPR barcode reference file should be in CSV format with the following header fields:
id
Feature/gene ID
name
Feature/gene name
read
Read that the guide RNA is on (R1 or R2)
target
Target or "hook" sequence. If the value is set to "5p", then the target sequence is based on the beginning of the read (default value: "5p")
position
Displacement of the start of the identifying guide RNA sequence relative to the start of the target sequence (can be positive or negative)
sequence
Guide RNA sequence
feature_type
Name of feature (e.g. CRISPR Direct Capture)
Examples of target, position, and sequence values for common use cases are provided below:
The features that were found are appended to the end of <prefix>.scRNA.features.tsv.gz
. The feature counts for all the barcodes are found in the <prefix>.scRNA.matrix.mtx.gz
raw matrix file and the feature counts for the filtered barcodes are found in the <prefix>.scRNA.filtered.matrix.mtx.gz
matrix file.
The following additional metrics are output in <prefix>.scRNA_metrics.csv
when PIPseq CRISPR mode is enabled:
Total CRISPR reads matching known barcodes
Total number of guide RNA FASTQ barcodes that match the known barcode list after error-correction
CRISPR tag mapping rate
Total count of guide RNAs in the raw matrix divided by "Total CRISPR reads matching known barcodes"
CRISPR fraction valid reads in cells
Total count of guide RNAs in the filtered matrix divided by "Total CRISPR reads matching known barcodes"
Fraction passing cells with CRISPR reads
Number of filtered cell barcodes with at least 1 guide RNA divided by the total number of filtered (or PASS) cell barcodes
Fontanez et al., 2024. Intrinsic molecular identifiers enable robust molecular counting in single-cell sequencing. bioRxiv 2024.10.04.616561.
“The Coupon Collector Problem.” 2022. University of Alabama in Huntsville. April 24, 2022. https://stats.libretexts.org/@go/page/10250.