DRAGEN Single Cell RNA

Prerequisites

  • NovaSeq 6000/6000Dx, NextSeq 2000, or NovaSeq X Series

  • Illumina Single Cell 3' RNA Prep Kit

    • T2: 1 library per sample, 1 FASTQ pair

    • T10: 1 library per sample, 1 FASTQ pair

    • T20: 1 library per sample, 1 FASTQ pair

    • T100: 4 libraries per sample, 1 FASTQ pair

    • T1000: 8 libraries per sample, 8 FASTQ pairs

For more information about the library preparation kit, refer to the Illumina Single Cell Prep Support Site.

A cloud account with a valid subscription is also required. For information on registering your BaseSpace Sequence Hub or Illumina Connected Analytics account, refer to the Software Registration page.

Sample Sheet Introduction

A sample sheet is required for each analysis with DRAGEN Single Cell RNA software. A sample sheet is a comma-separated values (*.csv) file used by Illumina instruments, platforms, and analysis pipelines to store settings and data for sequencing and analysis. The DRAGEN Single Cell RNA software is compatible with the v2 sample sheet. For general information on the v2 sample sheet, refer to Illumina Connected Software - Sample Sheet.

The sample sheet includes a list of samples and their index sequences, along with additional information required to run DRAGEN Single Cell RNA software. Appropriate index adapter sequences are determined by the assay used to perform analysis.

When running analysis on ICA, a valid sample sheet can be created by:

  • BaseSpace Run Planner (preferred), see Run Planning in BaseSpace Sequence Hub for details

  • Downloading and modifying a sample sheet template following the requirements, see Sample Sheet Requirements for details

Run Planning in BaseSpace Sequence Hub

The BaseSpace Sequence Hub Run Planning tool generates a valid sample sheet in v2 format for use on a supported sequencer. Filling out the form in the user interface produces a sample sheet with the required fields filled in, which can be used to auto-launch a DRAGEN Single Cell RNA analysis. Refer to Cloud Analysis Auto-launch for more information about the Run Planning workflow and auto-launch.

For NextSeq 2000, the sample sheet created by the Run Planning tool needs to be exported and uploaded to the instrument. When using the NovaSeq X, the sample sheet is automatically available on the instrument.

Use the steps below to create a DRAGEN Single Cell RNA run with the BaseSpace Run Planning tool. To get to the Run Planning tool, open BaseSpace Sequence Hub and navigate to the Runs page by using the navigation bar or by opening the menu on the left-hand side. From the New Run dropdown menu, select Run Planning.

Step 1: Run Settings

Step 2: Configuration

On NovaSeq X Series, this page is called "Configuration 1". The top right-hand corner of the UI displays the Read 1, Read 2, Index 1 and Index 2 entered on the previous run settings screen.

Step 3: Run Configuration and Analysis Settings

Step 4: Run Review

Once all details are captured and pass validation, review the run information and choose the Edit option to correct any information.

For NovaSeq 6000/6000Dx and NextSeq 1000/2000, export the sample sheet so that it can be uploaded to the instrument.

For NovaSeq X Series, the run can be saved as a draft or as a planned run (via the Save as Draft and Save as Planned buttons respectively). Either selection saves the run to the Planned Runs screen on BaseSpace. Once a run is saved as Planned, it appears on the NovaSeq X Series instrument, where it can be selected for sequencing.

The sample sheet for Planned runs can be downloaded by selecting the planned run and then File -> Download -> SampleSheet.

For more information about the auto-launch, refer to Cloud Analysis Auto-launch. For additional information on run planning, refer to Plan Runs on BaseSpace Sequence Hub.

Parameter Name
Required?
Description

Run Name

Required

Run Name can contain up to 255 characters, limited to alphanumeric characters, dashes, underscores, periods, and spaces. It must start with an alphanumeric character, a dash, or an underscore.

Run Description

Optional

Run Description can contain up to 255 characters, excluding square brackets, asterisks, and commas.

Instrument Platform

Required

Choose from DRAGEN Single Cell RNA software supported instruments:

  • NovaSeq X Series

  • NovaSeq 6000/6000Dx

  • NextSeq 1000/2000

Secondary Analysis

Required

Select BaseSpace / Illumina Connected Analytics.

Read 1

Required only for Instrument Platform NovaSeq X Series

45 for DRAGEN Single Cell RNA analysis. May be different if running multiple applications in a single run.

Index 1

Required only for Instrument Platform NovaSeq X Series

10 for DRAGEN Single Cell RNA analysis. May be different if running multiple applications in a single run.

Index 2

Required only for Instrument Platform NovaSeq X Series

10 for DRAGEN Single Cell RNA analysis. May be different if running multiple applications in a single run.

Read 2

Required only for Instrument Platform NovaSeq X Series

72 for DRAGEN Single Cell RNA analysis. May be different if running multiple applications in a single run.

Sample Container ID

Optional

Unique identifier for the container that holds the sample.
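
The Run Name and Run Description rules in the table above can be checked before opening the Run Planning form. The following is a minimal sketch, not part of BaseSpace: the function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and 255 is assumed to be a maximum length.

```python
import re

# Run Name: first character is alphanumeric, a dash, or an underscore;
# the remaining characters may also be periods or spaces (255 characters max, assumed).
RUN_NAME_RE = re.compile(r"[A-Za-z0-9_-][A-Za-z0-9_. -]{0,254}")

# Run Description: up to 255 characters, none of which is a square bracket,
# an asterisk, or a comma. The field is optional, so an empty string is accepted.
RUN_DESCRIPTION_RE = re.compile(r"[^\[\]*,]{0,255}")


def is_valid_run_name(name):
    """Return True if the whole string satisfies the Run Name rule."""
    return RUN_NAME_RE.fullmatch(name) is not None


def is_valid_run_description(description):
    """Return True if the whole string satisfies the Run Description rule."""
    return RUN_DESCRIPTION_RE.fullmatch(description) is not None
```

For example, `is_valid_run_name(".hidden")` is rejected because the name starts with a period, while `is_valid_run_name("Run_01 test.v2")` is accepted.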

Introduction

The DRAGEN Single Cell RNA application is a secondary analysis tool that processes multiplexed single-cell RNA-Seq data from binary base call (BCL) files produced by NovaSeq 6000/6000Dx, NextSeq 1000/2000, and NovaSeq X Series sequencing systems into a cell-by-gene expression matrix.

You can perform secondary analysis in the cloud via BaseSpace Sequence Hub or Illumina Connected Analytics (ICA). When performing secondary analysis in the cloud, the analysis application launches automatically in BaseSpace Sequence Hub or ICA after the sequencing workflow completes.


Application

Required

Select DRAGEN Single Cell RNA - 4.4.4

Description

Optional

Optional Text Field

Library Prep Kit

Required

Select Illumina Single Cell 3’ RNA Prep

Index Adapter Kit

Required

Select a supported index adapter kit:

  • Illumina Single Cell UD 8 Indexes

  • Illumina Single Cell UD Indexes Set A

Reference Genome

Required only for Instrument Platform NovaSeq X Series

Select the appropriate genome reference for the sample type.

Description

Optional

Optional Text Field

Library Prep Kit

Required

Auto-populated from previous step

Index Adapter Kit

Required

Auto-populated from previous step

Index Reads

Required

Defaults to 2 indexes

Read Type

Required

Defaults to Paired End

Read Lengths

Required

Defaults to 45:10:10:72. May be different if running multiple applications in a single run. The default is compatible with 150 cycle SBS kits. If using a larger kit, the Read 2 cycle information can be increased.

There are diminishing returns for increased read lengths as the insert will read through the cDNA sequence into the poly-A region with longer read lengths.

Read 1 should not be updated as it contains the cell barcode and binning index. Longer read lengths will need to be trimmed. Shorter read lengths will impact cell barcode identification.

Sample Table

Required

Fill out the Lanes, Sample ID, and Index ID according to how the samples are prepared with the library preparation kit. See the library preparation kit documentation for more information. The optional Project field specifies the BaseSpace Project that receives the output data. If left empty, Project defaults to the Project name derived from the Experiment/Run name.

Override Cycles

Required

Defaults to U45;I10;I10;Y72. May be different if running multiple applications in a single run.

Reference Genome

Required

Select the appropriate genome reference for the sample type.

RNA Annotation File

Optional

For custom references, use this field to select the corresponding GTF file to use for annotation.

For built-in references, use this field to override the default annotations. The following list shows the default GTFs used for annotation.

  • GENCODE v19

    • Homo sapiens [UCSC] hg19 v5

    • Homo sapiens [UCSC] hg19 v5 Pangenome

    • Homo sapiens [NCBI] hs37d5 v5

    • Homo sapiens [NCBI] hs37d5 v5 Pangenome

  • GENCODE v44

    • Homo sapiens [1000 Genomes] hg38 v5

    • Homo sapiens [1000 Genomes] hg38 v5 Pangenome

  • GENCODE vM23

    • Mus musculus [UCSC] mm10

  • ENSEMBL 98

    • Rattus norvegicus [UCSC] rn6

Configuration Type

Optional

Defaults to Illumina Single Cell 3’ RNA

Barcode Read

Required

Defaults to Read 1

RNA Library Type

Required

Defaults to Stranded Forward

Barcode Position

Required

Defaults to 0_7+11_16+20_25+31_38

UMI/BI Position

Required

Defaults to 39_41
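
The Read Lengths default (45:10:10:72) and the Override Cycles default (U45;I10;I10;Y72) describe the same four reads, so they should agree cycle for cycle. The sketch below checks that agreement. It assumes both values are ordered Read 1, Index 1, Index 2, Read 2 and that each OverrideCycles segment is built from the BCL Convert codes Y, I, U, and N followed by a cycle count; the function names are hypothetical.

```python
import re


def parse_override_cycles(value):
    """Split e.g. 'U45;I10;I10;Y72' into one list of (code, cycles) pairs per read."""
    reads = []
    for segment in value.split(";"):
        parts = re.findall(r"([YIUN])(\d+)", segment)
        # Reject segments that contain anything other than code/count pairs.
        if not parts or "".join(code + n for code, n in parts) != segment:
            raise ValueError(f"unrecognized OverrideCycles segment: {segment!r}")
        reads.append([(code, int(n)) for code, n in parts])
    return reads


def cycles_per_read(value):
    """Total number of cycles consumed by each read in an OverrideCycles string."""
    return [sum(n for _, n in read) for read in parse_override_cycles(value)]


def matches_read_lengths(override_cycles, read_lengths):
    """Compare OverrideCycles against a colon-separated Read Lengths string."""
    expected = [int(x) for x in read_lengths.split(":")]
    return cycles_per_read(override_cycles) == expected
```

If Read 2 is lengthened for a larger SBS kit, the Y segment of OverrideCycles has to grow by the same amount for this check to keep passing.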

FASTQ Processing

For data processed with the Illumina Single Cell 3’ RNA Prep Kit, each read in R1 includes a cellular barcode sequence followed by a 3-base binning index (BI) sequence. R2 contains the sequences of the cDNA constructs created from the captured mRNA. These constructs have random cut sites that serve as intrinsic molecular identifiers (IMIs) and are used for molecular counting.
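
To make the R1 layout concrete, the sketch below slices a Read 1 sequence using the default Barcode Position (0_7+11_16+20_25+31_38) and UMI/BI Position (39_41). It assumes each `<start>_<end>` block is a 0-based, inclusive range and that `+` concatenates blocks, which is the convention this guide states for the feature barcode UMI position. The function names are hypothetical and this is not the DRAGEN implementation.

```python
BARCODE_POSITION = "0_7+11_16+20_25+31_38"  # default cell barcode blocks in R1
BI_POSITION = "39_41"                       # default 3-base binning index in R1


def parse_positions(spec):
    """Turn '0_7+11_16' into [(0, 7), (11, 16)] (0-based, inclusive ends)."""
    ranges = []
    for block in spec.split("+"):
        start, end = (int(x) for x in block.split("_"))
        ranges.append((start, end))
    return ranges


def extract(read, spec):
    """Concatenate the bases covered by every block of a position spec."""
    return "".join(read[start:end + 1] for start, end in parse_positions(spec))


def split_read1(read1):
    """Return (cell_barcode, binning_index) for one R1 sequence."""
    last_needed = max(end for _, end in parse_positions(BI_POSITION))
    if len(read1) <= last_needed:
        raise ValueError("Read 1 is too short to contain the binning index")
    return extract(read1, BARCODE_POSITION), extract(read1, BI_POSITION)
```

With the default positions the barcode is 8 + 6 + 6 + 8 = 28 bases long, which illustrates why a shortened Read 1 affects cell barcode identification.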

When selecting human, it is recommended to use linear references for RNA analysis. See DRAGEN Reference Support for more information.

For more information about FASTQ processing, refer to the DRAGEN documentation detailing the DRAGEN PIPseq scRNA Pipeline.

Manual Launch on BaseSpace

The DRAGEN Single Cell RNA analysis can be manually launched to analyze previously generated FASTQ files by using a BaseSpace App.

Use the steps below to manually launch the DRAGEN Single Cell RNA app in BaseSpace. To get to the app, open BaseSpace Sequence Hub and navigate to the Apps page by using the navigation bar or by opening the menu on the left-hand side. Select or search for the DRAGEN Single Cell RNA app from the list of available apps. Select Launch Application to provide details for your analysis. Detailed steps are provided below.

Select Input Data

Configuration

Parameter Name
Required?
Description

Analysis Name

Required

Name of the analysis

Save Results To

Required

Select the project that will store the analysis results.

Biosample(s)

Required

Browse and select the biosamples to be analyzed.

Reference

Required

Select the reference genome to use in the analysis. The app provides support for common human, mouse, and rat genomes in addition to supporting custom references built by the DRAGEN Reference Builder app.

Custom Reference Files

Optional

  • Ensure "Include RNA Data in Reference" is enabled

Gene Annotation File

Optional

For custom references, select the corresponding GTF file to use. For built-in references, the following list shows the default GTFs. Use this field to override them with custom annotations.

  • GENCODE v19

    • Homo sapiens [UCSC] hg19 v5

    • Homo sapiens [UCSC] hg19 v5 Pangenome

    • Homo sapiens [NCBI] hs37d5 v5

    • Homo sapiens [NCBI] hs37d5 v5 Pangenome

  • GENCODE v44

    • Homo sapiens [1000 Genomes] hg38 v5

    • Homo sapiens [1000 Genomes] hg38 v5 Pangenome

  • GENCODE vM23

    • Mus musculus [UCSC] mm10

  • ENSEMBL 98

    • Rattus norvegicus [UCSC] rn6

Map/Align Output

Required

Select whether to output the alignments in BAM or CRAM format.

Library Kit

Required

Select your Illumina Single Cell 3' RNA Prep Kit.

Barcode Position

Required

Defaults to 0_7+11_16+20_25+31_38 for Illumina Single Cell 3' RNA Prep Kits.

UMI Position

Required

Defaults to 39_41 for Illumina Single Cell 3' RNA Prep Kits.

Barcode/UMI Read

Required

Defaults to Read 1 for Illumina Single Cell 3' RNA Prep Kits.

Barcode/UMI Source

Required

Select the appropriate setting that matches how FASTQ files were generated.

  • FASTQ Header - the FASTQ files were generated with the OverrideCycles sample sheet setting, which writes the R1 sequence to the FASTQ header.

  • Barcode/UMI Read - the Read 1 FASTQ files were created without setting OverrideCycles in the sample sheet, so the Read 1 FASTQ file contains the full sequencing read.

Barcode Sequence List File

Optional

Specify a file containing valid cell barcode sequences. Maps to --single-cell-barcode-sequence-whitelist in command line arguments. Not required for Illumina Single Cell 3' RNA Prep Kits.

RNA Library Type

Required

Auto-populated for Illumina Single Cell 3' RNA Prep Kits.

Poly-A Trimming

Optional

Select the poly-A trimming method. Disabled for Illumina Single Cell 3' RNA Prep Kits.

Demultiplexing

Parameter Name
Required?
Description

Demultiplexing Method

Optional

Select genotype-based or genotype-free sample demultiplexing.

Sample VCF

Optional

Specify a VCF file for genotype-based demultiplexing. Maps to --single-cell-demux-sample-vcf in command line arguments.

Reference VCF

Optional

Specify a VCF file for genotype-free demultiplexing. Maps to --single-cell-demux-reference-vcf in command line arguments.

Number of Samples

Optional

Specify the number of samples for genotype-free demultiplexing. Maps to --single-cell-demux-number-samples in command line arguments.

Detect Doublets

Optional

Enable doublet detection in sample demultiplexing. Maps to --single-cell-demux-detect-doublets in command line arguments.

Cell Hashing and Feature Counting

Parameter Name
Required?
Description

Cell Hashing and Feature Counting

Optional

Use the checkboxes to enable cell hashing and feature counting using feature barcode UMI.

Feature Barcode UMI Position

Optional

Feature barcode UMI position uses the format <start index>_<end index>. For example, 11_18 specifies an 8 bp sequence from positions 11 to 18 (inclusive). Positions are 0-based, so the first position is 0.

Cell Hashing Reference

Optional

Specify a CSV or FASTA cell-hashing reference file that contains sample-specific oligo-tags. Maps to --single-cell-cell-hashing-reference in command line arguments.

Detect Doublets

Optional

Select the checkbox to enable doublet detection in cell-hashing sample demultiplexing. Maps to --single-cell-demux-detect-doublets in command line arguments.

Feature Barcode Reference

Optional

Specify a CSV or FASTA feature reference file that contains feature barcodes. Maps to --single-cell-feature-barcode-reference in command line arguments.

Advanced Settings

Parameter Name
Required?
Description

Expected Number of Cells

Optional

Specify the expected number of cells. The DRAGEN default is used if not set. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.

Thresholding Method

Optional

Specify the method for determining the count threshold value.

  • Ratio: DRAGEN estimates the count threshold as max(Te, Tm). Tm is 10% of the count seen in the cell at the 10th percentile of the expected cells. Te is 50% of the count seen in the least abundant expected cell.

  • Inflection: DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.

  • Fixed: The count threshold is set to force the expected number of cells.

Maps to --single-cell-threshold in command line arguments.
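
The Ratio method described above can be sketched in a few lines. This is an illustration under stated assumptions, not the DRAGEN implementation: the expected cells are taken to be the N barcodes with the highest counts, and "the cell at the 10th percentile of the expected cells" is read as the barcode at rank index int(0.10 * N) counting from the most abundant one. The function names are hypothetical.

```python
def ratio_threshold(counts, expected_cells):
    """Estimate the count threshold as max(Te, Tm) from per-barcode counts."""
    ranked = sorted(counts, reverse=True)
    n = min(expected_cells, len(ranked))
    if n < 1:
        raise ValueError("need at least one barcode and one expected cell")
    expected = ranked[:n]  # assumed: the n most abundant barcodes
    # Tm: 10% of the count at the 10th percentile of the expected cells
    # (percentile indexing is an assumption, see the lead-in).
    tm = 0.10 * expected[int(0.10 * n)]
    # Te: 50% of the count of the least abundant expected cell.
    te = 0.50 * expected[-1]
    return max(te, tm)


def passing_cells(counts, expected_cells):
    """Number of barcodes whose count reaches the Ratio threshold."""
    threshold = ratio_threshold(counts, expected_cells)
    return sum(1 for c in counts if c >= threshold)
```

When the expected cells are fairly uniform, Te dominates; when a few barcodes are far more abundant than the rest, Tm keeps the threshold from dropping too low.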

Additional Arguments

Use the Additional Arguments section to define any custom settings. Below are some commonly used additional arguments.

Argument
Description

--annotation-file-ignore-biotypes=none

When selecting the Illumina Single Cell 3’ RNA Library Prep Kit, the pipeline will automatically ignore pseudogenes, shortRNA, and rRNA biotypes during mapping. This behavior can be disabled by adding "--annotation-file-ignore-biotypes=none".

Launch Application

Accept the BaseSpace Labs disclaimer and select Launch Application to begin your analysis.

The DRAGEN Single Cell RNA app only supports Biosample inputs. For more information on Biosamples, refer to the BaseSpace Data Model.

Custom references can be generated from a FASTA file and optionally a GTF file with the DRAGEN Reference Builder app. For more information, refer to Prepare a Reference Genome.

Transcript Counting

Within each barcode and gene combination, IMIs are grouped in one of 64 bins, based on the 3-base binning index. For each bin, all identical IMIs are collapsed into a single count, since they are likely PCR duplicates of the same fragment generated during library prep.

Any barcode and gene combination that has ten or fewer unique binning indexes is assigned the number of unique binning indexes as its final count estimate. The pipeline then totals the number of IMIs associated with each remaining barcode and gene combination, and divides that number by the IPM correction factor, which accounts for the additional copies generated from a single captured molecule during five amplification cycles. The final count is the maximum between the floor of this value and the number of unique binning indexes for this barcode and gene.
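
The counting rule above can be written out for a single barcode and gene combination. The sketch below assumes each observation is a (binning index, IMI) pair, where the IMI is any hashable value that identifies the cut site, and that the IPM correction factor has already been estimated for the sample. The function name and data layout are hypothetical.

```python
import math
from collections import defaultdict

BIN_ONLY_MAX = 10  # combinations with this many unique bins or fewer use the bin count


def molecule_count(observations, ipm):
    """Final count for one barcode/gene combination.

    observations: iterable of (binning_index, imi) pairs, one per read.
    ipm: IMIs-per-molecule correction factor for the sample.
    """
    imis_by_bin = defaultdict(set)
    for binning_index, imi in observations:
        # Identical IMIs within a bin collapse into one (likely PCR duplicates).
        imis_by_bin[binning_index].add(imi)

    unique_bins = len(imis_by_bin)
    if unique_bins <= BIN_ONLY_MAX:
        return unique_bins

    total_imis = sum(len(imis) for imis in imis_by_bin.values())
    # Never report fewer molecules than the number of distinct bins observed.
    return max(math.floor(total_imis / ipm), unique_bins)
```

Taking the maximum with the number of unique bins acts as a floor: every distinct binning index implies at least one captured molecule.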

Error Correction

Because all IMIs from the same parent molecule share a binning index, the number of unique binning indexes observed within a specific barcode and gene is determined by the number of molecules and is not impacted by the number of IMIs that were produced by the molecules. This means that the probabilistic relationship between the number of unique bins and the true number of molecules in a barcode and gene combination is constant and is the result of random sampling from the 64 possible bin indexes when each molecule is captured. For the subset of barcode and gene combinations with between 5 and 32 unique bin indexes, dividing the total number of IMIs by the average number of molecules expected based on the number of unique bin indexes gives you the estimated average IMIs per molecule (IPM).

The estimated molecular count for a barcode and gene is the total number of IMIs divided by the IPM, rounded down. The more true molecules a barcode and gene combination has, the closer its true average IMIs per molecule should be to the average IPM of the sample. For barcode and gene combinations with very few molecules, the number of unique bins is expected to be a better predictor of the molecular count than the number of IMIs. The number of molecules in each of these combinations is low, so the variance in their true IMIs per molecule is high. For this reason, IPM correction is applied to barcode and gene combinations with more than 10 unique bin indexes. Otherwise, the corrected count is equal to the number of unique bin indexes.
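
One way to picture the IPM estimate is as an occupancy problem: if m molecules each draw one of 64 bin indexes at random, the expected number of distinct bins is 64 * (1 - (63/64)^m). The sketch below inverts that expression to obtain the molecules expected for an observed number of unique bins. The inversion, the inclusive 5 to 32 range, and the ratio-of-sums pooling are assumptions made for illustration, since this guide does not give the exact formula used by DRAGEN. The function names are hypothetical.

```python
import math

NUM_BINS = 64  # a 3-base binning index has 4**3 possible values


def expected_molecules(unique_bins):
    """Molecules expected to produce `unique_bins` distinct bins (assumed model)."""
    if not 0 < unique_bins < NUM_BINS:
        raise ValueError("unique_bins must be between 1 and 63")
    return math.log(1 - unique_bins / NUM_BINS) / math.log(1 - 1 / NUM_BINS)


def estimate_ipm(combinations):
    """Estimate IMIs per molecule from (unique_bins, total_imis) pairs.

    Only combinations with 5 to 32 unique bins (assumed inclusive) contribute.
    """
    total_imis = 0
    total_molecules = 0.0
    for unique_bins, imis in combinations:
        if 5 <= unique_bins <= 32:
            total_imis += imis
            total_molecules += expected_molecules(unique_bins)
    if total_molecules == 0:
        raise ValueError("no combination has between 5 and 32 unique bins")
    return total_imis / total_molecules
```

Under this model, a combination with 32 unique bins corresponds to roughly 44 molecules, which shows how bin collisions make the raw bin count an underestimate as the molecule number grows.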

For more information about Transcript Counting and Error Correction, refer to the DRAGEN documentation detailing the DRAGEN PIPseq scRNA Pipeline.

PIPseq CRISPR Mode

DRAGEN also supports processing samples from Illumina's CRISPR Single Cell kits using PIPseq technology. Setting --scrna-enable-pipseq-crispr-mode to true activates this mode.

Activating PIPseq CRISPR mode automatically configures DRAGEN for processing feature reads containing the CRISPR guide RNA (gRNA) sequences. This includes handling offsets in the cell-barcode position for the gRNA reads, transforming the gRNA cell-barcodes to match the gene expression ones, utilizing the "hook and grab" approach for identifying the gRNA reads, and counting the gRNA reads (disregarding BIs and IMIs). Both gene expression and gRNA reads are processed in the same single cell workflow, so extra steps are added to identify the hook sequence of gRNA reads. Note: unmapped reads do not contribute to gene expression read counts but are still included in gRNA counts if they match the hook sequence.

The “hook and grab” method is a targeted approach for identifying CRISPR perturbation reads. It leverages a conserved sequence within the guide RNA structural region as a “hook” to locate the guide RNA and then “grabs” the specific guide by mapping it to a database of known sequences based on their displacement from the hook.
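
The idea behind "hook and grab" can be sketched as a string search followed by a lookup. The example below is illustrative only: the hook sequence, the guide length, the displacement, and the guide names are hypothetical placeholders, matching is exact (a real pipeline may tolerate mismatches), and the function name is not a DRAGEN API.

```python
def grab_guide(read, hook, offset, guide_length, known_guides):
    """Locate the hook in a read and look up the guide at a fixed displacement.

    offset: position of the guide start relative to the hook start
            (negative when the guide lies upstream of the hook).
    known_guides: dict mapping guide sequence to guide name.
    Returns the guide name, or None if the hook or guide is not found.
    """
    hook_start = read.find(hook)
    if hook_start < 0:
        return None  # no hook, so the read is not treated as a gRNA read

    guide_start = hook_start + offset
    guide_end = guide_start + guide_length
    if guide_start < 0 or guide_end > len(read):
        return None  # the guide would fall outside the read

    candidate = read[guide_start:guide_end]
    return known_guides.get(candidate)
```

A usage example with placeholder sequences:

```python
guides = {"ACGTACGTACGTACGTACGT": "guide_A"}  # hypothetical guide database
hook = "GTTTAAGAGC"                           # hypothetical hook sequence
read = "TT" + "ACGTACGTACGTACGTACGT" + hook + "AAAA"
name = grab_guide(read, hook, offset=-20, guide_length=20, known_guides=guides)
```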

For more information about PIPseq CRISPR mode, refer to the DRAGEN documentation detailing the DRAGEN PIPseq scRNA Pipeline.

DRAGEN Report

Every run of Illumina DRAGEN Single Cell software produces a DRAGEN report in HTML format which includes QC metrics for trimming, mapping, and single cell analysis as well as a barcode rank plot. Below is a description of metrics and plots on each tab of the DRAGEN report.

Single-Cell RNA
DRAGEN-FastQC
Mapping
Trimmer

Trimmer

Trimmed Reads

Metric
Description

Input Reads

Total number of input reads to DRAGEN

Max Read Length

Maximum detected input read length

Average Input Read Length

Average input read length to DRAGEN, after any adapter trimming by BCL Convert

Masked

Total number of 3’ poly-G bases masked from mapping

Trimmed

Total number of reads trimmed by DRAGEN

Filtered

Total number of reads removed from the input by DRAGEN

Trimmer Metrics

Metric
Description

Fixed-Length

Total number of fixed-length trimmed reads

Adapter

Total number of adapter trimmed reads

Mapping

Read-Level Metrics

Metric
Description

Total input reads

Total number of input reads

QC-failed reads

Total number of reads failing one or more quality checks

% QC-failed

Percentage of reads failing one or more quality checks

Unique reads

Total number of unique reads

% Unique

Percentage of reads that are unique

Mapped reads

Total number of mapped reads, adjusted for filtered and excluded targets

% Mapped

Percentage of reads that are mapped, adjusted for filtered and excluded targets

Base-Level Metrics

Metric
Description

Total Bases

Total number of input bases

Mapped R1

Total number of mapped bases on R1

% Mapped R1

Percentage of R1 bases mapped

% Q30 R1

Percentage of R1 bases with phred quality score >=30

Deduplication

The deduplication bar chart reflects the ratio of unique and deduplicated reads.

Read MAPQs

The read MAPQs bar chart shows the percentage of reads in each Phred-scaled mapping quality (MAPQ) category (Q0-Q10, Q10-Q20, Q20-Q30, Q30-Q40, Q40+).

Single-Cell Clustering

UMAP Plot

The UMAP plot allows visualization of individual cells in 2D space to capture the similarities between cells. Clustering is performed using the Louvain method.

Top Marker Genes

The Top Marker Genes table includes the top 10 marker genes for each cluster of the UMAP. For each gene, the gene name, ENSEMBL gene ID, log2 fold change, and p-value are shown. The contents of the table are also available in CSV format by selecting Download CSV.

DRAGEN FastQC

DRAGEN FASTQC Plots

Plot
Description

Base Quality by Position

Phred-scale quality value for bases at a given location

Mean Base Quality by Position

Average Phred-scale quality value of bases with a specific nucleotide and at a given location in the read

Read Length Distribution

Total number of reads with each observed length

Read Quality Distribution

Total number of reads with each observed average Phred-scale quality score

%GC Content

Percentage of sequences with each GC content across the whole length of each sequence compared to a modelled normal distribution of GC content

Read Quality by %GC Content

Average Phred-scale read mean quality for reads with each GC content percentile between 0% and 100%

Ambiguous Base Content by Position

Percent ambiguous bases at a given location

Base Content by Position

Percent of bases of each specific nucleotide at given locations in the read

Adapter Content by Position

Cumulative percentage of the library in which each adapter sequence has been seen at each position. Once an adapter sequence has been seen in a read, it is counted as present through to the end of the read, so the percentages only increase with position in the read

Sample Sheet Requirements

The DRAGEN Single Cell RNA software has optional and required fields in addition to general sample sheet requirements. Below is a description of the fields in each section.

[Sequencing_Settings]

Parameter
Required?
Details

LibraryPrepKits

Required

Accepted values are: IlluminaSingleCell3RNAPrep

[BCLConvert_Settings]

Parameter
Required?
Details

SoftwareVersion

Required

The DRAGEN component software version. DRAGEN Single Cell RNA software requires 4.4.0.

NoLaneSplitting

Required

TRUE for DRAGEN Single Cell RNA software

TrimUMI

Required

0 for DRAGEN Single Cell RNA software

OverrideCycles

Required

U45;I10;I10;Y72 for DRAGEN Single Cell RNA software. May be different if running multiple applications in a single run.

FastqCompressionFormat

Required

gzip
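
The [BCLConvert_Settings] values listed above lend themselves to a quick pre-flight check of a manually edited sample sheet. The sketch below compares a dictionary of settings against the documented values. The function name is hypothetical, values are compared as exact strings (an assumption), and an OverrideCycles mismatch may be legitimate when multiple applications share a run.

```python
# Values documented in the [BCLConvert_Settings] table for DRAGEN Single Cell RNA.
EXPECTED_BCLCONVERT_SETTINGS = {
    "NoLaneSplitting": "TRUE",
    "TrimUMI": "0",
    "OverrideCycles": "U45;I10;I10;Y72",
    "FastqCompressionFormat": "gzip",
}
REQUIRED_KEYS = ("SoftwareVersion",) + tuple(EXPECTED_BCLCONVERT_SETTINGS)


def check_bclconvert_settings(settings):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append(f"{key} is missing")
    for key, expected in EXPECTED_BCLCONVERT_SETTINGS.items():
        actual = settings.get(key)
        if actual and actual != expected:
            problems.append(f"{key} is {actual!r}, expected {expected!r}")
    return problems
```

The check reports every issue rather than stopping at the first one, which is convenient when reviewing a sheet by hand.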

[BCLConvert_Data]

Parameter
Required?
Details

Sample_ID

Required

Must match a Sample_ID listed in the [Cloud_DragenSingleCellRna_Data] and [Cloud_Data] sections.

Index

Required

Index 1 sequence

Index2

Required

Index 2 sequence

Lane

Only for NovaSeq 6000/6000Dx workflow

Indicates which lane corresponds to a given sample. Enter a single numeric value per row. Cannot be empty, i.e. the analysis fails if the Lane column is present without a value in each row.

[Cloud_DragenSingleCellRna_Settings]

Parameter
Required?
Details

SoftwareVersion

Required

The DRAGEN component software version. DRAGEN Single Cell RNA software requires 4.4.0.

EnablePipseqMode

Required

TRUE for Illumina Single Cell 3’ RNA kit. Maps to --scrna-enable-pipseq-mode in command line arguments.

ReferenceGenomeDir

Required

Location of reference genome TAR containing a DRAGEN hash table and optionally a GTF.

BarcodeRead

Required

Read1 for Illumina Single Cell 3’ RNA kit

RnaLibraryType

Required

SF for Illumina Single Cell 3’ RNA kit (stranded forward). Maps to --rna-library-type in command line arguments.

BarcodePosition

Required

0_7+11_16+20_25+31_38 for Illumina Single Cell 3’ RNA kit. Maps to --scrna-barcode-position in command line arguments.

UmiPosition

Required

39_41 for Illumina Single Cell 3’ RNA kit. Maps to --scrna-umi-position in command line arguments.

[Cloud_DragenSingleCellRna_Data]

Parameter
Required?
Details

Sample_ID

Required

Must match a Sample_ID listed in the [BCLConvert_Data] and [Cloud_Data] sections.

[Cloud_Settings] for Auto-launch

Parameter
Required?
Details

GeneratedVersion

Not Required

The cloud version used to create the sample sheet. Optional if manually updating a sample sheet. (ex: 1.17.0.202411192008).

Cloud_Workflow

Not Required

ica_workflow_1

BCLConvert_Pipeline

Required

The value is a universal record number (URN). The valid value is:

urn:ilmn:ica:pipeline:730df76f-715a-45bf-9500-e6e0ce1ab224#BclConvert_v4_3_13

Cloud_DragenSingleCellRna_Pipeline

Required

The value is a URN in the following format:

urn:ilmn:ica:pipeline:b3c5ab5f-2853-4873-93c4-61a807f844a7#DRAGEN_Single_Cell_RNA_4-4-2_-_Sequencer_Integration_Only

[Cloud_Data] for Auto-Launch

Parameter
Required?
Details

Sample_ID

Required

Must match a Sample_ID listed in the [BCLConvert_Data] and [Cloud_DragenSingleCellRna_Data] sections.

ProjectName

Not Required

The BaseSpace Sequence Hub project name

LibraryName

Not Required

Combination of sample ID and index values in the following format: sampleID_Index_Index2.

LibraryPrepKit

Required

The Library Prep Kit used

IndexAdapterKitName

Required

The Index Adapter Kit used
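
Because each Sample_ID must appear in [BCLConvert_Data], [Cloud_DragenSingleCellRna_Data], and [Cloud_Data], a cross-check of the three sections catches a common manual editing mistake. The sketch below parses a sample sheet with the standard `csv` module and reports IDs missing from any section. It assumes each data section starts with a header row containing a Sample_ID column; the function names are hypothetical and the parser is deliberately minimal.

```python
import csv
import io

DATA_SECTIONS = ("BCLConvert_Data", "Cloud_DragenSingleCellRna_Data", "Cloud_Data")


def read_sections(text):
    """Split sample sheet text into {section name: list of non-empty CSV rows}."""
    sections = {}
    current = None
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank lines
        if cells[0].startswith("[") and cells[0].endswith("]"):
            current = cells[0][1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(cells)
    return sections


def sample_ids(rows):
    """Collect the Sample_ID values from a data section (header row first)."""
    if not rows:
        return set()
    header, *body = rows
    column = header.index("Sample_ID")
    return {row[column] for row in body if len(row) > column and row[column]}


def missing_sample_ids(text):
    """Return {section: IDs present elsewhere but missing from that section}."""
    sections = read_sections(text)
    ids = {name: sample_ids(sections.get(name, [])) for name in DATA_SECTIONS}
    all_ids = set().union(*ids.values())
    return {name: all_ids - found for name, found in ids.items() if all_ids - found}
```

An empty dictionary means the three sections list the same set of sample IDs.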

The preferred method for creating sample sheets is to use Run Planning in BaseSpace Sequence Hub.

Single-Cell RNA

Single Cell RNA Metrics

Metric
Description

Total Input Reads

Total number of reads for the sample

% Mapped

Percentage of reads that are mapped to the genome

Total Barcoded Reads

Number of reads with barcodes that match the whitelist

Passing Cells

Total unique barcodes that pass the count threshold for a passing cell

% Reads in Passing Cells

Number of reads in passing barcodes divided by the total number of reads across all barcodes, expressed as a percentage

Mean Reads per Cell*

Mean reads per cell (Total gene reads / Passing cells)

Median Reads per Cell

The median number of reads per passing barcode

Median Molecules per Cell

The median number of captured RNAs per passing barcode

Median Genes per Cell

The median number of unique genes identified per passing barcode

% Sequencing Saturation

Percentage chance that if you sequenced an additional read, it would have already been observed at least once

*Only available for analyses with PIPseq CRISPR mode enabled

Barcode Rank Plot

The barcode rank plot (often referred to as the “knee plot”) orders barcodes by the number of transcripts associated with them. Typically, the cell barcodes are concentrated at the top of the rank plot, whereas the background barcodes are concentrated in the lower portion of the plot. The purpose of cell calling is to find a point in the first “knee” area that separates the cells from the background.
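
The data behind a barcode rank plot is simply the per-barcode counts sorted in descending order. The sketch below builds that ranking and labels each barcode as “PASS” or “LOW” for a given count threshold. The function names are hypothetical, and treating a count equal to the threshold as passing is an assumption.

```python
def barcode_rank_curve(counts_by_barcode):
    """Return (rank, barcode, count) tuples, rank 1 being the most abundant barcode."""
    ranked = sorted(counts_by_barcode.items(), key=lambda item: item[1], reverse=True)
    return [(rank, barcode, count) for rank, (barcode, count) in enumerate(ranked, start=1)]


def classify_barcodes(counts_by_barcode, threshold):
    """Label each barcode PASS (cell) or LOW (background) against a count threshold."""
    return {
        barcode: "PASS" if count >= threshold else "LOW"
        for barcode, count in counts_by_barcode.items()
    }
```

Plotting rank against count on log-log axes reproduces the familiar knee shape, with the threshold falling in the first steep drop.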

Extended Single-Cell RNA Metrics

This section is only available for analyses with PIPseq CRISPR mode enabled.

Metric
Description

Input Gene Expression Reads

Total input gene expression reads

Input Feature Reads

Total input feature reads

Barcoded Gene Expression Reads

Total barcoded gene expression reads

Barcoded Feature Reads

Total barcoded feature reads

Unique Exon Reads

Unique exon matching reads

Unique Intron Reads

Unique intron matching reads

Filtered Antisense Reads

Filtered antisense reads

Filtered Ambiguous Reads

Filtered ambiguous reads

Filtered Low MAPQ Reads

Filtered low MAPQ reads

Filtered Non-Matching Reads

Filtered non-matching reads

Mitochondrial Reads

Mitochondrial reads

Total Gene Reads

Total gene reads

Counted Gene Reads

Total counted gene reads

Molecules

Total molecules

Genes Detected

Total genes detected

Barcode Summary Metrics

Metric
Description

Cell Type

Barcode classification category. “PASS” for passing cells or “LOW” for background cells.

Total Gene Reads

Total number of genic reads in the category

Molecules

Total number of molecules in the category

Genes

Total number of unique genes detected for each cell in the category

Mitochondrial Reads

Total number of mitochondrial reads in the category. This is based on gene names that include prefixes like “MT-” and may not work with every mapping reference.

Feature Molecules*

Total number of CRISPR or other target molecules in the category

Features*

Total number of unique CRISPR sequences or other targets detected in the category

*Only available for analyses with PIPseq CRISPR mode enabled
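
The note that mitochondrial read counts depend on gene name prefixes can be illustrated with a short example. The sketch below sums reads for genes whose names start with a given prefix. The function name and the input layout are hypothetical. Human annotations typically name mitochondrial genes with “MT-”, while mouse annotations typically use lowercase “mt-”, so a prefix list tuned for one reference can return zero on another.

```python
def mitochondrial_reads(reads_by_gene, prefixes=("MT-",)):
    """Sum the reads of genes whose names start with any of the given prefixes."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of candidates
    return sum(n for gene, n in reads_by_gene.items() if gene.startswith(prefixes))
```

Passing `prefixes=("MT-", "mt-")` covers both naming styles in this sketch.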

Feature Metrics

This section is only available for analyses with PIPseq CRISPR mode enabled.

Metric
Description

Input Feature Reads

Total input feature reads

Barcoded Feature Reads

Total barcoded feature reads

Feature Matching Reads

Feature matching reads

Feature Reads

Total feature reads

Feature Molecules

Total feature molecules

Median Feature Reads per Cell

Median feature reads per passing cell

Feature Molecules per Cell

Median feature molecules per passing cell

Features per Cell

Median features per passing cell

Features Detected

Total features detected

CRISPR Reads

Total CRISPR reads matching known barcodes

% Mapped Features

Percentage of CRISPR tags mapped

% Features in Cells

Percentage of valid CRISPR tags in cells

% Cells with Features

Percentage of passing cells with CRISPR reads

Illumina Connected Multiomics

Illumina Connected Multiomics (ICM) is available for further tertiary analysis of single-cell and other multiomic data.

Getting Started

Default Single Cell Analysis

Below are explanations of the steps that are run in the default single-cell analysis that is automatically launched on import of single-cell data in ICM. Also included below are the instructions to launch each step manually if input parameters need to be adjusted from the default settings.

Normalize Counts

Because different cells have different total counts, it is important to normalize the data before downstream analysis. For droplet-based single cell isolation and library preparation methods that use a 3' counting strategy, where only the 3' end of each transcript is captured and sequenced, we recommend the following normalization: (1) CPM (counts per million), (2) Add 1, (3) Log2. This accounts for differences in total UMI counts per cell and log-transforms the data, which makes the data easier to visualize.

  • Click the counts node you wish to normalize

  • Click Normalization and scaling in the context-sensitive task menu on the right

  • Click Normalization

  • Click Use recommended to add the recommended normalization scheme

This adds CPM (counts per million), Add 1, and Log2 to the Normalization order panel. Normalization steps are performed in the order listed, from top to bottom.

  • Click Finish to apply the normalization

Note that in the default single cell analysis pipeline, ICM runs the normalization step a second time to scale the data by subtracting the mean of each feature and dividing by its standard deviation.
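The arithmetic behind the recommended scheme and the follow-up scaling step can be sketched in a few lines of Python. This is a minimal illustration, not ICM's implementation: the function names are hypothetical, the count matrix is assumed to be a list of cells with one count per gene, and the use of the population standard deviation for scaling is an assumption.

```python
import math
from statistics import mean, pstdev

def normalize_log2_cpm(counts):
    """Apply CPM, add 1, then log2 to each cell (row) of a count matrix."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # A cell with no counts stays at zero: log2(0 + 1) = 0
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log2(c / total * 1_000_000 + 1) for c in cell])
    return normalized

def scale_features(matrix):
    """Subtract each feature's mean and divide by its standard deviation."""
    n_features = len(matrix[0])
    scaled = [[0.0] * n_features for _ in matrix]
    for j in range(n_features):
        column = [row[j] for row in matrix]
        mu = mean(column)
        sd = pstdev(column)  # assumption: population standard deviation
        for i, value in enumerate(column):
            # Constant features are set to 0 to avoid dividing by zero
            scaled[i][j] = (value - mu) / sd if sd > 0 else 0.0
    return scaled
```

Calling `scale_features(normalize_log2_cpm(counts))` mirrors the order described above: normalization first, then per-feature scaling.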

Filter Features

A common task in single-cell RNA-Seq analysis is to filter the data to include only informative genes (features). Because there is no gold standard for what makes a gene informative, and the ideal gene filtering criteria depend on your experimental design and research question, ICM has a wide variety of flexible filtering options. The Filter features step can be performed either before or after normalization.

  • Select a data node containing the count matrix

  • Click Filtering in the task menu

  • Click Filter features

  • Select the Filter type and Filter criteria desired

There are four categories of filter available - noise reduction, statistics-based, feature metadata, and feature list.

The noise reduction filter allows you to exclude genes considered background noise based on a variety of criteria. The statistics-based filter is useful for focusing on a certain number or percentile of genes based on a variety of metrics, such as variance. The metadata, saved list, and manual list filters allow you to filter your data set to include or exclude particular genes.

For example, you can use a noise reduction filter to exclude genes that are not expressed by any cell in the data set, but were included in the matrix file. To do so:

  • Click the Noise reduction filter check box

  • Set the Noise reduction filter to Exclude features where value <= 0 in at least 99.9% of cells using the drop-down menus and text boxes

  • Click Finish to apply the filter
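The rule applied by this noise reduction setting can be sketched as follows. This is an illustrative sketch only: the function name and the list-of-cells matrix layout are assumptions, while the default threshold and fraction mirror the values in the example above.

```python
def noise_reduction_filter(counts, threshold=0.0, min_fraction=0.999):
    """Return the column indices of the features that are kept.

    A feature is excluded when its value is <= threshold in at least
    min_fraction of the cells.
    """
    n_cells = len(counts)
    kept = []
    for j in range(len(counts[0])):
        n_low = sum(1 for cell in counts if cell[j] <= threshold)
        if n_low / n_cells < min_fraction:
            kept.append(j)
    return kept
```

For a small matrix such as `[[0, 1, 0], [0, 0, 2], [0, 3, 0]]`, the first feature is zero in every cell and is dropped, while the other two are kept.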

The default single cell pipeline uses the statistics-based filter to filter for the top 10 features with the highest variance.
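A statistics-based filter of this kind ranks features by a metric and keeps the top N. The sketch below uses variance as the metric; the function name is hypothetical, and the choice of population variance is an assumption rather than a documented detail of ICM.

```python
from statistics import pvariance

def top_variance_features(matrix, n_top):
    """Return the indices of the n_top features with the highest variance."""
    n_features = len(matrix[0])
    # Variance of each feature (column) across all cells (rows)
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    # Return the selected indices in their original column order
    return sorted(ranked[:n_top])
```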

PCA

Principal component analysis (PCA) is an exploratory technique used to describe the structure of high-dimensional data by reducing its dimensionality. Because PCA is used to reduce the dimensionality of the data prior to clustering as part of a standard single cell analysis workflow, it is useful to examine the PCA results for your data set before clustering.

  • Select a data node containing the normalized and filtered count matrix

  • Click Exploratory analysis in the task menu

  • Click PCA from the drop-down list

  • Select the number of features to include

  • Select the number of PCs to calculate

You can choose Features contribute equally to standardize the genes prior to PCA, or choose By variance to allow more variable genes to have a larger effect on the PCA. By default, we take variance into account and focus on the most variable genes.

If you have multiple samples, you can choose to run PCA for each sample individually or for all samples together by selecting or not selecting the Split by sample option.

  • Click Finish to run

A new PCA task node will be produced on the task graph for the analysis. When complete, double-click the PCA task node to open the 3D PCA scatter plot in data viewer.

Besides the PCA coordinates of the cells, the PCA task report also includes the Scree plot, the component loadings table, and the PC projections table.

The Scree plot lists PCs on the x-axis and the amount of variance explained by each PC on the y-axis, measured in Eigenvalue. The higher the Eigenvalue, the more variance is explained by the PC. Typically, after an initial set of highly informative PCs, the amount of variance explained by analyzing additional PCs is minimal. By identifying the point where the Scree plot levels off, you can choose an optimal number of PCs to use in downstream analysis steps like graph-based clustering, UMAP and t-SNE.
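Reading the Scree plot can be approximated numerically. The sketch below converts eigenvalues into the fraction of variance explained and counts the leading PCs that each contribute at least a chosen share. The function names and the `min_gain` cutoff are hypothetical heuristics for illustration, not an ICM feature, and the eigenvalues are assumed to be sorted from largest to smallest.

```python
def explained_variance_ratio(eigenvalues):
    """Fraction of the total variance explained by each PC."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    return [ev / total for ev in eigenvalues]

def pcs_before_plateau(eigenvalues, min_gain=0.01):
    """Count the leading PCs that each explain at least min_gain of the variance."""
    n_pcs = 0
    for ratio in explained_variance_ratio(eigenvalues):
        if ratio < min_gain:
            break  # the curve has levelled off by this point
        n_pcs += 1
    return n_pcs
```

With eigenvalues `[50, 30, 15, 4, 0.5, 0.5]`, the first four PCs each explain at least 1% of the variance, so the sketch suggests using four PCs downstream. Visual inspection of the plot remains the reference.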

Graph-based Clustering

Graph-based clustering identifies groups of similar cells using PC values as the input. By including only the most informative PCs, noise in the data set is excluded, improving the results of clustering.

  • Click the PCA data node

  • Click Exploratory analysis in the task menu

  • Click Graph-based clustering

Clustering can be performed on each sample individually or on all samples together.

  • Select the Clustering algorithm to use. The default Single-Cell analysis uses the Louvain algorithm.

  • Check Compute biomarkers to compute the features that are highly expressed in each cluster compared with the other clusters

  • Select the number of PCs to use

  • Click Configure to access the Advanced options

The Number of principal components can be set based on your examination of the Scree plot and component loadings table. The default value of 100 is likely exhaustive for most data sets, but may introduce noise that reduces the number of clusters that can be distinguished.

  • Click Finish to run the task

New Graph-based clusters and Biomarkers data nodes will be generated along with the task nodes.

  • Double-click the Graph-based clusters node to see the cluster results and statistics. The Graph-based clustering result lists the Total number of clusters and the proportion of cells that fall into each cluster, as well as the Maximum modularity, a measure of the quality of the clustering result for which the optimal value is 1.

  • Double-click the Biomarkers node to see the computed biomarkers if you selected this option. The Biomarkers node lists the top features for each graph-based cluster, displaying the top 10 genes that distinguish each cluster from the others. Use Download at the bottom right of the table to view and save more features. The biomarkers are calculated using an ANOVA test that compares the cells in each cluster to all the other cells, filtering to genes that are at least 1.5-fold upregulated, and sorting by ascending p-value. This ensures that the top 10 genes of each cluster are highly and disproportionately expressed in that cluster.
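The selection logic described for biomarkers (keep genes that are at least 1.5-fold upregulated, sort by ascending p-value, take the top 10) can be sketched as below. The sketch assumes the per-gene fold change and p-value have already been computed by the statistical test; the function name and the dictionary keys are hypothetical.

```python
def top_biomarkers(results, min_fold_change=1.5, n_top=10):
    """Pick the top genes for one cluster from precomputed test results.

    results: list of dicts with hypothetical keys
             "gene", "fold_change", and "p_value".
    """
    # Keep only genes upregulated by at least the fold-change cutoff
    upregulated = [r for r in results if r["fold_change"] >= min_fold_change]
    # Most significant genes first
    upregulated.sort(key=lambda r: r["p_value"])
    return [r["gene"] for r in upregulated[:n_top]]
```

A gene with a very small p-value but a fold change below the cutoff is dropped before ranking, which is why the top genes are both significant and upregulated.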

UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimensional reduction technique. UMAP aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. UMAP is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.

  • Click the Graph-based clusters or PCA node

  • Click Exploratory analysis in the task menu

  • Click UMAP

  • Select the number of PCs to use

  • Click Configure to access the Advanced options

  • Click Finish to run

If you have multiple samples, you can choose to run UMAP for each sample individually or for all samples together using the Split cells by sample option.

Like Graph-based clustering, UMAP takes PC values as its input and further reduces the data down to two or three dimensions. For consistency, you should use the same number of PCs as the input for UMAP that you used for Graph-based clustering.

A new UMAP task node will be produced. When complete, double-click the UMAP node to open the UMAP task report. Use the panel on the left to modify the plot or add more plots to this Data viewer session.

The UMAP scatter plot is interactive and can be viewed in 2D or 3D. The UMAP plot is 3D by default. You can rotate the 3D plot by left-clicking and dragging your mouse or by using Control under Configure. You can zoom in and out using your mouse wheel, and pan by right-clicking and dragging your mouse. You can use Style to modify color, shape, size, and labeling (e.g. add a fog effect to improve depth perception on the plot). Add a 2D plot by clicking New plot, selecting 2D Scatter plot, and selecting UMAP as the source of the data.

Other Single-Cell Analysis Tasks

QA/QC

The Single-cell QA/QC task in ICM enables you to visualize several useful metrics that will help you include only high-quality cells. To invoke the Single-cell QA/QC task:

  • Click a Single cell counts data node

  • Click the QA/QC section of the task menu

  • Click Single cell QA/QC

By default, all samples are used together to perform QA/QC. You can choose to split by sample and perform QA/QC separately for each sample.

If your Single cell counts data node has been annotated with a gene/transcript annotation, the task will run without a task configuration dialog. However, if you imported a single cell counts matrix without specifying a gene/transcript annotation file, you will be prompted to choose the genome assembly and annotation file by the Single cell QA/QC configuration dialog. Note, it is still possible to run the task without specifying an annotation file. If you choose not to specify an annotation file, the detection of mitochondrial counts will not be possible.
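To show why an annotation matters for mitochondrial detection, the sketch below computes a few per-cell QC values from gene names and counts. It is an illustration only: the function name and the returned keys are hypothetical, total counts and detected genes are assumed here as typical QC metrics, and mitochondrial genes are identified by an "MT-" name prefix, which may not match every reference.

```python
def cell_qc_metrics(gene_names, cell_counts, mito_prefix="MT-"):
    """Compute simple QC values for one cell.

    gene_names and cell_counts are parallel lists for a single cell.
    """
    total = sum(cell_counts)
    detected = sum(1 for c in cell_counts if c > 0)
    # Case-insensitive prefix match so that names such as "mt-Co1" also count
    mito = sum(
        c for name, c in zip(gene_names, cell_counts)
        if name.upper().startswith(mito_prefix.upper())
    )
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"total_counts": total, "detected_genes": detected, "pct_mito": pct_mito}
```

Without gene names there is nothing to match the prefix against, so the mitochondrial percentage cannot be derived from the count matrix alone.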

t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensional reduction technique that prioritizes local relationships to build a low-dimensional representation of the high-dimensional data that places objects that are similar in high-dimensional space close together in the low-dimensional representation. This makes t-SNE well suited for analyzing high-dimensional data when the goal is to identify groups of similar objects, such as cell types in single cell RNA-Seq data.

  • Click the Graph-based clusters or PCA node

  • Click Exploratory analysis in the task menu

  • Click t-SNE

  • Select the number of PCs to use

  • Click Configure to access the Advanced options

  • Click Finish to run

The t-SNE scatter plot visualization has the same functionality and style elements as the UMAP plot described above.

Differential Analysis

A common goal in single cell analysis is to identify genes that distinguish a cell type. To do this, you can use the differential analysis tools in ICM.

  • Click the Normalized counts results node

  • Click Statistics in the toolbox

  • Click Differential Analysis

  • Select ANOVA as the Method to use for differential analysis and click Next

  • Select and add the categorical and numeric factors for analysis

  • Click Next

The differential analysis tool can be used to compare one group of cells to another group of cells to identify genes or features that distinguish cells. Common examples include determining distinguishing genes between one cell type and all others, two cell types, or the same cell type between two experimental conditions.

The comparison builder can be used to create any of these tests. The top panel is the numerator for fold-change calculations so the experimental or test groups should be selected in the top panel. The bottom panel is the denominator for fold-change calculations so the control group should be selected in the bottom panel.
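As a rough illustration of how the two panels map onto a fold change, the sketch below divides the mean value of the numerator group by the mean value of the denominator group for a single gene. The function name, the group labels, and the use of a plain ratio of arithmetic means on linear-scale values are assumptions for illustration; the exact calculation ICM performs is not described here.

```python
from statistics import mean

def fold_change(values, groups, numerator, denominator):
    """Ratio of the mean value in the numerator groups to the denominator groups.

    values: expression values for one gene, one per cell.
    groups: group label of each cell, parallel to values.
    numerator, denominator: sets of group labels (test vs. control).
    """
    num = [v for v, g in zip(values, groups) if g in numerator]
    den = [v for v, g in zip(values, groups) if g in denominator]
    den_mean = mean(den)  # raises StatisticsError if the group is empty
    if den_mean == 0:
        raise ValueError("denominator group mean is zero")
    return mean(num) / den_mean
```

Swapping the two panels inverts the ratio, which is why the test group belongs in the top panel and the control group in the bottom panel.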

  • Add classifications to the numerator

  • Add classifications to the denominator

  • Select Combine for a single comparison or Pairwise for a factorial set of comparisons

  • Select Add comparison

  • Optionally select the checkbox to Apply lowest average coverage filter to exclude a feature if the geometric average of its values over all samples is less than the specified value.

  • Click Configure to access the Advanced options

  • Click Finish to run
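The lowest average coverage filter mentioned in the steps above relies on a geometric average. The sketch below shows one way to compute it and apply a cutoff; the function names and the default cutoff are hypothetical, and treating any zero value as giving a geometric mean of zero is an assumption about edge-case handling.

```python
import math

def geometric_mean(values):
    """Geometric mean of non-negative values; zero if any value is zero."""
    if any(v <= 0 for v in values):
        return 0.0  # assumption: a single zero drives the average to zero
    return math.exp(sum(math.log(v) for v in values) / len(values))

def passes_coverage_filter(values, min_average=1.0):
    """Keep a feature only if its geometric mean reaches the cutoff."""
    return geometric_mean(values) >= min_average
```

Because the geometric mean is pulled down strongly by small values, a feature with high counts in only a few samples is more likely to be excluded than it would be under an arithmetic mean.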

When completed, double-click the newly generated data node to open the ANOVA task report. The ANOVA task report lists genes on rows and the results of the statistical test (p-value, fold change, etc.) on columns. Genes are listed in ascending order by the p-value of the first comparison, so the most significant gene is listed first.

Filter for Significant Genes

Using the filter control panel on the left, we can filter to just the genes that are significantly different for the comparison. The number of genes at the top of the filter control panel updates to indicate how many genes are left after the filters are applied.

Click Generate filtered node to generate a filtered version of the table for downstream analysis. The ANOVA report will close and a new task, the Filter list task, will run and generate a filtered Feature list data node.

Gene set enrichment

While a long list of significantly different genes is important information about a cell type, it can be difficult to identify what the biological consequences of these changes might be just by looking at the genes one at a time. Using enrichment analysis, you can identify gene sets and pathways that are over-represented in a list of significant genes, providing clues to the biological meaning of your results.

  • Click the Feature list data node produced by the Differential analysis filter

  • Click Biological interpretation

  • Click Gene set enrichment

  • Select the Database to use. ICM distributes the gene sets from the Gene Ontology Consortium, but Gene set enrichment can work with any custom or public gene set database.

  • Choose the latest assembly available from the Gene set drop-down

  • Click Finish

When completed, double-click the Gene set enrichment task node to open the task report.

The Gene set enrichment task report lists gene sets on rows with an enrichment score and p-value for each. It also lists how many genes in the gene set were in the input gene list and how many were not. Clicking the Gene set ID links to the geneontology.org or KEGG page for the gene set.
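One common way to test whether a gene set is over-represented in a gene list is a one-sided hypergeometric test on the counts of genes inside and outside the set. The sketch below is a generic illustration of that idea, not a description of how ICM computes its enrichment score or p-value; the function name and argument names are hypothetical.

```python
from math import comb

def overrepresentation_p_value(n_background, n_in_set, n_in_list, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_background: total number of genes considered.
    n_in_set:     genes in the background that belong to the gene set.
    n_in_list:    size of the input (significant) gene list.
    n_overlap:    genes in the list that belong to the gene set.
    """
    max_overlap = min(n_in_set, n_in_list)
    total = comb(n_background, n_in_list)
    # Sum the probabilities of every overlap at least as large as observed
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_in_list - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total
```

For example, with 10 background genes, 5 of them in the set, and a 5-gene list that falls entirely inside the set, the p-value is 1/252.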

Hierarchical clustering / heatmap

Once we have filtered to a list of significantly different genes, we can visualize these genes by generating a heatmap.

  • Click the Filtered feature list data node produced by the Differential analysis filter

  • Click Exploratory analysis in the toolbox

  • Click Hierarchical clustering / heatmap

The hierarchical clustering task will generate the heatmap; choose Heatmap as the plot type. You can choose to Cluster features (genes) and cells (samples) under Feature order and Cell order in the Ordering section. You will almost always want to cluster features as this generates the clear blocks of color that make heatmaps comprehensible. For single cell data sets, you may choose to forgo clustering the cells in favor of ordering them by the attribute of interest.

  • Select Feature order

  • Select Cell order

  • Optionally add any additional Filtering

  • Click Configure to access the Advanced options

  • Click Finish to run

Cell Typing with ScType

ScType allows automated cell-type identification based on scRNA-seq data along with a comprehensive cell marker database as background information.

  • Click the data node containing the non-normalized count matrix

  • Click on Classification > Single cell type in the toolbox

  • Select the marker database from the drop-down menu; the original ScType database is provided by default

  • Select categorical attributes to Categorize by

  • Optionally Filter tissue types

  • Select the SC Type algorithm to use

  • Click Configure to access the Advanced options

  • Click Finish to run

A new ScType classification task node will be produced. When complete, double-click the Single cell type node to open the results of the cell-type identification. For each cell, the tissue, sctype result, and typescore are reported.

Secondary Analysis Results

The following table describes the files created during secondary analysis:

File
Description

<Sample_ID>.scRNA.bam

Binary Alignment Map (BAM) files containing information about all reads in the input FASTQ files that were mapped to the reference genome

<Sample_ID>.scRNA.bam.bai

Index file for the BAM for use by downstream applications

<Sample_ID>.scRNA.barcodeCounts.txt

Text file containing the counts per barcode

<Sample_ID>.scRNA.barcodeSummary.tsv

Summary of barcode statistics

<Sample_ID>.scRNA_metrics.csv

Single cell metrics summary with assay sensitivity and quality metrics

<Sample_ID>.scRNA.matrix.mtx.gz

Sparse matrix with rows that represent features and genes detected, and columns that consist of all barcodes that were detected

<Sample_ID>.scRNA.features.tsv.gz

Information about the features corresponding to the rows of the sparse matrix

<Sample_ID>.scRNA.barcodes.tsv.gz

List of barcodes corresponding to the columns of the sparse matrix

<Sample_ID>.scRNA.filtered.matrix.mtx.gz

Filtered sparse matrix with rows that represent features and genes detected, and columns that consist of all barcodes that were detected

<Sample_ID>.scRNA.filtered.features.tsv.gz

Information about the features corresponding to the rows of the filtered sparse matrix

<Sample_ID>.scRNA.filtered.barcodes.tsv.gz

List of barcodes corresponding to the columns of the filtered sparse matrix
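The matrix, features, and barcodes files can be read together to recover labeled counts. The sketch below uses only the Python standard library and rests on several assumptions: the `.mtx.gz` file is taken to be in Matrix Market coordinate format with 1-based indices, the first tab-separated column of the features file is taken to be the feature identifier, and the barcodes file is taken to hold one barcode per line. The function names are hypothetical.

```python
import gzip

def read_lines(path):
    """Read the non-empty lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]

def read_sparse_counts(matrix_path, features_path, barcodes_path):
    """Return (features, barcodes, counts) from the three sparse matrix files.

    counts maps (feature, barcode) to the stored value.
    """
    # Assumption: the first column of the features file is the feature ID
    features = [line.split("\t")[0] for line in read_lines(features_path)]
    barcodes = read_lines(barcodes_path)
    counts = {}
    size_line_seen = False
    for line in read_lines(matrix_path):
        if line.startswith("%"):
            continue  # Matrix Market header and comment lines
        if not size_line_seen:
            size_line_seen = True  # "rows columns entries" line, not data
            continue
        row, col, value = line.split()[:3]
        # Rows index features and columns index barcodes, both 1-based
        counts[(features[int(row) - 1], barcodes[int(col) - 1])] = float(value)
    return features, barcodes, counts
```

The same function applies to the filtered trio of files, since they follow the same row and column convention described in the table above.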

Accessing Results

For information on tracking and viewing run and analysis results in BaseSpace Sequence Hub, refer to View Data on Basespace.

To view results on ICA, you may either click "View Files in ICA" in the top right corner of your BSSH Analysis page, or directly access the analysis in ICA. It will be in a BSSH-managed project with the same name as your BSSH workgroup. For information on viewing analysis results on your Illumina Connected Analytics account, refer to Viewing Data on ICA.