DRAGEN 16S Pipeline

Summary

The DRAGEN 16S pipeline is a rapid, k-mer-based informatics solution for microbial classification and community profiling of mixed flora and metagenomic samples. It performs secondary analysis of Illumina 16S sequencing data in four steps: read QC (optional), taxonomic classification, result filtering (optional), and reporting.

The input to the pipeline is a FASTQ file or a set of FASTQ files. The analysis produces, for each sample, a table of read counts for the detected taxa and a JSON file with QC results and metadata. It also produces a set of tables, one per taxonomic level, that summarize the results for all samples in the analysis.

The pipeline is supported by a database containing 14,676 bacterial and 660 archaeal full-length 16S rRNA gene sequences. The required resource files can be downloaded following these instructions.

Alternatively, the analysis can be performed using a custom set of reference sequences defined by the user in a FASTA file.

Input Options

The input to the 16S pipeline is a FASTQ or set of FASTQs. The options for providing the input data to the pipeline include:

  1. Analyze a single-ended FASTQ with --16s-fastq1

  2. Analyze a pair of FASTQs with --16s-fastq1 and --16s-fastq2

  3. Analyze one or more FASTQs with --16s-fastq-list

The FASTQ list file must be a CSV with at least the following fields:

  1. RGSM: the sample ID; used to name output files

  2. Read1File: the path to the R1 file

  3. Read2File: the path to the R2 file; leave blank for single-ended data

Other fields in the CSV will be ignored.

Samples that were sequenced across multiple lanes will take up more than one line in the CSV. For example, a sample sequenced across two lanes will have two lines, one for each lane.

Example FASTQ list file with a mixed collection of FASTQ files:

RGSM,Read1File,Read2File
SingleEndedSampleA,/path/to/sampleA_S1_L001_R1_001.fastq.gz,
PairedEndedSampleB,/path/to/sampleB_S1_L001_R1_001.fastq.gz,/path/to/sampleB_S1_L001_R2_001.fastq.gz
MultiLaneSampleC,/path/to/sampleC_S1_L001_R1_001.fastq.gz,/path/to/sampleC_S1_L001_R2_001.fastq.gz
MultiLaneSampleC,/path/to/sampleC_S1_L002_R1_001.fastq.gz,/path/to/sampleC_S1_L002_R2_001.fastq.gz
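
A FASTQ list in this layout can be generated with a short script. The following is a minimal Python sketch, not part of DRAGEN: the helper name `write_fastq_list`, the sample IDs, and the paths are hypothetical, and only the column names RGSM, Read1File, and Read2File come from the description above.

```python
import csv
import io

# Column names required by the FASTQ list file.
FIELDS = ["RGSM", "Read1File", "Read2File"]


def write_fastq_list(records, handle):
    """Write one CSV row per (sample, lane) record.

    Each record is a (sample_id, read1_path, read2_path) tuple.
    read2_path is None or "" for single-ended data, which leaves
    the Read2File field blank.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(FIELDS)
    for sample_id, read1, read2 in records:
        writer.writerow([sample_id, read1, read2 or ""])


# Hypothetical records: one single-ended sample and one paired-end
# sample sequenced across two lanes (one row per lane).
records = [
    ("sampleX", "/path/to/sampleX_S1_L001_R1_001.fastq.gz", None),
    ("sampleY", "/path/to/sampleY_S2_L001_R1_001.fastq.gz",
     "/path/to/sampleY_S2_L001_R2_001.fastq.gz"),
    ("sampleY", "/path/to/sampleY_S2_L002_R1_001.fastq.gz",
     "/path/to/sampleY_S2_L002_R2_001.fastq.gz"),
]

buffer = io.StringIO()
write_fastq_list(records, buffer)
fastq_list_text = buffer.getvalue()
print(fastq_list_text, end="")
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as `handle`.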

Output Files

Sample Summary CSV

The file is named <outputFilePrefix>.<sampleId>.summary.csv. It is a tabular file with counts of how reads were classified in the sample. Each unique classification is a row, with the number of hits and the percentage of total hits recorded for that row. Partial classifications (e.g., where a Class-level classification was made, but not an Order-level classification) have the unclassified entries left blank.
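
To make the row structure concrete, the sketch below tallies a handful of made-up read classifications in Python. It illustrates the logic of the table only and is not DRAGEN's implementation: the tuple layout, the example taxa, and the function name `summarize_hits` are assumptions, and the column headers of the real file may differ.

```python
from collections import Counter

# Taxonomic ranks, from most general to most specific.
RANKS = ("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")


def summarize_hits(classifications):
    """Return (lineage, hits, percent_of_total) rows, most frequent first."""
    counts = Counter(classifications)
    total = sum(counts.values())
    return [
        (lineage, hits, 100.0 * hits / total)
        for lineage, hits in counts.most_common()
    ]


# A full classification down to Species (illustrative values).
full = ("Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
        "Streptococcaceae", "Streptococcus", "Streptococcus mutans")
# A partial classification: resolved to Class, lower ranks left blank.
partial = ("Bacteria", "Firmicutes", "Bacilli", "", "", "", "")

rows = summarize_hits([full, full, full, partial])
for lineage, hits, percent in rows:
    print(",".join(lineage), hits, f"{percent:.1f}")
```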

Sample Metadata JSON

The file is named <outputFilePrefix>.<sampleId>.ap.json. It contains general metadata, version information, sample QC, statistics, and analysis options specified by the user. See Report JSON format for further details.
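
The two per-sample file names can be derived from the output prefix and the sample ID. The Python sketch below shows one way to do that; the helper name `sample_output_paths` is hypothetical, and it assumes the files are written directly into the output directory rather than into a subdirectory.

```python
from pathlib import PurePosixPath


def sample_output_paths(output_directory, output_file_prefix, sample_id):
    """Build the expected per-sample output paths.

    Assumption: both files sit directly in `output_directory`.
    """
    base = PurePosixPath(output_directory)
    stem = f"{output_file_prefix}.{sample_id}"
    return {
        "summary_csv": base / f"{stem}.summary.csv",
        "metadata_json": base / f"{stem}.ap.json",
    }


paths = sample_output_paths("/path/to/output", "analysis1", "sampleA")
for name, path in paths.items():
    print(name, path)
```

Once the analysis has finished, the metadata file can be read with the standard `json` module, for example `json.load(open(paths["metadata_json"]))`.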

Analysis Aggregate CSVs

The files are named for the taxonomic level they correspond to: <outputFilePrefix>.<Kingdom,Phylum,Class,Order,Family,Genus,Species>_Level_Aggregate_Counts.csv. These seven tabular files contain per-level aggregate classified read counts for all samples in the analysis. Each row is a unique classification label that occurred in one or more samples. Each column is a unique sample ID. Entries in the table are the number of classified reads (or read pairs for paired-end data) for a given classification label and sample ID.
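
The naming pattern and the label-by-sample layout can be sketched in Python as follows. The function names and the example counts are hypothetical, and the sketch assumes that a label absent from a sample is reported as 0, which is not stated above.

```python
LEVELS = ("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")


def aggregate_file_names(output_file_prefix):
    """Return the seven aggregate file names, one per taxonomic level."""
    return [
        f"{output_file_prefix}.{level}_Level_Aggregate_Counts.csv"
        for level in LEVELS
    ]


def aggregate_counts(per_sample_counts):
    """Pivot {sample_id: {label: count}} into a label-by-sample table.

    Returns (sample_ids, rows); each row is [label, count, count, ...]
    with one count per sample ID, in the order of sample_ids.
    Assumption: a label missing from a sample is filled with 0.
    """
    sample_ids = sorted(per_sample_counts)
    labels = sorted(
        {label for counts in per_sample_counts.values() for label in counts}
    )
    rows = [
        [label] + [per_sample_counts[s].get(label, 0) for s in sample_ids]
        for label in labels
    ]
    return sample_ids, rows


# Made-up Genus-level counts for two samples.
genus_counts = {
    "sampleA": {"Streptococcus": 120, "Lactobacillus": 30},
    "sampleB": {"Streptococcus": 45},
}
sample_ids, rows = aggregate_counts(genus_counts)
print(aggregate_file_names("analysis1")[5])
print(["label"] + sample_ids)
for row in rows:
    print(row)
```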

Command Line Settings

Option                        Description

Required Inputs

--enable-16s                  Enables the 16S Pipeline. (Default=false).
--output-file-prefix          Prefix for all output files and the identifier for the analysis.
--output-directory            Directory for all output files.
--16s-fastq1                  Path to the R1 FASTQ file.
--16s-fastq2                  Path to the R2 FASTQ file.
--16s-fastq-list              Path to a FASTQ list CSV file.
--16s-db-dir                  Directory containing the 16S database files.
--16s-custom-references       Path to a FASTA file containing reference sequences to use for the analysis, instead of the pre-built database.

Optional Inputs

--intermediate-results-dir    Area for temporary files.
--16s-read-qc                 Whether the reads should be put through a QC process prior to the k-mer analysis. (Default=true).
--16s-report-threshold        Filter classification results from output that have read counts below this threshold. (Default=0).
--16s-report-per-sample       Output reports per sample. (Default=true).
--16s-report-aggregate        Output aggregate reports. (Default=true).
--num-threads                 Number of threads to use for the analysis. (Default=1).
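
The effect of --16s-report-threshold can be illustrated with a small Python sketch. This is an interpretation of the option description, not DRAGEN's code: the function name `apply_report_threshold` and the counts are made up, and the sketch assumes that a classification whose count equals the threshold is kept.

```python
def apply_report_threshold(counts, threshold=0):
    """Drop classifications whose read count is below `threshold`.

    Assumption: counts equal to the threshold are kept, so the
    default of 0 keeps every classification.
    """
    return {label: n for label, n in counts.items() if n >= threshold}


# Made-up read counts for one sample.
counts = {"Streptococcus": 120, "Lactobacillus": 5, "Bacillus": 2}

print(apply_report_threshold(counts))               # default keeps all
print(apply_report_threshold(counts, threshold=5))  # drops Bacillus
```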

Example Commands

The following is an example of an analysis with a paired-end sample and the pre-built database:

/opt/dragen/$VERSION/bin/dragen \
  --enable-16s true \
  --output-file-prefix analysis1 \
  --output-directory /path/to/output/ \
  --16s-fastq1 /path/to/input/sampleA_S1_L001_R1_001.fastq.gz \
  --16s-fastq2 /path/to/input/sampleA_S1_L001_R2_001.fastq.gz \
  --16s-db-dir /path/to/databases/Refseq-RDP-v1/1.0.1/ \
  --num-threads 20

The following is an example of an analysis with a paired-end sample, a set of custom reference sequences, and an increased report threshold:

/opt/dragen/$VERSION/bin/dragen \
  --enable-16s true \
  --output-file-prefix analysis1 \
  --output-directory /path/to/output/ \
  --16s-fastq1 /path/to/input/sampleA_S1_L001_R1_001.fastq.gz \
  --16s-fastq2 /path/to/input/sampleA_S1_L001_R2_001.fastq.gz \
  --16s-custom-references /path/to/references.fna \
  --16s-report-threshold 5 \
  --num-threads 20
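
When the pipeline is launched from a script, it can help to assemble the command as an argument list. The Python sketch below is a hypothetical wrapper, not an Illumina tool: it uses only the options listed above, and it assumes that direct FASTQ paths and a FASTQ list are alternatives, and that the pre-built database and custom references are alternatives. The bare `dragen` binary name is a placeholder for the full path on your system.

```python
import shlex


def build_16s_command(dragen_bin, output_file_prefix, output_directory, *,
                      fastq1=None, fastq2=None, fastq_list=None,
                      db_dir=None, custom_references=None,
                      report_threshold=None, num_threads=None):
    """Return the DRAGEN 16S command as an argument list (nothing is run)."""
    # Assumption: direct FASTQ paths and a FASTQ list are alternatives.
    if (fastq1 is None) == (fastq_list is None):
        raise ValueError("provide either fastq1 (optionally fastq2) or fastq_list")
    if fastq2 is not None and fastq1 is None:
        raise ValueError("fastq2 requires fastq1")
    # Assumption: the pre-built database and custom references are alternatives.
    if (db_dir is None) == (custom_references is None):
        raise ValueError("provide either db_dir or custom_references")

    args = [dragen_bin, "--enable-16s", "true",
            "--output-file-prefix", output_file_prefix,
            "--output-directory", output_directory]
    optional = [
        ("--16s-fastq1", fastq1),
        ("--16s-fastq2", fastq2),
        ("--16s-fastq-list", fastq_list),
        ("--16s-db-dir", db_dir),
        ("--16s-custom-references", custom_references),
        ("--16s-report-threshold", report_threshold),
        ("--num-threads", num_threads),
    ]
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    return args


cmd = build_16s_command(
    "dragen",  # placeholder: substitute the full path to the DRAGEN binary
    "analysis1", "/path/to/output/",
    fastq1="/path/to/input/sampleA_S1_L001_R1_001.fastq.gz",
    fastq2="/path/to/input/sampleA_S1_L001_R2_001.fastq.gz",
    custom_references="/path/to/references.fna",
    report_threshold=5,
    num_threads=20,
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to start the analysis.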
