DRAGEN 16S Pipeline
Summary
The DRAGEN 16S pipeline is a rapid, k-mer-based informatics solution designed for microbial classification and community profiling from mixed flora and metagenomic sample types. The pipeline performs secondary analysis of Illumina 16S sequencing data, with steps for read QC (optional), taxonomic classification, result filtering (optional), and reporting.
The input data to the pipeline is a FASTQ or a set of FASTQs. The results of the analysis are provided in a set of output files: a table for each sample that gives read counts for detected taxa, a JSON for each sample with QC results and metadata, and tables that summarize the results for all of the samples in the analysis.
The pipeline is supported by a database containing 14,676 bacterial and 660 archaeal full-length 16S rRNA gene sequences. The required resource files can be downloaded following these instructions.
Alternatively, the analysis can be performed using a custom set of reference sequences defined by the user in a FASTA file.
Input Options
The input to the 16S pipeline is a FASTQ or set of FASTQs. The options for providing the input data to the pipeline include:
Analyze a single-ended FASTQ with --16s-fastq1
Analyze a pair of FASTQs with --16s-fastq1 and --16s-fastq2
Analyze one or more FASTQs with --16s-fastq-list
The FASTQ list file must be a CSV with at least the following fields:
RGSM: the sample ID; used to name output files
Read1File: the path to the R1 file
Read2File: the path to the R2 file; leave blank for single-ended data
Other fields in the CSV will be ignored.
Samples that were sequenced across multiple lanes will take up more than one line in the CSV. For example, a sample sequenced across two lanes will have two lines, one for each lane.
Example FASTQ list file with a mixed collection of FASTQ files:
RGSM,Read1File,Read2File
SingleEndedSampleA,/path/to/sampleA_S1_L001_R1_001.fastq.gz,
PairedEndedSampleB,/path/to/sampleB_S1_L001_R1_001.fastq.gz,/path/to/sampleB_S1_L001_R2_001.fastq.gz
MultiLaneSampleC,/path/to/sampleC_S1_L001_R1_001.fastq.gz,/path/to/sampleC_S1_L001_R2_001.fastq.gz
MultiLaneSampleC,/path/to/sampleC_S1_L002_R1_001.fastq.gz,/path/to/sampleC_S1_L002_R2_001.fastq.gz
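A FASTQ list like the one above can be generated rather than typed by hand. The following is a minimal sketch, not part of DRAGEN: the function name make_fastq_list is hypothetical, and it assumes every file follows the <sample>_S<n>_L<lane>_R1_001.fastq.gz naming pattern used in the example paths, with no commas in any path. A sample that spans several lanes produces one line per lane, and Read2File is left blank when no matching R2 file exists.

```shell
# Sketch: print a FASTQ list CSV (RGSM,Read1File,Read2File) for a directory.
# Assumption: files are named <sample>_S<n>_L<lane>_R1_001.fastq.gz (and _R2_).
make_fastq_list() {
    fastq_dir=$1
    echo "RGSM,Read1File,Read2File"
    for r1 in "$fastq_dir"/*_R1_001.fastq.gz; do
        # Skip the literal pattern when the directory has no matching files.
        [ -e "$r1" ] || continue
        # Sample ID = file name without the _S<n>_L<lane>_R1_001.fastq.gz suffix.
        sample=$(basename "$r1" | sed -E 's/_S[0-9]+_L[0-9]+_R1_001\.fastq\.gz$//')
        # Expected mate file: same path with R1 swapped for R2.
        r2=$(printf '%s\n' "$r1" | sed -E 's/_R1_001\.fastq\.gz$/_R2_001.fastq.gz/')
        # Leave Read2File blank for single-ended data.
        [ -e "$r2" ] || r2=""
        printf '%s,%s,%s\n' "$sample" "$r1" "$r2"
    done
}
```

For example, make_fastq_list /path/to/input > fastq_list.csv writes a file that can then be passed to --16s-fastq-list.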
Output Files
Sample Summary CSV
The file is named <outputFilePrefix>.<sampleId>.summary.csv. It is a tabular file with counts of how reads were classified in the sample. Each unique classification is a row, and the number of hits and the percentage of total hits are recorded for each row. For partial classifications (e.g., where a Class-level classification was made, but not an Order-level classification), the unclassified levels are left blank.
Sample Metadata JSON
The file is named <outputFilePrefix>.<sampleId>.ap.json. It contains general metadata, version information, sample QC, statistics, and analysis options specified by the user. See Report JSON format for further details.
Analysis Aggregate CSVs
The files are named according to the taxonomic level they correspond to: <outputFilePrefix>.<Kingdom,Phylum,Class,Order,Family,Genus,Species>_Level_Aggregate_Counts.csv. These seven tabular files contain per-level aggregate classified read counts for all samples in the analysis. Each row is a unique classification label that occurred in one or more samples. Each column is a unique sample ID. Entries in the table are the number of classified reads (or read pairs for paired-end data) for a given classification label and sample ID.
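Because each aggregate file has one row per classification label and one column per sample, it is easy to query with standard tools. The sketch below is an illustration only: the function name top_taxon_per_sample is hypothetical, and it assumes a plain comma-separated layout with a header row, the label in the first column, and no quoted fields containing commas. Check those assumptions against an actual output file before relying on it.

```shell
# Sketch: report the label with the highest read count for each sample column.
# Assumptions: comma-separated, header row of sample IDs, label in column 1.
top_taxon_per_sample() {
    awk -F',' '
        NR == 1 {
            # Remember the sample ID that heads each count column.
            for (i = 2; i <= NF; i++) sample[i] = $i
            ncol = NF
            next
        }
        {
            # Keep the largest count seen so far in every sample column.
            for (i = 2; i <= ncol; i++)
                if ($i + 0 > best[i] + 0) { best[i] = $i + 0; label[i] = $1 }
        }
        END {
            # One output line per sample: sample ID, top label, read count.
            for (i = 2; i <= ncol; i++)
                printf "%s,%s,%d\n", sample[i], label[i], best[i]
        }
    ' "$1"
}
```

Calling top_taxon_per_sample on a Genus-level aggregate file prints one line per sample with the most abundant genus and its count. A sample whose counts are all zero is reported with an empty label.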
Command Line Settings
Required Inputs
--enable-16s
Enables the 16S Pipeline. (Default=false).
--output-file-prefix
Prefix for all output files and the identifier for the analysis.
--output-directory
Directory for all output files.
--16s-fastq1
Path to the R1 FASTQ file.
--16s-fastq2
Path to the R2 FASTQ file.
--16s-fastq-list
Path to a FASTQ list CSV file.
--16s-db-dir
Directory containing the 16S database files.
--16s-custom-references
Path to a FASTA file containing reference sequences to use for the analysis, instead of the pre-built database.
Optional Inputs
--intermediate-results-dir
Directory for temporary files.
--16s-read-qc
Whether the reads should be put through a QC process prior to the k-mer analysis. (Default=true).
--16s-report-threshold
Exclude classification results with read counts below this threshold from the output. (Default=0).
--16s-report-per-sample
Output reports per sample. (Default=true).
--16s-report-aggregate
Output aggregate reports. (Default=true).
--num-threads
Number of threads to use for the analysis. (Default=1).
Example Commands
The following is an example of an analysis with a paired-end sample and the pre-built database:
/opt/dragen/$VERSION/bin/dragen \
  --enable-16s true \
  --output-file-prefix analysis1 \
  --output-directory /path/to/output/ \
  --16s-fastq1 /path/to/input/sampleA_S1_L001_R1_001.fastq.gz \
  --16s-fastq2 /path/to/input/sampleA_S1_L001_R2_001.fastq.gz \
  --16s-db-dir /path/to/databases/Refseq-RDP-v1/1.0.1/ \
  --num-threads 20
The following is an example of an analysis with a paired-end sample, a set of custom reference sequences, and an increased report threshold:
/opt/dragen/$VERSION/bin/dragen \
  --enable-16s true \
  --output-file-prefix analysis1 \
  --output-directory /path/to/output/ \
  --16s-fastq1 /path/to/input/sampleA_S1_L001_R1_001.fastq.gz \
  --16s-fastq2 /path/to/input/sampleA_S1_L001_R2_001.fastq.gz \
  --16s-custom-references /path/to/references.fna \
  --16s-report-threshold 5 \
  --num-threads 20
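Neither example covers the FASTQ list input described under Input Options. The following sketch wraps such a run in a small shell function using only the options documented on this page. The function name run_16s_fastq_list and the DRAGEN_BIN override are hypothetical conveniences, not part of DRAGEN. When DRAGEN_BIN is unset, the binary path from the examples above is used.

```shell
# Sketch: run the 16S pipeline on a FASTQ list with the pre-built database.
# Arguments: output prefix, output directory, FASTQ list CSV, database
# directory, and an optional thread count (1, the documented default, if omitted).
run_16s_fastq_list() {
    # DRAGEN_BIN is a hypothetical override, handy for dry runs (DRAGEN_BIN=echo).
    "${DRAGEN_BIN:-/opt/dragen/$VERSION/bin/dragen}" \
        --enable-16s true \
        --output-file-prefix "$1" \
        --output-directory "$2" \
        --16s-fastq-list "$3" \
        --16s-db-dir "$4" \
        --num-threads "${5:-1}"
}
```

Setting DRAGEN_BIN=echo before calling run_16s_fastq_list analysis1 /path/to/output/ /path/to/fastq_list.csv /path/to/databases/Refseq-RDP-v1/1.0.1/ 20 prints the assembled arguments without starting an analysis, which is a cheap way to check paths first.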