# DRAGEN 16S Pipeline

## Summary

The DRAGEN 16S pipeline is a rapid, k-mer-based informatics solution designed for microbial classification and community profiling from mixed flora and metagenomic sample types. The pipeline delivers powerful secondary analysis of Illumina 16S sequencing data, with steps for read QC (optional), taxonomic classification, result filtering (optional), and reporting.

The input data to the pipeline is a [FASTQ or a set of FASTQs](#input-options). The [results of the analysis are provided in a set of output files](#output-files): a table for each sample that gives read counts for detected taxa, a JSON with QC results and metadata, and a set of per-level tables that summarize the results for all of the samples in the analysis.

The pipeline is supported by a database containing 14,676 bacterial and 660 archaeal full-length 16S rRNA gene sequences; see [16S Pipeline Pre-Built Database](https://help.connected.illumina.com/dragen/dragen-v4.4/product-guide/dragen-v4.4/dragen-16s-pipeline/dragen-16s-db). The required resource files can be downloaded following [these instructions](https://help.connected.illumina.com/dragen/dragen-v4.4/product-guide/dragen-v4.4/dragen-16s-db#downloading-the-database).

Alternatively, the analysis can be performed using a [custom set of reference sequences](https://help.connected.illumina.com/dragen/dragen-v4.4/product-guide/dragen-v4.4/dragen-16s-pipeline/dragen-16s-custom-db) defined by the user in a FASTA file.

## Input Options

The input to the 16S pipeline is a FASTQ or set of FASTQs. The options for providing the input data to the pipeline include:

1. Analyze a single-ended FASTQ with `--16s-fastq1`
2. Analyze a pair of FASTQs with `--16s-fastq1` and `--16s-fastq2`
3. Analyze one or more FASTQs with `--16s-fastq-list`

The FASTQ list file must be a CSV with at least the following fields:

1. RGSM: the sample ID; used to name output files
2. Read1File: the path to the R1 file
3. Read2File: the path to the R2 file; leave blank for single-ended data

Other fields in the CSV will be ignored.

Samples that were sequenced across multiple lanes will take up more than one line in the CSV. For example, a sample sequenced across two lanes will have two lines, one for each lane.

Example FASTQ list file with a mixed collection of FASTQ files:

```
RGSM,Read1File,Read2File
SingleEndedSampleA,/path/to/sampleA_S1_L001_R1_001.fastq.gz,
PairedEndedSampleB,/path/to/sampleB_S1_L001_R1_001.fastq.gz,/path/to/sampleB_S1_L001_R2_001.fastq.gz
MultiLaneSampleC,/path/to/sampleC_S1_L001_R1_001.fastq.gz,/path/to/sampleC_S1_L001_R2_001.fastq.gz
MultiLaneSampleC,/path/to/sampleC_S1_L002_R1_001.fastq.gz,/path/to/sampleC_S1_L002_R2_001.fastq.gz
```
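The FASTQ list layout described above can be checked before launching an analysis. The following Python sketch is not part of DRAGEN; the function name `read_fastq_list` and the `Lane` column in the usage example are illustrative assumptions. It verifies that the three required fields are present, ignores any other columns, and groups multi-lane rows by sample ID.

```python
import csv
import io
from collections import defaultdict

# Fields the FASTQ list CSV must contain, per the description above.
REQUIRED_FIELDS = ("RGSM", "Read1File", "Read2File")


def read_fastq_list(handle):
    """Group FASTQ list rows by sample ID (RGSM).

    Returns a dict mapping each sample ID to a list of (R1, R2) tuples,
    one tuple per CSV row (for example, one per lane). R2 is None for
    single-ended rows. Hypothetical helper, not a DRAGEN API.
    """
    reader = csv.DictReader(handle)
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"FASTQ list is missing required fields: {missing}")

    samples = defaultdict(list)
    for row in reader:
        sample_id = (row["RGSM"] or "").strip()
        r1 = (row["Read1File"] or "").strip()
        r2 = (row["Read2File"] or "").strip()
        if not sample_id or not r1:
            raise ValueError(f"Row needs both RGSM and Read1File: {row}")
        # A blank Read2File marks single-ended data.
        samples[sample_id].append((r1, r2 or None))
    return dict(samples)


if __name__ == "__main__":
    # The extra 'Lane' column is an assumption used to show that
    # unrecognized fields are simply ignored.
    example = io.StringIO(
        "RGSM,Read1File,Read2File,Lane\n"
        "SampleA,/path/to/sampleA_R1.fastq.gz,,1\n"
        "SampleC,/path/to/sampleC_L001_R1.fastq.gz,/path/to/sampleC_L001_R2.fastq.gz,1\n"
        "SampleC,/path/to/sampleC_L002_R1.fastq.gz,/path/to/sampleC_L002_R2.fastq.gz,2\n"
    )
    for sample_id, fastqs in read_fastq_list(example).items():
        print(sample_id, len(fastqs), "row(s)")
```

A sample that appears on several rows ends up with several `(R1, R2)` tuples, which mirrors the one-line-per-lane convention.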

## Output Files

### Sample Summary CSV

The file is named `<outputFilePrefix>.<sampleId>.summary.csv`. It is a tabular file with counts of how reads were classified in the sample. Each unique classification is a row, and the number of hits and the percentage of total hits are recorded for each row. Partial classifications (e.g., where a Class-level classification was made, but not an Order-level classification) have the unclassified levels left blank.
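To make the row structure concrete, the Python sketch below tallies per-read classifications into rows of the kind described above: one row per unique classification, with a hit count, a percentage of total hits, and blanks for unclassified levels. It illustrates the table's logic only and is not the pipeline's implementation; the function name `summarize` and the column order are assumptions, since the exact CSV header is not specified here.

```python
from collections import Counter

# The seven taxonomic levels, from most general to most specific.
LEVELS = ("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")


def summarize(classifications):
    """Tally per-read lineages into summary rows.

    `classifications` is an iterable of tuples holding the classified
    levels for each read, starting at Kingdom. Shorter tuples represent
    partial classifications. Each returned row is the seven levels
    (blank where unclassified), the hit count, and the percentage of
    total hits. Hypothetical helper with an assumed column order.
    """
    # Pad every lineage to seven levels so partial classifications
    # carry blank entries for the levels that were not assigned.
    padded = [
        tuple(lineage) + ("",) * (len(LEVELS) - len(lineage))
        for lineage in classifications
    ]
    counts = Counter(padded)
    total = sum(counts.values())
    return [
        (*lineage, hits, round(100.0 * hits / total, 2))
        for lineage, hits in counts.most_common()
    ]


if __name__ == "__main__":
    reads = [("Bacteria", "Firmicutes", "Bacilli")] * 3 + [
        ("Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
         "Lactobacillaceae", "Lactobacillus", "Lactobacillus gasseri")
    ]
    for row in summarize(reads):
        print(row)
```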

### Sample Metadata JSON

The file is named `<outputFilePrefix>.<sampleId>.ap.json`. It contains general metadata, version information, sample QC, statistics, and analysis options specified by the user. See [Report JSON format](https://help.connected.illumina.com/dragen/dragen-v4.4/product-guide/dragen-v4.4/dragen-16s-pipeline/report-json-format) for further details.

### Analysis Aggregate CSVs

The aggregate results are written to seven tabular files, one per taxonomic level, named `<outputFilePrefix>.<Kingdom,Phylum,Class,Order,Family,Genus,Species>_Level_Aggregate_Counts.csv`. Each file contains the aggregate classified read counts at that level for all samples in the analysis. Each row is a unique classification label that occurred in one or more samples. Each column is a unique sample ID. Entries in the table are the number of classified reads (or read pairs for paired-end data) for a given classification label and sample ID.
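The naming pattern above expands to seven concrete file names. A minimal Python sketch for locating them, assuming the files sit directly in the output directory (the helper name `aggregate_count_paths` is hypothetical):

```python
from pathlib import Path

# Taxonomic levels that appear in the aggregate file names.
LEVELS = ("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")


def aggregate_count_paths(output_directory, output_file_prefix):
    """Return the seven expected aggregate CSV paths, Kingdom first.

    Assumes the files are written directly into `output_directory`.
    """
    return [
        Path(output_directory)
        / f"{output_file_prefix}.{level}_Level_Aggregate_Counts.csv"
        for level in LEVELS
    ]


if __name__ == "__main__":
    for path in aggregate_count_paths("/path/to/output", "analysis1"):
        print(path)
```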

## Command Line Settings

| Option                       | Description                                                                                                     |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------- |
| Required Inputs              |                                                                                                                 |
| `--enable-16s`               | Enables the 16S Pipeline. (Default=false).                                                                      |
| `--output-file-prefix`       | Prefix for all output files and the identifier for the analysis.                                                |
| `--output-directory`         | Directory for all output files.                                                                                 |
| `--16s-fastq1`               | Path to the R1 FASTQ file.                                                                                      |
| `--16s-fastq2`               | Path to the R2 FASTQ file.                                                                                      |
| `--16s-fastq-list`           | Path to a FASTQ list CSV file.                                                                                  |
| `--16s-db-dir`               | Directory containing the 16S database files.                                                                    |
| `--16s-custom-references`    | Path to a FASTA file containing reference sequences to use for the analysis, instead of the pre-built database. |
| Optional Inputs              |                                                                                                                 |
| `--intermediate-results-dir` | Area for temporary files.                                                                                       |
| `--16s-read-qc`              | Whether the reads should be put through a QC process prior to the k-mer analysis. (Default=true).               |
| `--16s-report-threshold`     | Filter classification results from output that have read counts below this threshold. (Default=0).              |
| `--16s-report-per-sample`    | Output reports per sample. (Default=true).                                                                      |
| `--16s-report-aggregate`     | Output aggregate reports. (Default=true).                                                                       |
| `--num-threads`              | Number of threads to use for the analysis. (Default=1).                                                         |

## Example Commands

The following is an example of an analysis with a paired-end sample and the pre-built database:

```
/opt/dragen/$VERSION/bin/dragen \
  --enable-16s true \
  --output-file-prefix analysis1 \
  --output-directory /path/to/output/ \
  --16s-fastq1 /path/to/input/sampleA_S1_L001_R1_001.fastq.gz \
  --16s-fastq2 /path/to/input/sampleA_S1_L001_R2_001.fastq.gz \
  --16s-db-dir /path/to/databases/Refseq-RDP-v1/1.0.1/ \
  --num-threads 20
```

The following is an example of an analysis with a paired-end sample, a set of custom reference sequences, and an increased report threshold:

```
/opt/dragen/$VERSION/bin/dragen \
  --enable-16s true \
  --output-file-prefix analysis1 \
  --output-directory /path/to/output/ \
  --16s-fastq1 /path/to/input/sampleA_S1_L001_R1_001.fastq.gz \
  --16s-fastq2 /path/to/input/sampleA_S1_L001_R2_001.fastq.gz \
  --16s-custom-references /path/to/references.fna \
  --16s-report-threshold 5 \
  --num-threads 20
```
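When analyses are launched from a script, the same commands can be assembled programmatically. The Python sketch below builds the argument list from the options in the table above. The wrapper function `build_16s_command` and its parameter names are hypothetical, and the check that exactly one reference source is supplied is an assumption based on the custom references being described as a replacement for the pre-built database.

```python
import shlex


def build_16s_command(dragen_bin, prefix, out_dir, fastq1, fastq2=None,
                      db_dir=None, custom_references=None,
                      report_threshold=None, num_threads=None):
    """Assemble a DRAGEN 16S command as a list of arguments.

    Hypothetical wrapper: only flags listed in the Command Line
    Settings table are emitted.
    """
    # Assumption: the pre-built database and custom references are
    # alternatives, so exactly one of them must be given.
    if (db_dir is None) == (custom_references is None):
        raise ValueError("Provide exactly one of db_dir or custom_references")

    cmd = [
        dragen_bin,
        "--enable-16s", "true",
        "--output-file-prefix", prefix,
        "--output-directory", out_dir,
        "--16s-fastq1", fastq1,
    ]
    if fastq2 is not None:  # omit for single-ended data
        cmd += ["--16s-fastq2", fastq2]
    if db_dir is not None:
        cmd += ["--16s-db-dir", db_dir]
    else:
        cmd += ["--16s-custom-references", custom_references]
    if report_threshold is not None:
        cmd += ["--16s-report-threshold", str(report_threshold)]
    if num_threads is not None:
        cmd += ["--num-threads", str(num_threads)]
    return cmd


if __name__ == "__main__":
    cmd = build_16s_command(
        "/opt/dragen/VERSION/bin/dragen",  # placeholder path
        "analysis1",
        "/path/to/output/",
        "/path/to/input/sampleA_S1_L001_R1_001.fastq.gz",
        fastq2="/path/to/input/sampleA_S1_L001_R2_001.fastq.gz",
        custom_references="/path/to/references.fna",
        report_threshold=5,
        num_threads=20,
    )
    # Print a shell-quoted form for review; on a DRAGEN server the list
    # could instead be passed to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.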
