Note: Some files may not be generated, depending on user inputs and pipeline outcome.
For each sample, the pipeline generates a directory named after the sample. This directory contains the following subdirectories:
consensus/
{sample_name}_sample_consensus.fasta
: Contains all hard-masked consensus sequences for the sample. Regions with coverage below the minimum coverage depth for consensus sequence generation (10x by default) are considered not callable and are therefore “hard-masked” with the letter N. Variant calling is not applied to these regions. If the user selected specific VSP or RVOP organisms to be reported, this file excludes consensus sequences that were generated but do not belong to the selected organisms.
{sample_name}.consensus_hard_masked_sequence.fa
: Identical to {sample_name}_sample_consensus.fasta, except for sequence headers. In addition, even if the user selected specific VSP or RVOP organisms to be reported, this file contains all consensus sequences, including those that do not belong to the selected organisms.
{sample_name}.consensus_soft_masked_sequence.fa
: Identical to {sample_name}.consensus_hard_masked_sequence.fa, except that low-coverage regions are “soft-masked” with lower-case letters that match the reference. Variant calling is not performed in these regions.
{sample_name}_{virus_name}_virus_consensus.fasta
: Contains hard-masked consensus sequences generated for a particular virus. If the virus is not segmented (i.e. it has one reference sequence), this file contains a single sequence and is identical to {sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta.
{sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta
: Contains a hard-masked consensus sequence generated with a particular reference sequence.
{sample_name}.consensus_from_vcf.log
: Log file generated during consensus sequence generation.
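The difference between hard and soft masking described above can be illustrated with a minimal Python sketch. It is not the pipeline's implementation: the function name mask_low_coverage is hypothetical, and the sketch assumes the reference, the consensus, and the per-base coverage all have the same length (i.e. no insertions or deletions).

```python
def mask_low_coverage(reference, consensus, coverage, min_depth=10):
    """Return (hard_masked, soft_masked) versions of a consensus sequence.

    Hypothetical helper for illustration only. Assumes reference, consensus
    and coverage are aligned position by position (no indels).
    """
    if not (len(reference) == len(consensus) == len(coverage)):
        raise ValueError("reference, consensus and coverage must have equal length")

    hard, soft = [], []
    for ref_base, cons_base, depth in zip(reference, consensus, coverage):
        if depth < min_depth:
            # Not callable: hard mask uses N, soft mask uses the
            # lower-case reference base.
            hard.append("N")
            soft.append(ref_base.lower())
        else:
            # Callable: both outputs carry the consensus base.
            hard.append(cons_base.upper())
            soft.append(cons_base.upper())
    return "".join(hard), "".join(soft)


if __name__ == "__main__":
    hard, soft = mask_low_coverage("ACGTAC", "ACTTAC", [12, 15, 30, 9, 3, 10])
    print(hard)
    print(soft)
```

In this toy input, the fourth and fifth positions fall below 10x, so they are the only positions that differ between the two outputs.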
contig/
{sample_name}_sample_contig.fasta
: Contains all de novo assembled contigs generated for the sample.
{sample_name}_unmapped_contig.fasta
: Contains de novo assembled contigs that could not be mapped to any reference sequence. Because de novo assembly is reference-free, these contigs may correspond to sequences that are too divergent from those in the reference database, or to sequences not included in the database.
{sample_name}_{virus_name}_virus_contig.fasta
: Contains de novo assembled contigs that mapped to a particular virus.
{sample_name}_{virus_name}_{segment_name}_{sequence_accession}_contig.fasta
: Contains de novo assembled contigs that mapped to a particular reference sequence, which resulted in choosing the reference sequence for short-read alignment and generating {sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta.
{sample_name}_reference_selection.log
: Log file generated during reference selection.
map_align/
{sample_name}_unfiltered_tumor.bam or {sample_name}_unfiltered_tumor_primertrim_with_unmapped_reads.bam
: Short reads mapped to all selected reference sequences. If a primer set is available and properly mapped to the reference sequences, {sample_name}_unfiltered_tumor_primertrim_with_unmapped_reads.bam is provided as output. Its reads have primer sequences trimmed based on primer binding site coordinates.
{sample_name}-unmapped_S1_L001_R1_001.fastq.gz, {sample_name}-unmapped_S1_L001_R2_001.fastq.gz
: Short reads that do not map to any selected reference sequences. These may be used to find organisms not reported by the pipeline.
variant_calling/
{sample_name}.consensus_filtered_variants.vcf.gz
: Contains variant calls that passed the consensus filter and were therefore applied to consensus sequences. They are a subset of the variants listed in {sample_name}.hard-filtered.vcf.gz.
{sample_name}.consensus_filtered_variants_vcf_stats.txt
: Summarizes all variant calls in {sample_name}.consensus_filtered_variants.vcf.gz. Generated by bcftools stats.
{sample_name}.consensus_filtered_variants_summary.csv
: Describes each variant call in {sample_name}.consensus_filtered_variants.vcf.gz.
{sample_name}.hard-filtered.vcf.gz
: Contains all variant calls.
{sample_name}.consensus_input_vcf_stats.txt
: Summarizes all variant calls in {sample_name}.hard-filtered.vcf.gz. Generated by bcftools stats.
{sample_name}.consensus_all_variants_summary.csv
: Describes each variant call in {sample_name}.hard-filtered.vcf.gz.
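Because the consensus-filtered VCF is described as a subset of the hard-filtered VCF, the two files can be compared to see which calls were not applied to the consensus. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and identifying a variant by (CHROM, POS, REF, ALT) is an assumption made for illustration, not something the pipeline specifies.

```python
import gzip


def variant_keys(vcf_gz_path):
    """Collect (CHROM, POS, REF, ALT) for every record in a gzipped VCF."""
    keys = set()
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            # Skip meta-information lines, the column header and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            keys.add((chrom, int(pos), ref, alt))
    return keys


def variants_not_in_consensus(all_vcf_path, filtered_vcf_path):
    """Return calls present in the full VCF but absent from the filtered VCF."""
    all_keys = variant_keys(all_vcf_path)
    kept_keys = variant_keys(filtered_vcf_path)
    unexpected = kept_keys - all_keys
    if unexpected:
        # The filtered file is expected to be a subset of the full file.
        raise ValueError(f"filtered VCF has calls missing from full VCF: {sorted(unexpected)}")
    return sorted(all_keys - kept_keys)
```

A call such as variants_not_in_consensus("S1.hard-filtered.vcf.gz", "S1.consensus_filtered_variants.vcf.gz"), where S1 is a placeholder sample name, would return the calls that did not pass the consensus filter.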
metrics/
{sample_name}_num_reads.tsv
: Reports number of input reads, reads filtered out at each pre-processing step, reads mapped to each selected reference sequence, etc.
{sample_name}_metadata.json
: Reports parameters, read counts, amplicon counts, analysis results, and other metadata.
{sample_name}.consensus_metrics.csv
: Reports consensus metrics (e.g. total length of pre-trimmed sequence, fraction of masked bases, number of callable bases) for each generated consensus sequence.
{sample_name}.consensus_coverage_from_filtered_bam.tsv
: Reports base-pair-resolution read coverage for all reference sequences, based on the short-read map/align step. Its three columns are: chrom/accession, base position, and coverage.
{sample_name}.consensus_callable_regions_from_filtered_bam.bed
: Reports callable regions in all reference sequences based on base-pair-resolution read coverage and minimum coverage depth for consensus sequence generation (10x by default). Bases outside of these regions are masked in consensus FASTA.
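The relationship between the coverage TSV and the callable-regions BED can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, positions in the TSV are assumed to be 1-based, coverage values are assumed to be integers, and the output intervals follow the usual BED convention of 0-based, half-open coordinates.

```python
def parse_coverage(lines):
    """Yield (chrom, position, depth) from three-column, tab-separated lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        try:
            yield fields[0], int(fields[1]), int(fields[2])
        except ValueError:
            # Non-numeric row (for example a header line, if one exists).
            continue


def callable_regions(rows, min_depth=10):
    """Merge consecutive positions with depth >= min_depth into BED intervals.

    Assumes rows are sorted by chrom and position, with 1-based positions.
    Returns a list of (chrom, start, end) with 0-based, half-open coordinates.
    """
    regions = []
    for chrom, pos, depth in rows:
        if depth < min_depth:
            continue
        start = pos - 1
        if regions and regions[-1][0] == chrom and regions[-1][2] == start:
            # Extend the current interval by one base.
            regions[-1] = (chrom, regions[-1][1], pos)
        else:
            regions.append((chrom, start, pos))
    return regions
```

Bases that fall outside the returned intervals correspond to the positions that would be masked in the consensus FASTA.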
amplicon/
{sample_name}_processed_non_overlapping_amplicon.bed
: Lists all non-overlapping amplicon regions (i.e. regions covered by exactly one amplicon). If a custom primer set is provided, this file also lists selected reference sequences that lack amplicons. Amplicons are defined from primer binding sites, but for viruses such as influenza the binding sites are located at the sequence ends, and reference sequences often do not include them. As a result, fewer amplicons, or sometimes none, are defined for an entire viral genome. To avoid this, each reference sequence with no amplicons defined is treated as a single amplicon and is listed in this file. All regions in this file are used for amplicon detection, which infers sample concentration and determines whether it is sufficient for variant calling and consensus sequence generation.
{sample_name}_calculate_amplicon_coverage.csv
: Reports coverage metrics (e.g. median coverage, fraction of bases with at least 1x coverage) for each non-overlapping amplicon region listed in {sample_name}_processed_non_overlapping_amplicon.bed.
{sample_name}_generate_all_primer_bed.log
: Log file generated while defining amplicons for selected reference sequences and writing relevant BED files.
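The two example metrics named for {sample_name}_calculate_amplicon_coverage.csv (median coverage and fraction of bases with at least 1x coverage) can be sketched for a single BED region as shown below. This is only an illustration: the function name and the dictionary keys are hypothetical and are not the column names of the actual CSV. The sketch assumes BED coordinates are 0-based and half-open, and that per-base depths are keyed by 1-based position.

```python
from statistics import median


def amplicon_metrics(depth_by_position, start, end):
    """Compute coverage metrics for one amplicon region.

    depth_by_position: dict mapping 1-based position -> read depth
                       (positions missing from the dict count as depth 0).
    start, end:        BED-style 0-based, half-open region coordinates.
    """
    depths = [depth_by_position.get(pos, 0) for pos in range(start + 1, end + 1)]
    if not depths:
        raise ValueError("region must contain at least one base")
    covered = sum(1 for depth in depths if depth >= 1)
    return {
        "median_coverage": median(depths),
        "fraction_covered_1x": covered / len(depths),
    }
```

For example, a four-base region with depths 0, 4, 8 and 10 has a median coverage of 6.0, and 75% of its bases have at least 1x coverage.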
tertiary/
nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv
: CSV output file generated by NextClade given a consensus sequence (generated with the specified sequence as reference) and a NextClade dataset.
nextclade_{sample_name}_{sequence_accession}_{dataset_name}.tsv
: Same as nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv, except in TSV format.
nextclade_{sample_name}_{sequence_accession}_{dataset_name}.json
: Same as nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv, except in JSON format.
nextclade_{sample_name}_{sequence_accession}_{dataset_name}_log.txt
: Log file generated by NextClade given a consensus sequence (generated with the specified sequence as reference) and a NextClade dataset.
pangolin_{sample_name}_{sequence_accession}_lineage_report.csv
: CSV output file generated by Pangolin given a consensus sequence (generated with the specified sequence as reference) as input.
pangolin_{sample_name}_{sequence_accession}_log.txt
: Log file generated by Pangolin given a consensus sequence (generated with the specified sequence as reference) as input.
reference/
reference.bed
: Describes all reference sequences. If a custom reference was provided, sequence names in this BED file may differ from the original names.
reference.json
: Same as reference.bed, but with more detail. If any sequences were renamed during the pipeline, this file provides the mapping between the original and renamed versions.