Next generation sequencing (NGS) data files are very large. Working with them is time consuming and puts pressure on hard disk space, especially when analysis parameters need to be optimized. The Filter reads task produces a subset of the raw data on which optimization can be performed. The optimized parameters can then be saved and applied to the whole dataset.
Filter reads is only available for unaligned reads in FASTQ format. Select the Unaligned Reads data node, then select Filter reads from the Pre-alignment tools section of the menu.
There are two options to filter reads: Subsample reads and Filter by read length.
To Subsample reads, specify that one read is kept for every n reads. For example, if you specify "Keep one read for every 10 reads" (Figure 1), the program keeps only 1 read out of every 10. This is equivalent to keeping 10% of the data.
To Filter by read length, set the read length limits by choosing the minimum and maximum read length(s) to keep.
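The two filtering options can be illustrated with a minimal Python sketch. It is not Partek Flow's implementation: the function names are hypothetical, and the choice to keep the first read of each block of n reads is an assumption, since the text does not say which read in each block is retained.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples from FASTQ lines."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if len(record) < 4:  # end of input or incomplete trailing record
            return
        yield record


def subsample(records, n):
    """Keep one read for every n reads (assumption: the first of each block)."""
    for index, record in enumerate(records):
        if index % n == 0:
            yield record


def filter_by_length(records, min_len, max_len):
    """Keep reads whose length lies within [min_len, max_len]."""
    for record in records:
        if min_len <= len(record[1]) <= max_len:
            yield record
```

With n = 10, `subsample` yields roughly 10% of the input records, matching the "Keep one read for every 10 reads" example above.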
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Trim bases task is used to trim bases from the 5'-end or 3'-end of the reads. The most obvious reason for Trim bases is to trim away poor quality bases from the read prior to alignment because these can potentially affect alignment rate.
The task lets the user trim reads in different ways (Figure 1), including:
Trim bases based on quality score
Trim bases from 3'-end
Trim bases from 5'-end
Trim bases from both ends
Trim bases from 5'-end or 3'-end (Figures 2-3) trims a fixed number of bases from the 5'- or 3'-end of the reads. These two functions are useful when the read length is constant. They are not recommended when the read length varies, since good quality bases from shorter reads are likely to be trimmed away.
Trim bases from both ends (Figure 4) keeps only the bases between a fixed start and end position of the reads. This is particularly useful if poor quality bases are observed on both ends of the read. Instead of trimming successively from the 5'- and 3'-ends, the task trims both ends in a single pass.
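As a rough illustration of fixed-position trimming, the sketch below treats a read as a sequence string with a matching quality string. The function names and the 1-based, inclusive position convention are assumptions made for the example, not Partek Flow settings.

```python
def trim_fixed(seq, qual, from_5p=0, from_3p=0):
    """Remove a fixed number of bases from the 5'-end and/or the 3'-end."""
    end = max(len(seq) - from_3p, 0)
    return seq[from_5p:end], qual[from_5p:end]


def keep_positions(seq, qual, start, end):
    """Keep only bases from position start to end (1-based, inclusive)."""
    return seq[start - 1:end], qual[start - 1:end]
```

Calling `trim_fixed("ACGT", "IIII", from_3p=3)` on a short read leaves a single base, which shows why fixed trimming is risky when read lengths vary.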
Trim bases based on quality score (Figure 5) is probably the most useful function for trimming poor quality bases from the 5'- or 3'-ends of reads. It trims bases dynamically depending on their quality scores, from the 5'-end, the 3'-end, or both ends of the reads. Starting from the end of the read, the function evaluates each base and trims it away, stopping at the first base whose quality score is greater than the specified threshold. For an extensive evaluation of read trimming effects on Illumina NGS data analysis, see Del Fabbro et al. [1].
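The end-inward evaluation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the software's actual code: quality scores are assumed to be already decoded to integers, and the function name and arguments are hypothetical.

```python
def quality_trim(seq, quals, threshold, trim_5p=True, trim_3p=True):
    """Trim bases from the read ends until a base scores above threshold.

    seq   : nucleotide string
    quals : list of integer quality scores, one per base
    """
    start, end = 0, len(seq)
    if trim_3p:
        # Walk inward from the 3'-end while the score is not above threshold.
        while end > start and quals[end - 1] <= threshold:
            end -= 1
    if trim_5p:
        # Walk inward from the 5'-end in the same way.
        while start < end and quals[start] <= threshold:
            start += 1
    return seq[start:end], quals[start:end]
```

A read whose bases all score at or below the threshold is trimmed to an empty sequence; such reads would then be discarded by a minimum read length filter.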
In some cases, base trimming leaves reads that are very short and thus not recommended for alignment. Partek® Flow® therefore provides the option to set a Min read length after base trimming. Reads shorter than the set length are discarded.
Reads can also have a high percentage of N's, or ambiguous bases. The Max N setting discards reads whose percentage of N's is higher than the set threshold.
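The two discard rules can be expressed as a small predicate. This is only a sketch: the function names are hypothetical and the thresholds are passed in explicitly, because the text does not state a default for Max N.

```python
def percent_n(seq):
    """Percentage of ambiguous (N) bases in a read."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def keep_read(seq, min_read_length, max_n_percent):
    """Return True if a trimmed read passes both filters."""
    long_enough = len(seq) >= min_read_length
    few_enough_ns = percent_n(seq) <= max_n_percent
    return long_enough and few_enough_ns
```

A read is kept only when it is at least `min_read_length` bases long and its percentage of N's does not exceed `max_n_percent`.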
The Quality encoding option refers to how the Phred quality score is encoded within the FASTQ input file. The available options are Phred+33, Phred+64, Solexa+64 and Integers. Selecting Auto-detect will determine whether the quality encoding is Phred+33 or Phred+64. For Solexa data, you will need to select Solexa+64. For most datasets, auto-detect works very well; the few exceptions are cases where the base quality scores fall into the grey (ambiguous) zone shared by Phred+33 and Phred+64. If the quality-encoding scheme is known, we recommend selecting the encoding format directly from the Quality encoding list.
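To make the ambiguous zone concrete, here is a hedged sketch of how an encoding could be guessed from the characters of the quality strings. The cutoffs (ASCII 59 and 74) are a common heuristic assumed for this example; the text does not describe how Partek Flow's Auto-detect works internally, and the function names are hypothetical.

```python
def decode_quality(qual, offset):
    """Convert a FASTQ quality string to integer scores (offset 33 or 64)."""
    return [ord(char) - offset for char in qual]


def guess_encoding(quality_strings):
    """Guess Phred+33 vs Phred+64 from the observed character range.

    Heuristic assumption: characters below ASCII 59 cannot occur in the
    +64 encodings, and characters above ASCII 74 are unlikely in Phred+33.
    Anything in between is consistent with both, so it is reported as
    ambiguous. Solexa+64 is not distinguished here.
    """
    codes = [ord(char) for qual in quality_strings for char in qual]
    if not codes:
        return "ambiguous"
    if min(codes) < 59:
        return "Phred+33"
    if max(codes) > 74:
        return "Phred+64"
    return "ambiguous"
```

A dataset whose quality characters all fall between `A` and `J`, for instance, decodes to plausible scores under either offset, which is the situation where choosing the encoding manually is safer.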
Figure 6 shows the options available for each of the Trim bases functions. Note that the default Min read length is 25 bp. For microRNA sequencing data, Min read length needs to be set to a smaller value (we recommend 15) to account for mature microRNAs.
The Task Details page for Trim bases can be accessed by selecting the Trim bases task node and then selecting Task Details from the Task results section. The Task Details page contains several sections:
General task information: contains information such as the task name, owner, status, submitted time, start, end and duration of the task.
Output Files: contains the description of each output file. If you hover your mouse cursor over the file name, you will see the exact location of the file on the server. If you click on the file name, you will have the option to view up to 999 lines of the raw data. You can also download the file from the server.
Input Files: lists all the input files used in the Trim bases task.
Input Parameters: shows which option was selected for the Trim bases task and all the parameters used, such as minimum read length, maximum percentage of N's, quality encoding, quality score threshold (if applicable) and how trimming is performed.
Command Lines: shows the commands Partek Flow used to run the Trim bases function.
The Trim bases Task Report page can be accessed by selecting either the Trim bases task node or Trimmed reads data node and then selecting the Task Report from the Task results section of the context sensitive menu. There is a link at the bottom of the page to directly go to the Task Details page. The page displays the following components:
Summary table: gives the total number of reads in each sample, the total number of reads trimmed (i.e. with at least one base trimmed from the read), total number of reads removed (due to Min read length and Max N parameters), the average number of bases trimmed per read, the average read quality before trim bases and finally the average read quality after trim bases.
Stacked bar-chart: shows the percentages of untrimmed, trimmed and removed reads so that all samples can be compared.
Average base quality score per position of trimmed reads: shows the average base quality score at each position of the trimmed reads for all samples in the project.
The Trim bases function produces trimmed unaligned reads in a Trimmed Reads data node. The output files have the word "trimmed" appended to the filename. The Trimmed Reads data can be downloaded by selecting the Trimmed Reads node and then selecting Download data from the context sensitive menu. Alternatively, if you have access to the Partek Flow server, you can go to the Task Details page and find the location of the output files in the Output Files section, as described for the Trim bases Task Details page above. The Trimmed Reads data node has the same format as the raw data.
Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS ONE. 2013; 8(12): e85024.
The presence of adapter sequences at the 5'-end or 3'-end of the reads has been shown to be one of the major problems during alignment, causing reads to remain unaligned. Removing adapter sequences is therefore of utmost importance if the sequenced read length is longer than the molecule of interest, such as microRNA. Because mature microRNAs are short, it is almost certain that the adapter sequence will be sequenced at the 3'-end of the miRNA.
In order to know whether the data has been adapter-trimmed for microRNA data, we can look at the pre-alignment QA/QC of the raw data, specifically the read length distribution. If the read length distribution peaks at approximately 22-23 bases, this usually means the data has been adapter-trimmed. However, if you have a fixed length distribution, then very likely the data is not adapter-trimmed and you will need to get the adapter sequence from your vendor or service provider and use the Trim adapter function to trim away the adapter sequence.
Partek Flow software wraps Cutadapt [1], a widely used tool for adapter trimming. It can be used to trim adapter sequences in nucleotide-space data as well as color-space data.
To use the Trim adapters function, you will need to know the adapter sequences. To start Trim adapters, select the Unaligned Reads node and then select Trim adapters from the Pre-alignment tools section of the task pane. In the Trim adapters page (Figure 1), paste the adapter sequences into the textbox and select the button.
There are three options when it comes to trimming the adapter sequence:
Trimming for adapter ligated to 3'-end: the adapter sequence and anything that follows it will be trimmed away from the 3'-end.
Trimming for adapter ligated to 5'-end or 3'-end: the adapter sequence is identified within the read or overlapping the 3'-end, then the adapter sequence and anything that follows it will be trimmed away. However, if the adapter sequence partially overlaps the 5'-end of the read, the initial portion of the read matching the adapter sequence is trimmed and anything that follows it is kept.
Trimming for adapter ligated to 5'-end: if the adapter sequence appears partially at the 5'-end or within the read, the adapter sequence and everything preceding it are trimmed. You can place the special character '^' at the beginning of the adapter sequence to indicate that the adapter is 'anchored'. An anchored adapter must appear in its entirety at the 5'-end of the read (i.e. it is a prefix of the read).
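The 3'-end and anchored 5'-end cases above can be sketched in a few lines. This is a deliberately simplified illustration, not Cutadapt's algorithm: it uses exact matching only (Cutadapt tolerates errors up to a configurable rate), and the function names and the `min_overlap` default are assumptions made for the example.

```python
def trim_3p_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter and everything after it (exact matches only)."""
    position = read.find(adapter)
    if position != -1:
        # Full adapter found within the read.
        return read[:position]
    # Otherwise look for a prefix of the adapter overlapping the 3'-end.
    longest = min(len(adapter) - 1, len(read))
    for length in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:-length]
    return read


def trim_5p_anchored(read, adapter):
    """Remove an anchored 5' adapter: it must be a prefix of the read."""
    adapter = adapter.lstrip("^")
    if read.startswith(adapter):
        return read[len(adapter):]
    return read
```

For example, with the made-up adapter `GGCCTTAAGGCC`, the read `ACGTACGTGGCCTT` is trimmed to `ACGTACGT` because its last six bases match the start of the adapter.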
For Trim adapters, more than one adapter sequence can be specified at once. When multiple adapters are provided, each adapter is evaluated based on how many bases overlap the read as well as the error rate. Adapters that have a low number of overlapping nucleotides or a high error rate are removed from consideration.
After that, the best adapter will be chosen based on the number of matching bases to the read. If there is a tie, adapters of the same type will be chosen in the order they are provided and adapters of different types will be chosen by type in the following order: first 3', then 5' or 3', and lastly 5' adapters.
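The selection rules for multiple adapters can be summarized in a short sketch. The `Candidate` record, its field names, and the definition of error rate as errors divided by overlap length are assumptions made for illustration; they are not taken from Partek Flow or Cutadapt.

```python
from dataclasses import dataclass

# Tie-break order between adapter types, as described in the text.
TYPE_PRIORITY = {"3'": 0, "5' or 3'": 1, "5'": 2}


@dataclass
class Candidate:
    name: str
    adapter_type: str  # one of the keys of TYPE_PRIORITY
    order: int         # position in the user-supplied adapter list
    overlap: int       # number of bases overlapping the read
    matches: int       # number of matching bases
    errors: int        # number of mismatching bases


def choose_adapter(candidates, min_overlap, max_error_rate):
    """Pick the best adapter for a read, or None if none qualifies."""
    eligible = [
        c for c in candidates
        if c.overlap > 0
        and c.overlap >= min_overlap
        and c.errors / c.overlap <= max_error_rate
    ]
    if not eligible:
        return None
    # Most matching bases first, then adapter type, then input order.
    return min(
        eligible,
        key=lambda c: (-c.matches, TYPE_PRIORITY[c.adapter_type], c.order),
    )
```

Sorting on a tuple keeps the three criteria in a single place, so the priority between them is easy to read and change.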
There are cases where the Trim adapters function does not work properly with the default settings, for example when the reads contain N's. For these cases, advanced options let the user configure how adapter sequences are matched. The advanced options dialog box is shown in Figure 2.
The first section of advanced options is Adapter options. It configures how the matching between the adapter sequence and the read is performed. This includes the maximum error rate allowed, the number of times adapter matching is performed, the minimum length of overlapping bases, whether to allow N's (ambiguous bases) in the adapter, and whether N's are treated as wildcards. Hover the mouse cursor over the info button to get more information on each parameter.
The second section of advanced options is Filtering options. It filters out adapter-trimmed reads that are shorter than the minimum read length. This avoids very short reads, which tend to give non-unique alignments.
The third section of advanced options is Additional modification to reads. The quality cutoff is used to trim poor quality bases from the reads before trimming the adapter. Quality encoding specifies the quality score encoding of the raw data. The read name prefix and suffix options add a prefix and suffix to the read ID. Lastly, Negative quality zero, if checked, converts all negative base quality scores to zero.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011; 17: 10-12.
Partek Flow provides Pre-alignment tools that allow the user to process next-generation sequencing data before proceeding to alignment. These tools are not only useful for controlling the quality of data, but can also be used for subsampling prior to analyzing the full dataset. There are three functions available in Pre-alignment tools:
Users are expected to have a preliminary understanding of:
File formats for next generation sequencing data
Phred-quality score
In order to show the Pre-alignment tools, select an Unaligned reads or Trimmed reads data node. They will appear on the context-sensitive menu on the right of the screen (Figure 1).
Different Pre-alignment tools are available for different formats of unaligned reads. For example: if the reads are in FASTQ format, then all four tools are available. On the other hand, if the unaligned reads are in FASTA or SFF format, then the Filter reads option is not available.
The Trim tags task allows you to process unaligned read data with adaptors, barcodes, and UMIs using a Prep kit file that specifies the configuration of these elements in your NGS reads.
Click an Unaligned reads data node
Click the Pre-alignment QA/QC section of the toolbox
Click Trim tags
There are three parameters to configure - Prep kit, Keep untrimmed, and Map feature barcodes.
Selecting Keep untrimmed will generate a separate unaligned reads data node with any reads that do not match the structure specified by the prep kit. This option is off by default, to save on disk space. Selecting Map feature barcodes is only necessary for processing protein data from 10x Genomics' Feature Barcoding assay (v3+ chemistry). For single cell gene expression data, leave this option unchecked.
Partek distributes prep kits for processing several types of data:
10x Chromium Single Cell 3' v2
10x Chromium Single Cell 3' v3
10x Chromium Single Cell 5'
Drop-seq
Lexogen QuantSeq FWD-UMI
Bio-Rad SureCell WTA 3'
Fluidigm C1 mRNA Seq HT IFC
Rubicon Genomics ThruPLEX Tag-seq
1CellBio inDrop
If your data is from one of these sources, you can select the appropriate option in the Prep kit drop-down menu. If the data is from another source, you can build a custom prep kit file to process your data.
Choose a Prep kit from the drop-down menu
Click Finish to run Trim tags (Figure 1)
The output of Trim tags is a Trimmed reads data node. An additional Untrimmed reads data node will be generated if the Keep untrimmed option was selected.
The task report provides a table with the total reads, reads retained, % reads retained, reads removed, and % reads removed for each sample (Figure 2). You can click Download at the bottom of the table to save a text file copy to your computer.
Select Other / Custom from the Prep kit name drop-down menu
Give the new prep kit a name
Choose Build prep kit
You can select Import prep kit if you have a Prep kit .zip file downloaded from Partek Flow.
Click Create (Figure 3)
The Prep kit builder interface will load (Figure 4).
There are three sections:
Is paired end - select to switch from single end to paired end FASTQ files (Figure 5). If you choose paired end, the First mate will correspond to the _R1 FASTQ file and the Second mate will correspond to the _R2 FASTQ file.
Figure 5. Paired end prep kits have first and second mate segmentation sections
Segmentation - this is where you will describe the structure of your reads
Segments include adaptors, barcodes, UMIs, and the insert (i.e., the target sequence of the assay)
For adaptors, you have the option of choosing a file with your adaptor sequences or entering the adaptor sequences manually.
To use a file, choose File for Sequences and then click Choose File (Figure 6). Use the file browser to choose a FASTA file from your local computer.
You can specify the mismatch allowance using the Mismatches option.
After you have specified the file or manually entered the sequences, click Add to add the adaptor sequence(s).
Unique Molecular Identifiers (UMIs) are randomly generated sequences that uniquely identify an original starting molecule after PCR amplification.
Including a UMI in your prep kit will allow you to access a downstream task that uses UMI information for removing PCR duplicates. For more information about the Deduplicate UMIs task, please see our UMI Deduplication in Partek Flow white paper. Note that while the UMI sequence will be trimmed, a record of the UMI sequence for each read is retained for use by this downstream task.
When adding a UMI segment to your prep kit, you can specify the length of your UMIs (Figure 8).
Adding a barcode segment to a prep kit allows you to access downstream tasks that use barcode information, including Filter barcodes and Quantify barcodes to annotation model (Partek E/M). While the barcode sequence will be trimmed, a record of the barcode sequence for each read is retained for use by downstream tasks.
Like adaptors, barcodes can be specified using a file or manually specified, but you can also choose to designate any segment of arbitrary length in the sequence as the barcode. This is useful if you do not have a specific set of known barcodes.
To set the barcode to an arbitrary segment of fixed length, choose Arbitrary and specify the barcode length (Figure 9).
Remember to click Add to add the new segment to your prep kit.
The insert is the sequence retained after trimming in the Trimmed reads data node. For example, in RNA-Seq, this would be the mRNA sequence. Every prep kit must include an insert segment. You can specify the minimum size of the insert section using the Length field (Figure 10). Reads shorter than the minimum length will be discarded.
Remember to click Add to add the new segment to your prep kit.
Segments are placed from 5' to 3' in the read in the order they are added. You should add the 5' segment first and add additional elements in order of their position in the read. Segments will appear in the Segmentation sections as they are added. You can mouse over a segment to view its details (Figure 11).
For example, the expected read structure (Figure 12) and a completed prep kit for a standard Drop-seq library prep are shown below (Figure 13).
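The idea of segmenting a read in 5' to 3' order can be sketched as below. This is an illustration only: the function, the `(kind, length)` segment representation, and the segment lengths used in the usage note are hypothetical, and the sketch handles only fixed-length barcode and UMI segments followed by the insert.

```python
def apply_prep_kit(read, segments, min_insert_length):
    """Split a read into segments listed in 5' to 3' order.

    segments: list of (kind, length) tuples; the insert uses length None
    and takes the rest of the read.
    Returns a dict of segment sequences, or None if the read is discarded.
    """
    parts = {}
    position = 0
    for kind, length in segments:
        if kind == "insert":
            insert = read[position:]
            if len(insert) < min_insert_length:
                return None  # insert shorter than the minimum length
            parts["insert"] = insert
            position = len(read)
        else:
            piece = read[position:position + length]
            if len(piece) < length:
                return None  # read too short to contain this segment
            parts[kind] = piece
            position += length
    return parts
```

For instance, `apply_prep_kit(read, [("barcode", 12), ("umi", 8), ("insert", None)], 15)` returns the barcode, the UMI, and the insert as separate entries, mirroring how the barcode and UMI are trimmed from the read but kept on record for downstream tasks.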
Remove poly-A tail - choose this option to trim poly-A tails from the ends of the read with your insert sequence
Click Next to complete your prep kit
You can manage saved prep kits by going to Home > Settings > Library file management and opening the Prep kit files tab (Figure 14).
Prep kits download as a .zip file. This Prep kit .zip file can be imported into Partek Flow by selecting Import from a file when adding a new prep kit. Select the .zip file itself when importing; do not unzip it.
Use the add control to add a segment.
To enter the sequences manually, choose Manual for Sequences, then type or paste the adaptor sequences into the text field and use the add control next to it (Figure 7). An adaptor sequence is only included once it has been added this way. You can remove any adaptor you have added with its remove control.
You can add new prep kits from this page.
Each prep kit on this page can be previewed, deleted, or downloaded to your computer.