ChIP-Seq Analysis

Chromatin Immunoprecipitation Sequencing (ChIP-Seq) uses high-throughput DNA sequencing to map protein-DNA interactions across the entire genome. Partek Genomics Suite offers convenient visualization and analysis of ChIP-Seq data.

In this tutorial, we will use the Partek Genomics Suite ChIP-Seq workflow to analyze aligned data from a ChIP sample versus a control sample in .bam format.

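This tutorial illustrates:

Importing ChIP-Seq data
Quality control for ChIP-Seq samples
Detecting peaks and enriched regions in ChIP-Seq data
Creating a list of enriched regions
Identifying novel and known motifs
Finding nearest genomic features
Visualizing reads and enriched regions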

Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the request form to obtain this version, or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown in this tutorial may vary across platforms and across different versions of Partek Genomics Suite.

Description of the Data Set

The data for this tutorial comes from Johnson et al. 2007, which first described the ChIP-Seq technique.

This study mapped binding sites for the neuron-restrictive silencer factor (NRSF) transcription factor across the genome. There are two samples: an NRSF-enriched ChIP sample (chip.bam) and a control sample of input DNA without antibody immunoenrichment (mock.bam). The chip.bam file contains ~1.7 million mapped reads and the mock.bam file contains ~2.3 million mapped reads. These .bam files contain the aligned genomic locations and sequences of mapped reads. This data set contains reads from a single-end (SE) library; the differences in processing paired-end (PE) reads will be discussed when applicable.

Data for this tutorial can be downloaded from the Partek website using this link - . To follow this tutorial, download the 2 .bam files and unzip them on your local computer using 7-zip, WinRAR, or a similar program. Because of the large size of the .bam files, we recommend saving them to a local drive instead of trying to access them on a network drive. The first time a .bam file is read by Partek Genomics Suite, the file will be sorted to allow for faster access. Therefore, you must have write permissions for the .bam files after download and on the file folder where they are stored.

References

Johnson, D. S., Mortazavi, A., Myers, R. M., & Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science, 316(5830), 1497-1502.

Additional Assistance

Importing ChIP-Seq data

Store the two .bam files at C:\Partek Training Data\ChIP-Seq or in a directory of your choosing. We recommend creating a dedicated folder for the tutorial on a local drive.
Select ChIP-Seq from the Workflows drop-down menu (Figure 1)

Figure 1. Selecting the ChIP-Seq workflow

Select Import and Manage Samples from the Import section of the ChIP-Seq workflow
Select Browse... or use the file tree to navigate to the folder where you stored the .bam files

All .bam files in the folder will be selected by default (Figure 2).

Figure 2. Importing .bam files using the Sequence Import dialog

Verify that chip.bam and mock.bam are selected
Select OK

The Sequence Import dialog will launch (Figure 3). This allows us to choose the output file name and destination for the parent spreadsheet, as well as the species and genome build of the imported samples. By default, the output file destination is the folder where the .bam files are located.

Figure 3. Setting the output file name, species, and genome build during .bam file import

Set Output file to ChIP-Seq
Set Species to Homo sapiens using the drop-down menu
Set Genome build to hg18 using the drop-down menu
Select OK

The Bam Samples Manager dialog will open (Figure 4). This dialog can be used to add samples to the project (Add samples), remove samples (Remove samples), associate multiple files with particular samples (Manage samples), and map the chromosome names from the input files to the association files (Manage sequence names).

Figure 4. The Bam Sample Manager can be used to add, remove, and manage files and samples

Select Close

The Sort bam files dialog will open. Sorting is necessary for all imported .bam files, but you can choose to hide this hint in the future by selecting Please don't show me this hint again.

Select OK

The imported spreadsheet will open while the .bam files are sorted. Sorting progress is displayed on the progress bar in the lower left-hand corner of the Partek Genomics Suite window. Once sorting has completed, the spreadsheet will show one sample per row, with the sample names in column 1. Sample ID and the number of reads mapped to the reference genome for each sample in column 2. Number of alignments (Figure 5).

Figure 5. Imported .bam files with one sample in each row

Quality control for ChIP-Seq samples

We can check the quality of the samples using Partek Genomics Suite before analyzing the data.

Strand cross-correlation

In ChIP-Seq, genomic DNA is fragmented and target-protein-bound DNA fragments are purified by immunoprecipitation. These purified fragments are between 100 and 500 base pairs long depending on the protocol; however, because ChIP-Seq uses short-read sequencing (25 to 35 base pair reads) to maximize sequencing depth, only the ends of each fragment will be sequenced. Consequently, with single-end sequencing, the forward and reverse strand reads for each fragment will come from opposite ends of the fragment. At a protein-binding site, there will be two peaks of read enrichment, one from enrichment of forward strand reads and another from enrichment of reverse strand reads. The average distance between these peaks is termed the effective fragment length. Because the forward and reverse strand peaks are generated from a common set of fragments, the peaks should be roughly symmetrical. By phase shifting the data to the mid-point between the two peaks, a common read density plot can be created that shows single peaks at binding sites.

Strand Cross-Correlation allows us to use the symmetrical distribution of forward and reverse strand reads to calculate the effective fragment length (Kharchenko et al., 2008). The Pearson correlation coefficient between the read densities of the forward and reverse strands is calculated after phase shifts of between 0 and 500 base pairs. This is visualized with the phase shift range on the x-axis and the corresponding Pearson correlation coefficients between forward and reverse strand read densities on the y-axis (Figure 1). High-quality ChIP-Seq data will give a strong peak on the Strand Cross-Correlation plot at the effective fragment length. When calling peaks, the forward and reverse (or paired-end) reads are each phase-shifted by the effective fragment length to create a combined read density profile.
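The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Partek Genomics Suite implementation: the helper names (pearson, density, strand_cross_correlation), the use of read 5' end positions as the density, and the toy coordinates are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant signal
    return cov / sqrt(var_x * var_y)

def density(positions, length):
    """Per-base count of read 5' ends on one strand (hypothetical density)."""
    counts = [0] * length
    for pos in positions:
        if 0 <= pos < length:
            counts[pos] += 1
    return counts

def strand_cross_correlation(fwd_positions, rev_positions, length, max_shift=500):
    """Correlate forward density at i with reverse density at i + shift."""
    fwd = density(fwd_positions, length)
    rev = density(rev_positions, length)
    return {shift: pearson(fwd[:length - shift], rev[shift:])
            for shift in range(max_shift + 1)}

# Toy data: reverse-strand 5' ends lie 111 bp downstream of forward-strand 5' ends.
fwd_positions = [100, 102, 105, 600, 601, 1200, 1203]
rev_positions = [p + 111 for p in fwd_positions]
cc = strand_cross_correlation(fwd_positions, rev_positions, length=1500, max_shift=300)
best_shift = max(cc, key=cc.get)
print(best_shift)
```

The shift with the highest correlation plays the role of the effective fragment length in this toy example.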

For paired-end sequencing, Strand Cross-Correlation is calculated from the distribution of distances between the paired reads from the ends of each fragment.

We will perform Strand Cross-Correlation to identify the effective fragment length we can use when calling read enrichment peaks.

Select Strand Cross-Correlation from the QA/QC section of the ChIP-Seq workflow

If you have not run this step before, you will be asked if you would like to create a new QA/QC child spreadsheet.

If prompted, select Yes to create a new child spreadsheet for QA/QC

After running Strand Cross-Correlation, the Strand Separation of Samples viewer will open as a new tab (Figure 1).

Figure 1. Strand Cross-Correlation profile plot showing possible effective fragment lengths on the x-axis and resulting Pearson correlation coefficients on the y-axis.

For the chip sample (blue), the peak is at 111 base pairs, which corresponds to an effective fragment length of 111 base pairs. This value can be determined by examining the values in the strand_correlation spreadsheet (Figure 2), by moving the cursor over the peak in the graph, or by sorting the data in the spreadsheet. The Strand Separation of Samples graph is also useful as a quality control measure. In lower quality ChIP-Seq data, we would also observe a peak at the read length. The ratio between the Pearson correlation coefficients at the effective fragment length peak and at the read length peak, each normalized by subtracting the minimum correlation coefficient, [cc(fragment length) - min(cc)] / [cc(read length) - min(cc)], should be greater than 0.8 to meet the minimum quality standards recommended by the ENCODE project (Landt et al., 2012).
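As a worked example of the ratio [cc(fragment length) - min(cc)] / [cc(read length) - min(cc)], the short Python sketch below applies the formula to a table of correlation coefficients keyed by shift. The function name and the coefficient values are hypothetical and are not taken from the tutorial data.

```python
def normalized_correlation_ratio(cc_by_shift, fragment_length, read_length):
    """[cc(fragment length) - min(cc)] / [cc(read length) - min(cc)]."""
    min_cc = min(cc_by_shift.values())
    numerator = cc_by_shift[fragment_length] - min_cc
    denominator = cc_by_shift[read_length] - min_cc
    if denominator == 0:
        raise ValueError("cc(read length) equals min(cc); ratio is undefined")
    return numerator / denominator

# Hypothetical coefficients for a few shifts (not values from chip.bam).
cc_by_shift = {0: 0.10, 26: 0.14, 111: 0.30, 300: 0.08}
ratio = normalized_correlation_ratio(cc_by_shift, fragment_length=111, read_length=26)
passes_threshold = ratio > 0.8
print(round(ratio, 2), passes_threshold)
```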

The mock sample (red) does not have an effective fragment length peak because it does not have read density peaks to phase shift. It does have a small peak at the sequencing read length of 26 base pairs.

Figure 2. The strand correlation spreadsheet shows the Pearson correlation coefficients for each relative strand shift value (effective fragment length)

Checking the distribution of reads

BAM files can contain both aligned and unaligned reads. The spreadsheet created during import shows the number of reads that were aligned to the reference genome. A large number of unaligned reads may be the result of poor quality sequencing data or alignment problems. It may also be useful to know how many reads map to more than one location in the genome if the options used during alignment supported multiple-mapped reads.

Select Alignments per read from the QA/QC section of the ChIP-Seq workflow

A new spreadsheet named Alignment_Counts will be generated (Figure 3).

Figure 3. Unaligned reads have been removed from these BAM files and the alignment options did not permit mapping to more than one location

The titles of columns 2. 0 Single End Alignments Per Read and 3. 1 Single End Alignment Per Read indicate that this is single end data. Column 2 shows the number of unaligned reads, while column 3 shows the number of reads that aligned exactly once. If the BAM files used in this tutorial included reads that mapped to more than one location in the genome, there would be additional columns.
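The idea behind this table can be illustrated with a short Python tally. This is a sketch only: it assumes the read names and the read name of each alignment record are already available as plain lists (for example, after parsing a BAM file with a tool of your choice), and the function and variable names are hypothetical.

```python
from collections import Counter

def alignments_per_read(alignment_read_names, all_read_names):
    """Count how many reads have 0, 1, 2, ... alignments."""
    hits = Counter(alignment_read_names)  # alignments seen per read name
    return Counter(hits[name] for name in all_read_names)

# Toy input: r2 aligns twice, r4 is unaligned.
all_reads = ["r1", "r2", "r3", "r4"]
alignments = ["r1", "r2", "r2", "r3"]
tally = alignments_per_read(alignments, all_reads)
for n_alignments in sorted(tally):
    print(n_alignments, "alignments per read:", tally[n_alignments], "reads")
```

In this toy input, a column for two alignments per read appears because one read maps to two locations, which mirrors the additional columns mentioned above.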

Detecting peaks and enriched regions in ChIP-Seq data

Binding sites for the DNA-binding protein of interest are indicated by peaks of enriched sequencing read density. How are peaks calculated from reads in Partek Genomics Suite?

Using the effective fragment length calculated by Strand Cross-Correlation, each read is extended in the 3' direction by the effective fragment length and overlapping extended reads are merged into single peaks. For paired-end reads, the distance between paired reads is used as the fragment length and overlapping fragments are merged into peaks. For peak detection, the genome is divided into windows of a user-defined size and the number of fragments whose mid-points fall within each window is counted. A model for expected read density (a zero-truncated negative binomial) is used to determine which peaks are significantly enriched at a user-defined false discovery rate (FDR). See the ChIP-Seq white paper for more information on the peak-finding algorithm and tips for setting the Fragment extension and window sizes.
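The window-counting step can be sketched in Python as follows. This is a simplified illustration, not the Partek Genomics Suite algorithm: a fixed count threshold (min_count) stands in for the zero-truncated negative binomial model and FDR cut-off, adjacent enriched windows are simply merged, and all function names and read coordinates are hypothetical.

```python
from collections import Counter

def fragment_midpoint(five_prime_end, strand, fragment_length):
    """Mid-point of a read extended in its 3' direction to fragment_length."""
    half = fragment_length // 2
    return five_prime_end + half if strand == "+" else five_prime_end - half

def enriched_regions(reads, fragment_length, window_size, min_count):
    """Count fragment mid-points per window, then merge adjacent enriched windows."""
    counts = Counter()
    for five_prime_end, strand in reads:
        mid = fragment_midpoint(five_prime_end, strand, fragment_length)
        counts[mid // window_size] += 1
    regions = []
    for window in sorted(w for w, c in counts.items() if c >= min_count):
        start, stop = window * window_size, (window + 1) * window_size
        if regions and regions[-1][1] >= start:
            regions[-1] = (regions[-1][0], stop)  # extend the previous region
        else:
            regions.append((start, stop))
    return regions

# Toy reads on one chromosome: (5' end, strand).
reads = [(1000, "+"), (1002, "+"), (1005, "+"),
         (1111, "-"), (1113, "-"), (1116, "-"),
         (1060, "+"), (1065, "+"), (1070, "+"),
         (5000, "+")]
regions = enriched_regions(reads, fragment_length=111, window_size=111, min_count=3)
print(regions)
```

The isolated read near position 5000 does not reach the threshold, so only the cluster of forward and reverse reads is reported as a region.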

Select spreadsheet 1 (ChIP-Seq) from the spreadsheet tree
Select Detect peaks from the Peak Analysis section of the ChIP-Seq workflow

The Peak Detection dialog will open. Configure the dialog as shown (Figure 1).

Figure 1. Configuring the peak detection dialog. The appropriate settings will depend on your experimental design and data.

Select Maximum average fragment size for Fragment Extensions
Set Maximum average fragment size to 111

Your choice for Maximum average fragment size is based on your experimental design. If you have used a control antibody that binds DNA, such as an IgG control, you could use different fragment lengths for each sample based on its effective fragment length by selecting the Individual maximum fragment sizes option. Here, we have chosen the effective fragment length of 111 base pairs calculated using Strand Cross-Correlation.

Select Reference sample from Reference sample
Select mock from the Reference sample drop-down menu
Set Set the window size to (base pairs) to 111

The peak detection algorithm divides the genome into windows and finds the windows that are enriched for reads based on an FDR cut-off value. Here, we have chosen to match the window size to the maximum average fragment size.

Select Overlapping for How should windows be merged?
Set The fraction of false positive peaks allowed to 0.001

The Peak Cut-off FDR determines the cut-off for calling peaks. Setting a lower value demands greater differences between mock and chip samples for a peak to be called; a false discovery rate of 0.001 anticipates 1 false positive per 1000 peaks called.

Select Entire region, spanning all merged windows for Which regions should be reported?

Optimal peak detection settings depend on your experimental design and data, so fine tuning may be required. Because transcription factor binding sites tend to have localized and sharp clusters of reads, the window size used during the analysis of a transcription factor study can be left relatively small, approximately the same as the average fragment length, and the option to allow for gaps between enriched windows does not need to be used. Region in the window with most reads could also be selected to report a narrower region for each peak call. Conversely, histone modification peaks tend to be subtle and diffuse. To analyze histone modification ChIP-Seq data, larger window sizes, combining neighboring windows into larger windows using Within a gap distance of, and reporting entire regions using Entire region, spanning all merged windows might be appropriate.

A convenient way to visualize the relationship between window size and gap size is to select the More info link at the top of the Peak Detection dialog box. A simulated read count histogram will open below the Description of Peak Detection section (Figure 2). The blue bars underneath the histogram will reflect how regions are detected and reported using your current Peak Detection settings. Try changing the How should windows be merged or Which regions should be reported? options to visualize their effects on peak detection.

Figure 2. The visual guide helps show the impact of window size and result reporting settings on peak calling.

Select OK to run the peak detection algorithm with your chosen settings

Peak Detection generates a new child spreadsheet, regions (peaks) (Figure 3).

Figure 3. Peaks spreadsheet lists regions with significant peak enrichment with one row per region.

This spreadsheet is sorted by chromosome number and genomic location. Each row represents one genomic region of peak enrichment. The columns are:

Column 1. Chromosome gives the chromosome location of region

Column 2. Start gives the start of region (inclusive)

Column 3. Stop gives the end of region (exclusive)

Column 4. Sample ID gives the sample containing the enriched region

Column 5. Interval length gives the length of the region, Stop - Start, in base pairs

Column 6. Maximum Extended Reads in Window gives the greatest number of extended reads in any of the windows of a region

Column 7. Reads per Million (RPM) divides the read count for the region by the total number of aligned reads in the sample (in millions). This helps you compare peaks across samples, especially when there is a large difference in the number of aligned reads between samples.

Column 8. Mann-Whitney p-value identifies the separation between forward and reverse peaks for single-end reads using the Mann-Whitney U-test. Lower p-values indicate better separation. This p-value can be used if there was no control sample or to eliminate regions called due to PCR bias.

Columns 9-10. Total reads in region gives the total number of non-extended reads for each sample in the given genomic region. One column for each sample.

Column 11. p-value(Sample ID vs. mock) compares the sample specified in column 4 to the reference sample for this genomic region using a one-tailed binomial test. A low p-value means there are significantly more reads in the sample specified in column 4 than in the mock sample. This column is only included if a reference sample is specified.

Column 12. scaled fold change (Sample ID vs. mock) compares intensity of signal between the sample specified in column 4 to the reference sample in the given genomic region. The fold-change is scaled by a ratio of the number of reads for each sample (IP vs. control) on a per-chromosome basis. Scaled fold changes >1 indicate more enrichment in the IP-sample than in the control sample. This column is only included if a reference sample is specified.

Columns 13-14. overlap percent gives the fraction of the given genomic region that overlaps a called peak region from the indicated sample. For example, the values of 100% in column 13 and 0% in column 14 indicate regions detected in the chip sample, but not in the mock sample. Similarly, regions with the value of 100% in column 14 were detected in the mock sample.
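Several of the column definitions above are simple arithmetic, and a short Python sketch may make them concrete. The functions below are illustrative only: the names are hypothetical, the choice of which read count enters the RPM calculation is an assumption, and the null probability p_sample used in the one-tailed binomial test is a placeholder because the tutorial does not state how Partek Genomics Suite sets it.

```python
from math import comb

def interval_length(start, stop):
    """Length of a region with inclusive start and exclusive stop."""
    return stop - start

def reads_per_million(read_count, total_aligned_reads):
    """Read count scaled by the sample's aligned reads, in millions."""
    return read_count / (total_aligned_reads / 1_000_000)

def one_tailed_binomial_p(sample_reads, control_reads, p_sample=0.5):
    """P(X >= sample_reads) for X ~ Binomial(sample_reads + control_reads, p_sample).

    p_sample is the assumed chance, under the null hypothesis, that a read in
    the region comes from the sample rather than the control (placeholder value).
    """
    n = sample_reads + control_reads
    return sum(comb(n, k) * p_sample ** k * (1 - p_sample) ** (n - k)
               for k in range(sample_reads, n + 1))

length = interval_length(1000, 1250)
rpm = reads_per_million(34, 1_700_000)
p_value = one_tailed_binomial_p(sample_reads=9, control_reads=1)
print(length, round(rpm, 2), round(p_value, 4))
```

A small p-value here means that the region holds more sample reads than the assumed null proportion would predict, which is the direction of the test described for column 11.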

Additional Assistance

If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.

Creating a list of enriched regions

In this section, we will create a list of peaks significantly enriched in the ChIP sample versus the control sample.

Select Create a list of enriched regions from the Peak Analysis section of the ChIP-Seq workflow
Select Specify New Criteria (Figure 1)

Figure 1. List creator for ChIP-Seq data allows you to create lists using preset or custom criteria

Configure the new criteria as shown (Figure 2).

Name the criteria p-value filtered
Select 1/regions (peaks) from the Spreadsheet drop-down menu
Select 11. p-value(Sample ID vs. mock) from the Column drop-down menu
Select significant with FDR of from the include p-values drop-down menu and set the value to 0.05

Figure 2. Creating a criteria that includes regions significantly enriched in ChIP vs. mock

Select OK to add the criteria to the criteria list (Figure 3)

Figure 3. New criteria are added to the criteria list

Select Save
Select p-value filtered from the list of criteria (Figure 4)

Figure 4. Choosing criteria to save as lists

Select OK

The new spreadsheet will open (Figure 5).

Figure 5. Spreadsheet with regions that are significantly enriched in the ChIP sample vs. control
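To illustrate what filtering p-values at an FDR of 0.05 means, the Python sketch below applies the Benjamini-Hochberg step-up procedure to a list of p-values. This is an assumption made for illustration: the tutorial does not say which FDR procedure the List Creator uses, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg_significant(p_values, fdr=0.05):
    """Return the indices of p-values that pass the step-up procedure at `fdr`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its rank-scaled threshold.
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical p-values for five regions.
region_p_values = [0.001, 0.2, 0.012, 0.045, 0.025]
kept = benjamini_hochberg_significant(region_p_values, fdr=0.05)
print(kept)
```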

Other List Creator operations like the Venn Diagram, Union (Or), and Intersection (And) of the lists could be used to create different lists of enriched peaks. For example, you could filter on the intersection between Strand Separation FDR of 0.05 and Peaks not in mock or filter by scaled fold change or apply a minimum number of reads per million. The choice of what peaks you want to consider for downstream analysis depends on the goals and details of your experimental design.
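The Union (Or) and Intersection (And) operations correspond to ordinary set operations on region lists. The Python sketch below shows the idea with hypothetical regions written as (chromosome, start, stop) tuples; it matches regions only when they are identical, which is a simplification.

```python
# Two hypothetical lists of regions, e.g. from two different criteria.
strand_separation = {("chr1", 1000, 1250), ("chr2", 500, 700), ("chr5", 40, 300)}
not_in_mock = {("chr1", 1000, 1250), ("chr5", 40, 300), ("chr9", 10, 90)}

in_both = strand_separation & not_in_mock    # Intersection (And)
in_either = strand_separation | not_in_mock  # Union (Or)

print(sorted(in_both))
print(len(in_either))
```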

Identifying novel and known motifs

With a list of enriched regions, you can now identify recurring patterns or motifs in these regions. Transcription factors bind sites throughout the genome, but each has a characteristic sequence it binds - a consensus sequence that appears in most of its binding sites. By searching for binding site motifs, you can determine the consensus sequence for a transcription factor and predict potential binding locations throughout the genome that may not have been found in your experiment.

Partek Genomics Suite detects de novo motifs using the Gibbs motif sampler (Neuwald et al., 1995) and can search for known transcription factor binding sites using a database such as JASPAR.

Discover de novo motifs

Select Motif Discovery from the Peak Analysis section of the ChIP-Seq workflow
Select Discover de novo motifs
Select OK

The Detect Motifs dialog will open to allow you to configure the search (Figure 1).

Figure 1. Configuring search parameters for de novo motifs

Select 1/p-value_filtered from the Spreadsheet with genomic regions drop-down menu
Set Number of Motifs to 1
Set Discover motifs of length to 6 bp to 16 bp
Set Result file to Motifs
Select OK

If you have not previously downloaded the reference genome on your computer, you may be asked if you would like to download the .2bit reference genome. If prompted, select Automatically download a .2bit file then select OK. If Partek Genomics Suite cannot connect to the internet, this option may not be available. If not, you will need to download the .2bit file from the UCSC Genome Browser and import it by selecting Manually specify a .2bit file and choosing the downloaded .2bit file. The reference genome map is required to determine which genes overlap the enriched peak regions and to display the aligned sequences in the Genome Viewer.

A motif visualization tab, Sequence Logo, will open and two spreadsheets will be generated. One spreadsheet, motifs (Motifs), contains information about the motif. The other, instances (Motifs_instances.txt), lists the genomic locations of the motif.

Description of Motif Detection Output

Sequence Logo Window

The Sequence Logo tab (Figure 2) opens after motif detection and displays the most significant motif found in the regions listed in the source spreadsheet.

Figure 2. Viewing the binding site for NRSF. Use the blue arrows to cycle through views of all motifs found (if there is more than one). Select Reverse to view the reverse complement sequence.

In this case, the motif finder discovered a motif in the NRSF-enriched regions that is 16 base pairs in length. The height of each position is the relative entropy (in bits) and indicates the importance of a base at a particular location in the binding site.

To view the reverse complement of the motif, select Reverse.
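The two ideas in this section, relative entropy per position and the reverse complement, can be sketched in Python. This is an illustration only: it assumes a uniform background by default, and the function names and the example counts and sequence are hypothetical rather than taken from the NRSF result.

```python
from math import log2

def relative_entropy_bits(counts, background=(0.25, 0.25, 0.25, 0.25)):
    """Relative entropy (bits) of one motif position from {A, C, G, T} counts."""
    total = sum(counts)
    bits = 0.0
    for count, q in zip(counts, background):
        if count > 0:  # a base that never occurs contributes nothing
            p = count / total
            bits += p * log2(p / q)
    return bits

def reverse_complement(sequence):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]

fully_conserved = relative_entropy_bits((10, 0, 0, 0))
two_bases = relative_entropy_bits((5, 5, 0, 0))
uninformative = relative_entropy_bits((3, 3, 3, 3))
flipped = reverse_complement("TCAGCACC")
print(fully_conserved, two_bases, uninformative, flipped)
```

With a uniform background, a fully conserved position reaches the maximum of 2 bits and a position where all four bases are equally frequent contributes 0 bits.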

Motifs spreadsheet

The motif information spreadsheet (Figure 3), Motifs, lists the information about all motifs discovered during de novo Motif Detection. This includes five columns describing each motif.

Figure 3. Viewing the Motifs spreadsheet

1. Counts gives the summed counts for each base call across all occurrences of the motif in the region list as {A, C, G, T}

2. Consensus Sequence gives the consensus sequence of the motif in IUPAC nucleotide codes

3. Motif ID gives a unique ID to each discovered motif using its row in the Motifs spreadsheet

4. Log Likelihood Ratio scores the relative likelihood that the pattern did not occur by chance, with larger numbers indicating that it is less likely to have occurred by chance

5. Background frequency (A,C,G,T) gives the frequency of each of the bases in all the sequences of that motif
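To show how per-position counts relate to a consensus written in IUPAC nucleotide codes, here is a small Python sketch. The rule used to decide which bases count at a position (a minimum fraction of 0.25) is a hypothetical choice for illustration; the tutorial does not describe the rule Partek Genomics Suite applies.

```python
# IUPAC nucleotide codes keyed by the sorted set of bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def consensus(count_matrix, min_fraction=0.25):
    """Consensus from per-position {A, C, G, T} counts (each position non-empty)."""
    letters = []
    for counts in count_matrix:
        total = sum(counts)
        kept = "".join(base for base, count in zip("ACGT", counts)
                       if count / total >= min_fraction)
        letters.append(IUPAC[kept])
    return "".join(letters)

# Hypothetical counts for a four-position motif, one tuple per position.
count_matrix = [(10, 0, 0, 0), (5, 5, 0, 0), (0, 0, 6, 4), (3, 2, 3, 2)]
motif_consensus = consensus(count_matrix)
print(motif_consensus)
```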

You can bring up the Sequence Logo visualization of a listed motif by right-clicking on the row header and selecting Logo View from the pop-up menu.

Motif_instances spreadsheet

The instances (Motif_instances) spreadsheet (Figure 4) is a child spreadsheet of the Motifs spreadsheet. It details all the locations of the motif(s) detected in the enriched regions. Each row lists a putative binding site for a motif. The columns give detailed information about the putative binding sites.

Figure 4. Viewing the instances spreadsheet

1-4. chromosome, start, stop, strand give the position

5. Motif ID gives the identity of the motif

6. instance gives the sequence of this instance of the motif

7. score gives the log ratio of the probability that this sequence was generated by the motif versus the background distribution. A higher number indicates a better chance that the sequence is an instance of the motif.
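The score in column 7 can be illustrated with a short Python sketch that sums, over the positions of an instance, the log of the motif probability divided by the background probability. The natural logarithm, the probability tables, and the function name are assumptions made for this example; the tutorial does not specify the log base or the exact model.

```python
from math import log

def log_ratio_score(instance, position_probs, background):
    """Sum of log(P(base | motif position) / P(base | background))."""
    score = 0.0
    for base, probs in zip(instance, position_probs):
        score += log(probs[base] / background[base])
    return score

# Hypothetical two-position motif that prefers "AG"; no probability is zero.
position_probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

good_match = log_ratio_score("AG", position_probs, background)
poor_match = log_ratio_score("CT", position_probs, background)
print(round(good_match, 3), round(poor_match, 3))
```

A positive score means the sequence is better explained by the motif than by the background, and a negative score means the opposite.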

Search JASPAR for known motifs

Select Motif discovery from the Peak Analysis section of the ChIP-Seq workflow
Select Search for known motifs
Select OK

Search for known motifs will search the JASPAR database for motifs that are over-represented in the list of sequences in the significant regions list. The JASPAR database will download automatically if needed during the Search for known motifs step. Downloading the JASPAR database will create a spreadsheet in your experiment named JASPAR.txt that contains all of the species-specific motifs in the database. To visualize the motifs, right-click on a row in the JASPAR.txt spreadsheet and select Logo View.

Before Search for known motifs runs, we need to configure the search (Figure 5).

Figure 5. Configuring a search for known motifs in the JASPAR database

Select 1/p-value_filtered (p-value filtered.txt) from the Choose Region Spreadsheet drop-down menu
Select Search using motifs specified in: for Choose Motifs to Search
Set Search using motifs specified in: to 2 (JASPAR.txt) using the drop-down menu
Set Search for to All Motifs using the drop-down menu
Set Sequence Quality >= to 0.7
Name the result file MotifSearch
Select OK

Because we are searching for around 1200 motifs, the process will take some time to complete. Progress is displayed in the progress bar in the lower left-hand side of the Search for Motif(s) in Sequences dialog (Figure 6).

Figure 6. Progress in the motif search will display in the progress bar

Two spreadsheets are created, similar to those from de novo motif discovery: the motif_summary (MotifSearch) spreadsheet (Figure 7) and the motif_instances (MotifSearch.instance) spreadsheet.

Figure 7. Viewing the results of motif search

In the MotifSearch spreadsheet, each motif used in the motif search is shown. The columns detail the results of the search for each motif that was found in the reads.

1. Motif gives the name or ID of the motif

2. Probability of Occurrence gives the probability of detecting a false positive for this motif in a random DNA sequence

3. Expected Number of Outcomes gives the Probability of Occurrence multiplied by the summed length of the reads

4. Actual Number of Occurrences gives a count of sequences that match the known motif in the reads

5. p-value is the uncorrected p-value (binomial test)
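The relationship between columns 2 to 5 can be sketched in Python. The expected count follows directly from the definition of column 3. The p-value part is an assumption: it treats every base of the summed length as one binomial trial and reports the upper tail, which may differ from the exact test Partek Genomics Suite performs. The function names and input numbers are hypothetical.

```python
from math import exp, lgamma, log

def expected_occurrences(probability_of_occurrence, summed_length):
    """Column 3: probability of occurrence multiplied by the summed length."""
    return probability_of_occurrence * summed_length

def log_binomial_pmf(k, n, p):
    """log of C(n, k) * p**k * (1 - p)**(n - k), stable for large n."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def binomial_upper_tail(observed, n, p):
    """P(X >= observed) for X ~ Binomial(n, p), with 0 < p < 1."""
    return sum(exp(log_binomial_pmf(k, n, p)) for k in range(observed, n + 1))

probability = 1e-5        # hypothetical Probability of Occurrence
summed_length = 200_000   # hypothetical summed sequence length in bases
expected = expected_occurrences(probability, summed_length)
p_value = binomial_upper_tail(12, summed_length, probability)
print(round(expected, 2), p_value < 1e-4)
```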

As you can see, REST, which is another name for NRSF, is near the top of the list as one of the most significantly over-represented motifs (Figure 7). This motif agrees with the motif found in the de novo motif detection step. Interestingly, other motifs appear a significant number of times in the ChIP-Seq peaks and may represent possible co-factors or regulators.

The motif_instances spreadsheet contains all instances of the motifs from the motif_summary spreadsheet in a format identical to the instances spreadsheet from de novo motif detection.

Generating a list of regions containing a motif

While the motif_instances spreadsheet contains every instance of every motif, it may be useful to create a spreadsheet with just instances of one motif or a select group of motifs. Let's do this for both REST motifs.

Select the motif_instances spreadsheet in the spreadsheet tree
Right-click the 5. Motif Name column
Select Find / Replace / Select... from the pop-up menu (Figure 8)

Figure 8. Finding all REST peaks (step 1)

Set Find What: to REST
Select By Columns for Search:
Select Only in column with 5. Motif Name selected from the drop-down menu
Select Select All (Figure 9)

Figure 9. Selecting all REST instances in motif_instances spreadsheet (step 2)

This finds and selects every instance of REST in column 5. Motif Name.

Select Close

In the motif_instances spreadsheet, the selected rows are highlighted.

Right-click on the first highlighted row visible; in this example, we see row 13196
Select Filter Include from the pop-up menu (Figure 10)

Figure 10. Filtering for selected rows

The spreadsheet will now include 2098 rows and a black and yellow bar will appear on the right-hand side of the spreadsheet (Figure 11). The black and yellow bar is a filter indicator showing the fraction of the spreadsheet currently visible as yellow and the filtered fraction as black.

Figure 11. Filtered motif_instances spreadsheet containing 2098 instances of the REST motifs
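The find-and-filter steps above are equivalent to keeping only the rows whose motif name contains the search text. A minimal Python sketch of that idea follows; the rows, column keys, and the filter_include function are hypothetical stand-ins for the spreadsheet and the Filter Include command.

```python
def filter_include(rows, column, text):
    """Keep the rows whose value in `column` contains `text`."""
    return [row for row in rows if text in row[column]]

# Hypothetical rows from a motif instances table.
motif_instances = [
    {"chromosome": "chr1", "start": 100, "stop": 121, "Motif Name": "REST"},
    {"chromosome": "chr1", "start": 900, "stop": 910, "Motif Name": "SP1"},
    {"chromosome": "chr7", "start": 450, "stop": 471, "Motif Name": "REST"},
]
rest_rows = filter_include(motif_instances, "Motif Name", "REST")
print(len(rest_rows), "of", len(motif_instances), "rows kept")
```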

To create a spreadsheet that contains only the REST instances, we can clone the motif_instances spreadsheet while the filter is applied.

Right-click on motif_instances in the spreadsheet navigator
Select Clone... from the pop-up menu
Set the Name of resulting copy as REST
Select 1/p-value_filtered/motif_summary (MotifSearch) from the Create as a child of spreadsheet drop-down menu
Select OK

This creates a temporary spreadsheet rest from the filtered motif_instances spreadsheet. We can now save the new spreadsheet.

Select rest from the spreadsheet tree
Name the file REST
Select Save

We can now remove the filter from the source motif_instances spreadsheet.

Select motif_instances from the spreadsheet tree
Right-click the filter bar
Select Clear Filter

References

Neuwald, A. F., Liu, J. S., & Lawrence, C. E. (1995). Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Science, 4(8), 1618-1632.

Finding nearest genomic features

In this section, you will learn how to find genomic features (genes) that are near the IP-enriched regions of the data. You will also learn how to classify the peak locations by gene section (5’ UTR, 3’ UTR, Promoter, exon, intron).

Finding the nearest genomic features

Select p-value_filtered from the spreadsheet tree
Select Find Nearest Genomic Feature from the Peak Analysis section of the ChIP-Seq workflow

The Output Overlapping Features dialog will open (Figure 1).

Figure 1. Selecting a database for overlapping features

With this dialog, you can specify the reference database.

Select RefSeq Transcripts 81 - 2017-08-02 or your preferred annotation database

The promoter region can also be defined. The default settings are appropriate in this case.

Select OK

The resulting spreadsheet, gene-list, is a child of the p-value_filtered spreadsheet (Figure 2). Each row represents a transcript.

Figure 2. Viewing genes overlapped by regions

Column 1. transcript chromosome gives the chromosome location of transcript

Column 2. transcript start gives the start of transcript (inclusive)

Column 3. transcript stop gives the end of transcript (exclusive)

Column 4. strand gives the strand of the transcript

Column 5. Transcript ID gives the identity of the transcript

Column 6. Gene Symbol gives the identity of the gene

Column 7. Distance to TSS gives the distance of each enriched region to the transcription start site in base pairs with positive indicating downstream and negative indicating upstream

Column 8. Percent overlap with gene gives the percent of the gene that overlaps with the region

Column 9. Percent overlap with region gives the percent of the region that overlaps with the gene

Percent overlap with gene is more likely to be close to 1 in cases where one region covers several genes, in histone studies, for example. Percent overlap with region is likely to be close to 1 in cases where a region is relatively small and is found completely within a gene, in transcription factor binding studies, for example. If both columns are close to 1, then the gene and the region have nearly the same start and stop sites. If both columns are close to 0, then the region does not overlap with the gene directly and likely covers only the promoter region.
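The two overlap columns can be illustrated with a short Python sketch. It assumes half-open coordinates (inclusive start, exclusive stop) and returns fractions between 0 and 1; the function name and coordinates are hypothetical.

```python
def overlap_fractions(gene_start, gene_stop, region_start, region_stop):
    """Return (fraction of gene covered, fraction of region covered)."""
    overlap = max(0, min(gene_stop, region_stop) - max(gene_start, region_start))
    return (overlap / (gene_stop - gene_start),
            overlap / (region_stop - region_start))

# A small region inside a gene, typical of a transcription factor peak.
small_peak = overlap_fractions(1000, 5000, 1200, 1400)
# A broad region that spans a whole gene, typical of a histone mark.
broad_region = overlap_fractions(1000, 2000, 0, 10000)
# A region that does not touch the gene body.
no_overlap = overlap_fractions(1000, 2000, 3000, 3100)
print(small_peak, broad_region, no_overlap)
```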

Classifying regions by gene section

Another way to interpret the genomic location of peaks is to use Classify regions by gene section.

Select p-value_filtered from the spreadsheet tree
Select Classify regions by gene section from the Peak Analysis section of the ChIP-Seq workflow

The Output Overlapping Features dialog will open.

Select RefSeq Transcripts 81 - 2017-08-02 or your preferred annotation database

The promoter region can also be defined. The default settings are appropriate in this case. The results can be further configured to give one result per detected region or one result per genomic feature. The default setting, one result per detected region, is appropriate in this case.

Select OK

A new spreadsheet, gene-classification, will be generated (Figure 3).

Figure 3. Classifying regions by gene section

Columns 1-6 have the same contents we saw in gene-list.

Column 7. Gene Section gives the section of the gene that overlaps with the region

Column 8. Distance to TSS gives the distance of each enriched region to the transcription start site in base pairs with positive indicating downstream and negative indicating upstream

Column 9. Distance to nearest gene gives the distance of each enriched region to the nearest gene in base pairs with positive indicating downstream and negative indicating upstream

Column 10. Sample ID gives the sample in which the region is enriched
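The sign convention used by the distance columns can be illustrated with a short Python sketch. The tutorial does not state which point of the region is measured or how strand is handled, so the choices here are assumptions: the function name distance_to_tss is hypothetical, the region midpoint is used as the reference point, and "downstream" is interpreted relative to the transcript strand.

```python
def distance_to_tss(region_start, region_stop, tss, strand):
    """Signed distance in bp from a TSS to the midpoint of a region.

    Positive: region is downstream of the TSS; negative: upstream.
    Assumption: direction is taken relative to the transcript strand.
    """
    midpoint = (region_start + region_stop) // 2
    offset = midpoint - tss
    return offset if strand == "+" else -offset


# The same region is upstream of a forward-strand TSS at 1500 ...
print(distance_to_tss(900, 1000, 1500, "+"))  # -550
# ... but downstream of a reverse-strand TSS at the same coordinate.
print(distance_to_tss(900, 1000, 1500, "-"))  # 550
```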

Visualizing reads and enriched regions

We can visualize the reads and enriched regions using the Partek Genome Viewer.

Select ChIP-Seq from the spreadsheet tree
Select Chromosome View from the Visualization section of the ChIP-Seq workflow

The Chromosome View tab will open. The default tracks are the transcript tracks, the sequence visualization tracks, and the cytoband track (Figure 1).

Figure 1. Viewing ChIP-Seq data in Chromosome View

We can add additional tracks to view the results of our analysis.

Select New Track from the left-hand side of the Tracks tab
Select Add tracks from a list of spreadsheets
Select Next >
Select the p-value_filtered.txt and Motifs_instances.txt tracks on the Tracks panel
De-select Aligned Reads
Select Create (Figure 2)

Figure 2. Adding spreadsheet tracks to Genome Viewer

This will display the enriched regions found in the samples and the locations of the motif instances from the de novo motif discovery (Figure 3).

Figure 3. Adding p-value and motif binding tracks

The two tracks display the detected regions at each location on the chromosome for the NRSF-enriched sample, chip, and align them with the de novo discovered motif binding sites. If you have not gone through the steps for peak detection and motif discovery, these tracks will not be available.

Select RefSeq Transcripts - 2016-10-18 (hg18) (+) on the Tracks panel
Set Strand to Both using the drop-down menu in the Track section of the track properties panel
Select Apply

The RefSeq Transcripts track now shows transcripts from both strands. We can remove the dedicated (-) strand track.

Select RefSeq Transcripts - 2016-10-18 (hg18) (-) on the Tracks panel
Select Remove Track

Several other tracks will not be used in our analysis; we can remove them as well.

Select Legend: Base Colors, Genome Sequences (hg18), and Cytoband (hg18) on the Tracks panel by left clicking on each while holding the Ctrl key on your keyboard
Select Remove Track

The height of many tracks can be adjusted by changing the Track height setting in the Profile tab.

Select Regions (1/p-value_filtered (p-value filtered.txt)) on the Tracks panel
Set Track height to 25 using the slider
Select Apply
Repeat for Regions (1/p-value_filtered/motifs/instances (Motifs_instances.txt))

This view clearly shows the Bam Profile tracks for the chip and mock samples alongside smaller tracks indicating motif binding sites, significant peaks, and RefSeq transcripts (Figure 4).

Figure 4. Genome Viewer tracks can be modified to facilitate data exploration and analysis

We can also see strand-specific information by color-coding forward and reverse strand read information on the Bam Profile (chip) track.

Select Bam Profile (chip) on the Tracks panel
Select Alignments in the Display Color Options section of the Style tab
Select Strands under Color by
Select Apply

A legend track will be added at the bottom of the viewer showing that forward strand reads are colored green and reverse strand reads are colored red. We can move this track to below Bam Profile (chip).

Left-click Legend: strands in the Tracks panel and drag it up the list to below Bam Profile (chip) (Figure 5)

Figure 5. Move tracks by left-clicking and dragging them in the Tracks panel

Chromosome View opens to a whole-chromosome view of chromosome 1. To examine individual enriched regions, we can zoom in (Figure 6).

Figure 6. Viewing the zoomed-in view of an enriched region showing two possible binding sites at the location

We can practice this using a gene highlighted in the paper describing this data set, NEUROD1.

Type NEUROD1 in the navigation bar
Select Enter on your keyboard to navigate to NEUROD1 (Figure 7)

Figure 7. Zooming to the NEUROD1 gene

NEUROD1 contains a binding site for the NRSF motif. Notice that the enriched region for the NRSF transcription factor is within the NEUROD1 gene. As discussed in the Johnson et al. paper, NRSF is implicated in the repression of NEUROD1, but it was unknown exactly where the NRSF binding occurred. This data indicates that the binding site is within the NEUROD1 gene itself, as shown by the orange box in the Regions track. We can zoom in further to view the forward and reverse strand read histogram.

Left-click and drag to draw a box around the predicted binding site (Figure 8)

Figure 8. Using Zoom/Navigate Mode. Drawing a box around a region on any track will zoom all tracks to that region

Zoomed in further, we can see the intersecting peaks from forward and reverse strand reads (Figure 9).

Figure 9. Zoomed-in view of NEUROD1 binding site for NRSF

Detecting peaks and enriched regions in ChIP-Seq data

Binding sites for the DNA-binding protein of interest are indicated by peaks of enriched sequencing read density. How are peaks calculated from reads in Partek Genomics Suite?

Select spreadsheet 1 (ChIP-Seq) from the spreadsheet tree
Select Detect peaks from the Peak Analysis section of the ChIP-Seq workflow

The Peak Detection dialog will open. Configure the dialog as shown (Figure 1).

Figure 1. Configuring the peak detection dialog. The appropriate settings will depend on your experimental design and data.

Select Maximum average fragment size for Fragment Extensions
Set Maximum average fragment size to 111

Select Reference sample from Reference sample
Select mock from the Reference sample drop-down menu
Set Set the window size to (base pairs) to 111

Select Overlapping for How should windows be merged?
Set The fraction of false positive peaks allowed to 0.001

Select Entire region, spanning all merged windows for Which regions should be reported?

Figure 2. The visual guide helps show the impact of window size and result reporting settings on peak calling.

Select OK to run the peak detection algorithm with your chosen settings

Peak Detection generates a new child spreadsheet, regions (peaks) (Figure 3).

Figure 3. Peaks spreadsheet lists regions with significant peak enrichment with one row per region.

This spreadsheet is sorted by chromosome number and genomic location. Each row represents one genomic region of peak enrichment. The columns are:

Column 1. Chromosome gives the chromosome location of region

Column 2. Start gives the start of region (inclusive)

Column 3. Stop gives the end of region (exclusive)

Column 4. Sample ID gives the sample containing the enriched region

Column 5. Interval length gives the length of the region, Stop - Start, in base pairs

Column 6. Maximum Extended Reads in Window gives the greatest number of extended reads in any of the windows of a region

Columns 9-10. Total reads in region gives the total number of non-extended reads for each sample in the given genomic region. One column for each sample.
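To make the region columns more concrete, the following Python sketch shows one simple way windows of extended reads could be turned into merged regions. It is a toy illustration under stated assumptions, not the Partek Genomics Suite peak detection algorithm: the function name call_regions is hypothetical, windows tile the chromosome without overlapping, all reads are treated as forward-strand, and a fixed min_reads cutoff stands in for the statistical test controlled by the false positive setting.

```python
def call_regions(read_starts, fragment_size, window_size, min_reads):
    """Toy peak caller returning (start, stop, length, max_reads_in_window)."""
    if not read_starts:
        return []
    last = max(read_starts) + fragment_size
    regions = []
    current = None  # [start, stop, max_count] of the region being built
    for w_start in range(0, last, window_size):
        w_stop = w_start + window_size
        # Each read is extended to cover [s, s + fragment_size);
        # count the extended reads that touch this window.
        count = sum(1 for s in read_starts
                    if s < w_stop and s + fragment_size > w_start)
        if count < min_reads:
            continue
        if current is not None and current[1] == w_start:
            # Adjacent enriched window: merge it into the open region.
            current[1] = w_stop
            current[2] = max(current[2], count)
        else:
            if current is not None:
                regions.append(current)
            current = [w_start, w_stop, count]
    if current is not None:
        regions.append(current)
    # Interval length is stop - start because stop is exclusive.
    return [(start, stop, stop - start, peak) for start, stop, peak in regions]


reads = [100, 105, 110, 120, 400]
print(call_regions(reads, fragment_size=50, window_size=50, min_reads=3))
# [(100, 200, 100, 4)]
```

In this toy input, two neighboring windows pass the cutoff and are reported as a single region spanning both, which mirrors the "Entire region, spanning all merged windows" reporting option.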

Additional Assistance

If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.

Identifying novel and known motifs

Partek Genomics Suite detects de novo motifs using the Gibbs motif sampler (Neuwald et al., 1995) and can search for known transcription factor binding sites using a database such as JASPAR.

Discover de novo motifs

Select Motif Discovery from the Peak Analysis section of the ChIP-Seq workflow
Select Discover de novo motifs
Select OK

The Detect Motifs dialog will open to allow you to configure the search (Figure 1).

Figure 1. Configuring search parameters for de novo motifs

Select 1/p-value_filtered from the Spreadsheet with genomic regions drop-down menu
Set Number of Motifs to 1
Set Discover motifs of length to 6 bp to 16 bp
Set Result file to Motifs
Select OK

Description of Motif Detection Output

Sequence Logo Window

The Sequence Logo tab (Figure 2) opens after motif detection and displays the most significant motif found in the regions listed in the source spreadsheet.

Figure 2. Viewing the binding site for NRSF. Use the blue arrows to cycle through views of all motifs found (if there is more than one). Select Reverse to view the reverse complement sequence.

The title CT.TCC..GGT.CTG. is the consensus sequence for the sequence logo. Dots represent positions that contain more than one significant base across all reads in the motif. The dots can be replaced with characters representing the possible bases at each location by selecting Show nucleotide codes. A description of the IUPAC nucleotide codes is available at the .

To view the reverse complement of the motif, select Reverse.
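For reference, the reverse complement of a consensus sequence can also be computed by hand. The short Python sketch below uses the consensus shown in the Sequence Logo title; the function name reverse_complement is hypothetical, and the dot placeholder for ambiguous positions is simply carried through unchanged.

```python
# Map each base to its complement; dots (ambiguous positions) stay as dots.
COMPLEMENT = str.maketrans("ACGT.", "TGCA.")


def reverse_complement(motif):
    """Complement every base, then reverse the sequence."""
    return motif.translate(COMPLEMENT)[::-1]


print(reverse_complement("CT.TCC..GGT.CTG."))  # .CAG.ACC..GGA.AG
```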

Motifs spreadsheet

The motif information spreadsheet (Figure 3), Motifs, lists the information about all motifs discovered during de novo Motif Detection. This includes five columns describing each motif.

Figure 3. Viewing the Motifs spreadsheet

1. Counts gives the summed counts for each base call across all occurrences of the motif in the region list as {A, C, G, T}

2. Consensus Sequence gives the consensus sequence of the motif in IUPAC nucleotide codes

3. Motif ID gives a unique ID to each discovered motif using its row in the Motifs spreadsheet

4. Log Likelihood Ratio scores the relative likelihood that the pattern did not occur by chance, with larger numbers indicating that it is less likely to have occurred by chance

5. Background frequency (A,C,G,T) gives the frequency of each of the bases in all the sequences of that motif
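The link between the Counts and Consensus Sequence columns can be sketched as follows. This is a minimal illustration, not the rule Partek Genomics Suite applies: the function name consensus, the min_fraction cutoff of 0.6, and the example counts are all hypothetical. Positions where no single base dominates are written as a dot, as in the Sequence Logo title.

```python
def consensus(counts, min_fraction=0.6):
    """Build a consensus string from per-position (A, C, G, T) counts.

    A position is written as its most frequent base when that base reaches
    min_fraction of the counts at the position, otherwise as a dot.
    """
    bases = "ACGT"
    letters = []
    for column in counts:
        total = sum(column)
        best = max(range(4), key=lambda i: column[i])
        if total and column[best] / total >= min_fraction:
            letters.append(bases[best])
        else:
            letters.append(".")
    return "".join(letters)


# Hypothetical counts for a 4 bp motif, one (A, C, G, T) tuple per position.
counts = [(0, 9, 1, 0), (1, 0, 0, 9), (3, 2, 3, 2), (0, 0, 1, 9)]
print(consensus(counts))  # CT.T
```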

You can bring up the Sequence Logo visualization of a listed motif by right-clicking on the row header and selecting Logo View from the pop-up menu.

Motif_instances spreadsheet

Figure 4. Viewing the instances spreadsheet

1-4. chromosome, start, stop, strand give the position

5. Motif ID gives the identity of the motif

6. instance gives the sequence of this instance of the motif

Search JASPAR for known motifs

Select Motif discovery from the Peak Analysis section of the ChIP-Seq workflow
Select Search for known motifs
Select OK

Before Search for known motifs runs, we need to configure the search (Figure 5).

Figure 5. Configuring a search for known motifs in the JASPAR database

Select 1/p-value_filtered (p-value filtered.txt) from the Choose Region Spreadsheet drop-down menu
Select Search using motifs specified in: for Choose Motifs to Search
Set Search using motifs specified in: to 2 (JASPAR.txt) using the drop-down menu
Set Search for to All Motifs using the drop-down menu
Set Sequence Quality >= to 0.7
Name the result file MotifSearch
Select OK

Figure 6. Progress in the motif search will display in the progress bar

Figure 7. Viewing the results of motif search

In the MotifSearch spreadsheet, each motif used in the motif search is shown. The columns detail the results of the search for each motif that was found in the reads.

1. Motif gives the name or ID of the motif

2. Probability of Occurrence gives the probability of detecting a false positive for this motif in a random DNA sequence

3. Expected Number of Outcomes gives the Probability of Occurrence multiplied by the summed length of the reads

4. Actual Number of Occurrences gives a count of sequences that match the known motif in the reads

5. p-value is the uncorrected p-value (binomial test)
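The relationship between these columns can be worked through with a small Python sketch. The numbers are invented for illustration, the function name binomial_upper_tail is hypothetical, and the sketch assumes a one-sided test asking how likely it is to see at least the observed number of occurrences by chance; the tutorial states only that a binomial test is used.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    lower = 0.0
    for i in range(k):
        # Work in log space so large n does not overflow the binomial coefficient.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        lower += math.exp(log_pmf)
    return max(0.0, 1.0 - lower)


probability_of_occurrence = 1e-4   # hypothetical chance of a random match per position
summed_length = 50_000             # hypothetical total length searched, in bp
observed = 12                      # hypothetical Actual Number of Occurrences

expected = probability_of_occurrence * summed_length
p_value = binomial_upper_tail(observed, summed_length, probability_of_occurrence)
print(f"expected occurrences: {expected:.1f}")
print(f"p-value: {p_value:.4f}")
```

With about 5 occurrences expected by chance, observing 12 gives a small p-value, which is the pattern to look for when a known motif is genuinely enriched in the regions.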

The motif_instances spreadsheet contains all instances of the motifs from the motif_summary spreadsheet in a format identical to the instances spreadsheet from de novo motif detection.

Generating a list of regions containing a motif

Select the motif_instances spreadsheet in the spreadsheet tree
Right-click the 5. Motif Name column
Select Find / Replace / Select... from the pop-up menu (Figure 8)

Figure 8. Finding all REST peaks (step 1)

Set Find What: to REST
Select By Columns for Search:
Select Only in column with 5. Motif Name selected from the drop-down menu
Select Select All (Figure 9)

Figure 9. Selecting all REST instances in motif_instances spreadsheet (step 2)

This finds and selects every instance of REST in column 5. Motif Name.

Select Close

In the motif_instances spreadsheet, the selected rows are highlighted.

Right-click on the first highlighted row visible; in this example, we see row 13196
Select Filter Include from the pop-up menu (Figure 10)

Figure 10. Filtering for selected rows

Figure 11. Filtered motif_instances spreadsheet containing 2098 instances of the REST motif

To create a spreadsheet that contains only the REST instances, we can clone the motif_instances spreadsheet while the filter is applied.

Right-click on motif_instances in the spreadsheet navigator
Select Clone... from the pop-up menu
Set the Name of resulting copy as REST
Select 1/p-value_filtered/motif_summary (MotifSearch) from the Create as a child of spreadsheet drop-down menu
Select OK
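For readers who export the motif_instances spreadsheet and want to reproduce this filter outside Partek Genomics Suite, the steps above amount to keeping only the rows whose motif name matches REST. The Python sketch below uses a few invented rows and hypothetical field names; it assumes an exact match on the motif name, whereas the Find dialog may match differently.

```python
# Hypothetical rows standing in for an exported motif_instances spreadsheet.
instances = [
    {"chromosome": "chr1", "start": 100, "stop": 121, "strand": "+", "motif_name": "REST"},
    {"chromosome": "chr1", "start": 450, "stop": 460, "strand": "-", "motif_name": "SP1"},
    {"chromosome": "chr2", "start": 900, "stop": 921, "strand": "-", "motif_name": "REST"},
]

# Equivalent of Select All followed by Filter Include: keep only REST rows.
rest_only = [row for row in instances if row["motif_name"] == "REST"]

for row in rest_only:
    print(row["chromosome"], row["start"], row["stop"], row["strand"])
```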