Detecting DNA Base Modifications

SMRT Analysis of Microbial Methylomes

Background

Microbial genomes contain a variety of base modifications, most commonly methylation at adenine or cytosine residues. These modifications typically arise from restriction-modification (RM) systems, which serve as a defense mechanism in microbes, protecting the cell from invading bacteriophages or other foreign DNA [1]. In general, an RM system comprises both a restriction enzyme (endonuclease) and a methylase that are specific to the same sequence motif. The endonuclease recognizes and cleaves at the motif in order to degrade foreign DNA. The bacterium's own DNA is protected from degradation because the same sequence motif has been methylated, which prevents cleavage by the restriction endonuclease. DNA modifications are also known to control other biological processes in bacteria, such as cell cycle and DNA replication, mismatch repair, gene expression, and pathogenicity [2,3].

Full characterization of methylomes has been challenging because the motifs targeted for modification are extremely variable within and across species of bacteria, and most species contain more than one RM system [4]. This is further complicated by bacterial conjugation, which allows horizontal transfer of mobile genetic elements between bacteria [5,6]. With SMRT Sequencing, it is now possible to discover modifications at novel sites, on a genome-wide scale, in a particular strain of bacteria [7]. SMRT Sequencing can also be used to determine the modification sites for particular methyltransferases [8].

In this document, we outline methods for base modification detection in microbial genomes using the SMRT Analysis software. We provide guidelines for sample preparation, sequencing, and downstream analyses, including the detection of modifications and subsequent motif analysis.
The document assumes an understanding of base modification detection fundamentals as described in the Pacific Biosciences White Paper Detecting DNA Base Modifications Using SMRT Sequencing.

Throughout this technical note we provide examples of each step. The data used for the examples is available on DevNet. Open the data set home page labeled Normal E. coli and then download the files Technote E coli Native Raw Reads HDF and E coli K12 MG1655 Mutated (FASTA). Import the SMRT Cell data from the Raw Reads HDF file into SMRT Portal, and import the FASTA file as a new reference sequence (see the SMRT Portal Online Help or the Secondary Analysis Web Services API for more information). You will then be ready to follow the examples in this document. Over time, new data sets that may also be used to practice the analysis process will be posted to the web site.

Methods

When designing an experiment to detect base modifications on the PacBio RS, the number and types of SMRTbell libraries and the number of SMRT Cells are dictated by: (1) the amount of input DNA available; (2) the method used to calculate interpulse duration (IPD) ratios; (3) the particular modification you are analyzing; (4) the size of the genome; and (5) whether Tet conversion is used to magnify the signature of 5-methylcytosine. These factors underlie the current coverage recommendations. For a description of basic terminology, see the Pacific Biosciences White Paper Detecting DNA Base Modifications Using SMRT Sequencing.

[Workflow figure: Experiment Design, Isolate DNA, Template Preparation, Sequencing, Analysis]
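The way these factors combine into a SMRT Cell count can be sketched with simple arithmetic. The following is a minimal planning sketch, not part of SMRT Analysis: the function name and parameters are hypothetical, and the defaults (about 100X total average coverage and roughly 100 Mb of mapped sequence per SMRT Cell) are the figures quoted in the coverage recommendations of this note.

```python
import math


def smrt_cells_needed(genome_size_mb, total_coverage=100,
                      mapped_yield_mb_per_cell=100, n_libraries=1):
    """Estimate the number of SMRT Cells for a target average coverage.

    Hypothetical helper for planning only. Defaults follow the figures in
    this note: ~100X total average coverage and ~100 Mb mapped per SMRT Cell.
    n_libraries is 2 when an amplified control library is sequenced to the
    same coverage as the native sample.
    """
    if min(genome_size_mb, total_coverage, mapped_yield_mb_per_cell) <= 0:
        raise ValueError("sizes, coverage and yield must be positive")
    if n_libraries < 1:
        raise ValueError("at least one library is required")
    required_mb = genome_size_mb * total_coverage  # mapped bases needed, in Mb
    cells_per_library = math.ceil(required_mb / mapped_yield_mb_per_cell)
    return n_libraries * cells_per_library


# 5 Mb E. coli genome, native library only
print(smrt_cells_needed(5))
# Same genome with an amplified control sequenced to the same coverage
print(smrt_cells_needed(5, n_libraries=2))
```

Actual yield per SMRT Cell varies with insert size and movie protocol, so the result is best treated as a starting estimate.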

1. The amount of input DNA available. Adequate amounts of native (i.e., unamplified) DNA are required for methylation detection, since amplification effectively erases base modifications. When working with a limited sample, consider the coverage needed for the modification of interest. Smaller insert sizes require less starting DNA to generate adequate sequence coverage for kinetic analysis. However, they will not be adequate for de novo assembly unless they are used in conjunction with a long-insert library.

2. The method used to calculate IPD ratios. Base modification analysis using kinetics relies on sequence-context normalization of the kinetic values. Currently, the primary metric used for this analysis is interpulse duration (IPD), which corresponds to the time required for a new base to bind in the active site of the sequencing polymerase after the previous base has been incorporated. We normalize by dividing the IPD in the sample of interest by the IPD of a control; the result is the IPD ratio.

The default analysis mode, and the focus of this document, is to use a polymerase kinetics computational model to calculate IPD ratios. The computational model is called the in silico control. Internal studies at PacBio have shown that the IPD for a particular base incorporation depends on a sequence context spanning approximately 12 bases, which matches the binding footprint of the DNA polymerase. Currently, SMRT Analysis only supports modification identification using the in silico control. When using the in silico control, the detection accuracy may be increased by activating modification identification in SMRT Analysis. This analysis compares the modification signal to an additional computational model (of the expected positive signature) for three modification types: 6-mA, 4-mC, and Tet-converted 5-mC.

Alternatively, IPD ratios may be calculated using an amplified control.
This control is created by separately sequencing an amplified version of the genomic sample of interest, which will have base modifications erased by the amplification process. The amplified control requires an extra sequencing library, but it can produce a lower background (statistical noise) than the in silico control as long as there is more than 80X coverage, per strand, of the amplified control library. This type of control may be useful when studying modifications other than 6-mA, 4-mC, and 5-mC. However, this type of control is not currently compatible with the modification identification feature of SMRT Analysis.

Finally, IPD ratios may be calculated by comparing two different samples. In this case, two native DNA samples are sequenced separately and then compared to each other to detect differential modifications resulting from various growth conditions, or from strain-to-strain comparisons of bacteria. Using this protocol, modifications shared between the strains will go undetected. To perform differential analysis of identified modifications and motifs, it is easiest to run separate analyses using the in silico control for each condition or strain and then compare the two outputs to each other. Using a native DNA control will help to locate regions of differential modification, but it is not currently compatible with modification identification in SMRT Analysis.

3. The particular modification you are analyzing. Coverage requirements vary with modification type, due to differences between their kinetic signatures. For example, N6-methyladenine (6-mA) and N4-methylcytosine (4-mC) provide strong kinetic signals, while the signal from native 5-methylcytosine (5-mC) is weaker. Therefore, detecting native 5-mC requires higher coverage to achieve reliable detection. However, by using a Tet enzyme to oxidize 5-mC to 5-carboxylcytosine (5-caC), the modification signal is enhanced to a level comparable to that of 6-mA and 4-mC [9].

For example, reliable detection of 6-mA, 4-mC or Tet-converted 5-mC requires approximately 25X coverage per strand. However, because of the smaller and more dispersed kinetic signature of native 5-mC, at least 10-fold higher coverage (250X) per strand is recommended for detection in that case. More information on the relationship between coverage, modification types, and the accuracy of calls can be found in the Pacific Biosciences document Detecting and Identification of Base Modifications with Single Molecule Real Time Sequencing Data.

Since SMRT Sequencing coverage across a genome closely follows the expected Poisson distribution, we recommend targeting an average of ~100X total coverage to ensure that the lowest-covered regions of the genome meet the threshold of 25X coverage per strand. Note that if an amplified control is being used, the coverage requirements apply to both libraries (i.e., the native DNA sample and the amplified DNA control sample should each be sequenced to a total average coverage of 100X).

4. The size of the genome. The size of a genome, the SMRTbell library insert size, and the coverage required to detect the modification(s) of interest directly impact the number of SMRT Cells required. As an example, the 5 Mb E. coli genome would require 4-6 SMRT Cells for reliable detection of 6-mA, 4-mC or Tet-converted 5-mC. In this example, each SMRT Cell is assumed to yield approximately 100 Mb of mapped sequence.

5. Whether you are using Tet conversion to identify 5-mC. It has recently been shown that Tet will convert 5-mC to 5-caC [9]. Used in SMRT Sequencing, this will amplify the kinetic signal and reduce the coverage required to detect 5-mC to a level similar to 6-mA or 4-mC, as noted in part 3 above. It is important to note that the current protocols for Tet conversion of 5-mC have off-target effects on 4-mC in some sequence contexts.
Therefore, in order to simultaneously detect 4-mC and 5-mC, it is necessary to run a native sample separately from a Tet-converted sample. If only 6-mA and 4-mC are of interest, only the native sample needs to be run. If only 6-mA and 5-mC are of interest, only the Tet-converted sample needs to be run.

Sample Preparation and PacBio RS Run Parameters

Base modification detection requires that libraries be constructed from native genomic DNA. If the in silico control will be used, data must be generated using Sequencing Kit 2.0 (or C2 chemistry). If an analysis will be performed with an amplified control or native control, both samples must be sequenced using the same sequencing kit version. If Tet1 conversion is performed to identify 5-mC sites, see the Shared Protocol Guidelines for Using WiseGene 5-mC Tet1 Oxidation Kit for SMRT Sequencing on Sample Net. This conversion should be done prior to SMRTbell library preparation.

There are no additional requirements for preparing SMRTbell libraries for base modification detection. Multi-molecule analysis is adequate to perform motif analysis (i.e., there is no need to generate high Circular Consensus Sequence (CCS) coverage when analyzing bacterial methylomes). Sample preparation and PacBio RS run parameters will be influenced by other elements of the experimental design, including whether a de novo or resequencing workflow is used.

SMRTbell Library Preparation

If a de novo assembly (i.e., creation of a new genome reference sequence) of the microbe is being generated as part of the same experiment, libraries of longer inserts (at least 10 kb) are recommended to support accurate assembly. For more information, see the Pacific Biosciences Technical Note - De Novo and Hybrid Assembly. For a resequencing experiment (i.e., an alignment to a known reference), the key requirement is that the resulting reads be long enough to accurately map to the reference sequence, and that enough DNA is available to construct libraries with inserts of a particular size. DNA damage repair will not affect modifications in the DNA such as 6-mA, 4-mC, 5-hmC, and 5-mC. For instructions on preparing SMRTbell libraries, see the Pacific Biosciences Guide - Template Preparation and Sequencing.

When an experiment is designed to use an amplified control for IPD ratio analysis, a separate SMRTbell library must be prepared from whole-genome amplified (WGA) genomic material using a third-party kit (e.g., REPLI-g from QIAGEN). This amplified control library will be the baseline for kinetic analyses and is run separately from the test library.

PacBio RS Run Parameters

Long inserts (greater than 3 kb) should be sequenced using a 1x90 minute movie protocol. Shorter inserts (less than 3 kb) should be sequenced using a 2x45 minute movie protocol. Longer inserts may also be sequenced using a 2x45 minute movie protocol to increase the data per SMRT Cell.

Example: To examine adenine methylation across the 5 Mb genome of E. coli strain K-12 substrain MG1655, we conducted a resequencing experiment. We created a 1 kb SMRTbell library using C2 chemistry for use in a 2x45 minute movie protocol. To target 100X coverage, we sequenced five SMRT Cells and achieved a mean coverage of 162X. The data generated is available on DevNet. To prepare 1 kb libraries, see the Pacific Biosciences Procedure & Checklist - 1 kb Template Preparation and Sequencing.

Methylome Analysis

Methylation and motif analysis is done directly in SMRT Portal v1.3.3 using a new protocol called RS_Modification_and_Motif_Analysis.1, which performs the following steps:

1. Aligns SMRT sequencing subreads to a reference genome, producing a cmp.h5 alignment file.
2. Detects variants, producing a variant GFF track and VCF track.
3. Stores data in the same format regardless of which type of normalization control (amplified, native, or in silico control) was used to calculate the IPD ratios.
4. Generates modifications.csv and modifications.gff files. These files contain statistics on the polymerase kinetics during sequencing (at every position in the genomic sample). High IPD ratio positions in these files represent locations of putative modification. Note that some modifications will have high IPD ratios at multiple sites within the 12-base polymerase footprint.
5. Analyzes the recurring context of modifications across the genome and creates a summary report of these modified motifs in motif_summary.csv.
6. Identifies locations (using the information from step 4) of 6-mA, 4-mC, and Tet-converted 5-mC and combines them with the motif information determined in step 5. The analysis outputs the combined modification and motif information in a motifs.gff file.

If modification identification is turned on, secondary kinetic variation events are removed from the modifications.gff file. For example, a secondary +5 peak for a 6-mA will be removed so that only the single identified base is represented. Two off-site peaks for Tet-converted 5-mC will be removed, and a single correct site for the modification will replace them. This clean-up process occurs for both the modifications.gff and motifs.gff files.

Note that the RS_Modification_and_Motif_Analysis.1 protocol requires a reference assembly. This reference must be uploaded to SMRT Portal before setting up the job. For experiments comprising both de novo assembly and base modification detection, the assembly must be generated first, before being used as a reference. Be aware that low-quality areas of assemblies or variable sequence regions in the known reference can result in apparent modification calls, since the true sequence context of the base calls will not match the sequence context that the in silico model expects. Use of a low-quality or highly variable reference sequence typically results in poor mapping of the sequence reads to the reference and difficulty detecting base modifications.
Regions of low Map QV are a strong indicator that reads have been mapped to a repeat region.

If using an amplified control for IPD ratio analysis, a separate SMRT Portal job must be run to align the unmodified DNA sequence to the reference. This alignment may be performed with any protocol that performs resequencing alignment, but the same reference sequence must be used for both the amplified control and the base modification analysis SMRT Portal job.

Setting Up the Job

After selecting SMRT Cells for analysis in the Design Job tab of SMRT Portal:

1. Select RS_Modification_and_Motif_Analysis.1 from the Protocol drop-down menu and click the button.

Figure 1. RS_Modification_Detection.1 Protocol Selection

2. Select the appropriate (previously-uploaded) reference from the Reference drop-down menu (see Figure 2). For instructions on how to do this, please see the SMRT Portal help section Managing reference sequences within SMRT Portal.

Figure 2. Reference Selection

3. Select Postprocessing and then enter the Job ID for the control alignment in the ControlJobID box, if an amplified control or native control was run. Note that modification identification does not currently support use of an amplified or native control. If a Control Job ID is entered and Identify Modifications is enabled, the software will ignore the Identify Modifications check box. If using the in silico control, leave the Control Job ID field blank.

Figure 3. Entering the Control Job ID

4. If using Tet-converted DNA, select Sample is Tet treated Identify M5C Tet Modifications to identify 5-mC.

Example: To analyze 6-mA in E. coli, we used the default in silico IPD normalization process with an edited version of the MG1655 clone of the K12 strain as the reference (named ecoli_mutated and available with the sequence data files on DevNet). These edits were made to correct for variants in the E. coli strain that was used as compared to the available reference sequence.

Output Files

Two reports and four data files are generated by the RS_Modification_and_Motif_Analysis.1 analysis protocol. They are available as compressed gzip (.gz) files and can be downloaded from the SMRT Portal Job Details Page (in the DATA section). The two reports are called 1) Modifications and 2) Motifs. Modifications indicates which bases have a modification via two graphics (see Figure 4).

Figure 4. Modifications Report

The Modification QV vs Coverage scatterplot will show modified bases as distinct clouds (which have a higher than expected modification QV at a particular coverage). In the example figure above, the red adenines are distinctly separate. A similar indication of adenine methylation is given by the Modification QV Histogram, where the red line for the adenine bases is again distinct from the other three bases. Note that these data do not incorporate information from identifying specific types of base modification, so kinetic signatures that spread over multiple bases will cause several clouds or lines to diverge from the background. That signal is evidenced above in the bulge of higher modification QV for the C, G, and T bases, due to the +5 secondary peak in many contexts of 6-mA. This would also arise with 5-mC, for example, because the strongest kinetic signals are two and six bases from the site of modification. Note that in the case of modified 5-mC, a distinct separation of the C base will not necessarily appear, because there is little signal at the site of modification. The largest signal is two bases away from the site of modification. Both graphics display single-site modification QVs and only incorporate kinetics information, not the modification identity.

The Motifs report is a summary table of the motifs, and is described in more detail in the Performing Motif Analysis section below. See Figure 13 below.

The four data file outputs are as follows:

The modifications.csv file is a comma-separated values (CSV) file (see Table 1 and Table 3) with statistical analysis of each position in the reference. It is intended to allow additional follow-up analysis for every genomic position. Note that when analyzing the subreads, all IPDs for a subread are normalized by the mean IPD of that subread, which handles read-to-read variation in IPDs.
This file is also produced when motif analysis is not active, such as with the RS_Modification_Detection analysis protocol.

The modifications.gff file is a General Features Format (GFF) file (see Table 2), a text file formatted for graphical sequence viewers. It is used for motif analysis and for modification visualization in SMRT View. It includes sequence contexts only for sites of putative modification, defined as positions with p-values of 0.01 or less, which indicate that the IPD ratio at the position is significantly different from the expected background. This file is also produced when motif analysis is not active. With modification identification turned on, secondary kinetic variation events will be removed from this file; for example, if a 6-mA with a secondary +5 signal is identified, the secondary signal will be removed.

The motif_summary.csv file contains the genome-wide summary of the methyltransferase recognition motifs discovered in this sample.

The motifs.gff file is similar to the modifications.gff file, but is produced only when motif analysis is active. This file contains information about all sites detected as modified, all locations of a discovered motif, and the overlap between the modifications and motifs. With modification identification turned on, secondary peaks are removed as in the modifications.gff file.

Many genomic viewers will be able to open the GFF files after a small edit to ensure that the reference sequence identifiers in the viewer match the sequence identifiers in the GFF. SMRT View, covered in the next section, has been enhanced to take advantage of specific features of the files produced by this analysis protocol.

Table 1: Fields included in the modifications.csv file when using the in silico control.

Column: Description
refid: Reference sequence tag for this observation. Same as Seqid in the .gff file. This is an internal identifier for a reference FASTA sequence. A mapping of this ID to the labels in the FASTA file is contained in the reference.info.xml file stored with the reference file on the SMRT Portal server.
Tpl: Template position, starting at 1.
Strand: Native sample strand where kinetics were generated. 0 is the strand of the original FASTA and 1 is the reverse complement of the strand. Note that in the .gff file these are marked "+" and "-" respectively.
Base: The letter representing this base: A, C, T, G.
Score: -10 log (p-value) score for the detection of this event. Analogous to a Phred quality score. A value of 20 is the minimum default threshold for this file, and corresponds to a p-value of 0.01. A score of 30 corresponds to a p-value of 0.001.
tmean: Capped mean of IPDs observed at this position. Capped means that outlier data points are removed, which reduces the impact of random polymerase pausing events. Numerator of IPD ratio.
terr: Capped standard error of IPDs observed at this position (standard deviation / sqrt(coverage)).
ModelPrediction: Normalized mean IPD predicted by the in silico control model for this sequence context. Denominator of IPD ratio.
ipdratio: tmean/modelprediction.
Coverage: Count of valid IPDs at this position (see Filtering section for details).

Table 2: Fields included in the modifications.gff file.

Column: Description
Seqid: Reference tag (e.g. ref00001). Same as refid in the .csv file.
Source: Name of tool: "kinmodcall".
Type: Modification type. A generic tag "modified_base" is used for unidentified bases. For identified bases, m6a, m4c, and m5c are used.
Start: Location of modification.
End: Location of modification.
Score: -10 log (p-value) score for the detection of this event. Analogous to a Phred quality score. A value of 20 is the minimum default threshold for this file, and corresponds to a p-value of 0.01. A score of 30 corresponds to a p-value of 0.001.
Strand: Native sample strand where kinetics were generated. + is the strand of the original FASTA and - is the reverse complement of the strand. Note that in the .csv file these are marked "0" and "1" respectively.
Phase: Not applicable.
Attributes: Contains extra fields. IPDRatio is the traditional IPD ratio, context is the reference sequence from -20 bp to +20 bp around the modification with the base at this location as the 21st character, and coverage is the sequencing coverage of that position. Context is always written in 5' -> 3' orientation of the template strand.
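The Score column in Tables 1 and 2 is defined as -10 log (p-value), so scores and p-values can be converted back and forth, and a GFF feature line can be split into the nine columns of Table 2. The sketch below assumes tab-separated GFF lines; the helper names and the example feature line are hypothetical and were made up for illustration, not taken from real output.

```python
import math

# Column names follow Table 2, lower-cased for use as dictionary keys.
GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")


def qv_to_pvalue(score):
    """Convert a -10*log10(p) score into the p-value it encodes."""
    return 10 ** (-score / 10)


def pvalue_to_qv(p_value):
    """Convert a p-value into a -10*log10(p) score."""
    return -10 * math.log10(p_value)


def parse_gff_line(line):
    """Split one tab-separated GFF feature line into a dict (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["score"] = float(record["score"])
    return record


# Hypothetical feature line, for illustration only.
example_line = "ref00001\tkinmodcall\tm6a\t1000\t1000\t45\t+\t.\tcoverage=99;IPDRatio=3.71"
record = parse_gff_line(example_line)
print(record["type"], record["start"], record["strand"])
print(qv_to_pvalue(record["score"]))
```

A filter such as `record["score"] >= 20` then corresponds to keeping sites with p-values of 0.01 or less, which is the default threshold described for this file.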

Table 3: Fields included in the modifications.csv file when using an amplified control or native control.

Column: Description
refid: Reference sequence tag for this observation. Same as Seqid in the GFF file. This is an internal identifier for a reference FASTA sequence. A mapping of this ID to the labels in the FASTA file is contained in the reference.info.xml file stored with the reference file on the SMRT Portal server.
Tpl: Template position, starting at 1.
Strand: Native sample strand where kinetics were generated. 0 is the strand of the original FASTA and 1 is the reverse complement of the strand. Note that in the .gff file these are marked "+" and "-" respectively.
Base: The letter representing this base: A, C, T, G.
Score: -10 log (p-value) score for the detection of this event. Analogous to a Phred quality score. A value of 20 is the minimum default threshold for this file, and corresponds to a p-value of 0.01. A score of 30 corresponds to a p-value of 0.001.
casemean: Mean of case IPDs observed at this position. Numerator of IPD ratio.
controlmean: Mean of control IPDs observed at this position. Denominator of IPD ratio.
casestd: Standard deviation of case IPDs observed at this position.
controlstd: Standard deviation of control IPDs observed at this position.
ipdratio: casemean/controlmean.
teststatistic: t-statistic of two-sample t-test.
coverage: Mean of case and control coverages.
controlcoverage: Count of valid IPDs in control sequence at this position (see Filtering section of SMRT pipe documentation for details).
casecoverage: Count of valid IPDs in case sequence at this position (see Filtering section of SMRT pipe documentation for details).
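To make the relationships among the Table 3 columns concrete, the sketch below computes them for one position from two short lists of IPD values. It is an illustration under stated assumptions, not the SMRT Analysis implementation: the function name is hypothetical, the IPD values are invented, the sample standard deviation is used, and Welch's unequal-variance form of the two-sample t-test is assumed because the table does not say which variant is applied.

```python
import math
import statistics


def case_control_stats(case_ipds, control_ipds):
    """Compute Table 3 style statistics for one position (illustrative only)."""
    if len(case_ipds) < 2 or len(control_ipds) < 2:
        raise ValueError("need at least two IPDs in each sample")
    case_mean = statistics.mean(case_ipds)
    control_mean = statistics.mean(control_ipds)
    case_std = statistics.stdev(case_ipds)        # sample standard deviation (assumption)
    control_std = statistics.stdev(control_ipds)
    # Welch's t-statistic (assumed variant of the two-sample t-test)
    standard_error = math.sqrt(case_std ** 2 / len(case_ipds)
                               + control_std ** 2 / len(control_ipds))
    if standard_error == 0:
        raise ValueError("no variance in either sample; t-statistic undefined")
    return {
        "casemean": case_mean,
        "controlmean": control_mean,
        "casestd": case_std,
        "controlstd": control_std,
        "ipdratio": case_mean / control_mean,
        "teststatistic": (case_mean - control_mean) / standard_error,
        "casecoverage": len(case_ipds),
        "controlcoverage": len(control_ipds),
        "coverage": (len(case_ipds) + len(control_ipds)) / 2,
    }


# Invented IPD values: the native (case) sample is about twice as slow as the control.
stats = case_control_stats([2.0, 2.2, 1.8, 2.0], [1.0, 1.1, 0.9, 1.0])
print(stats["ipdratio"], stats["teststatistic"])
```

An IPD ratio well above 1 combined with a large positive t-statistic is the pattern expected at a modified position in the native sample.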

Table 4: Fields included in the motifs.gff file (generated only when motif analysis is active).

Column: Description
Seqid: Reference tag (e.g. ref00001). Same as refid in the .csv file.
Source: Name of tool: "kinmodcall".
Type: Modification type. A generic tag "modified_base" is used for unidentified bases. For identified bases, m6a, m4c, and m5c are used. A "." indicates a site where methylation was expected, but was below the significance threshold during the initial kinetics analysis. This suggests a site that is possibly being demethylated in the genome.
Start: Location of modification.
End: Location of modification.
Score: -10 log (p-value) score for the detection of this event. Analogous to a Phred quality score. A value of 20 is the minimum default threshold for this file, and corresponds to a p-value of 0.01. A score of 30 corresponds to a p-value of 0.001. This is the Modification QV for the statistical event detection at this position only, not for the identification. In the case of a multi-site kinetic variation event, such as with Tet-converted 5-mC, it is likely that this score will be very low and the identificationqv (in the Attributes field, below) will contain a higher score that incorporates the full multi-site signal.
Strand: Native sample strand where kinetics were generated. + is the strand of the original FASTA and - is the reverse complement of the strand. Note that in the .csv file these are marked "0" and "1" respectively.
Phase: Not applicable.
Attributes: Contains extra fields. IPDRatio is the traditional IPD ratio, context is the reference sequence from -20 bp to +20 bp around the modification with the base at this location as the 21st character, and coverage is the sequencing coverage of that position. Context is always written in 5' -> 3' orientation of the template strand. The id attribute is added with the complete double-strand methyltransferase motif. The motif attribute is added with the single-strand methyltransferase motif for the modification described in this row of the file. If the motif is palindromic, then the id and motif attributes will be the same. identificationqv is the score for the identification call, if applicable. This is a separate statistical test from the Score field above.

Example: context=atacgccggccataatggcgatcgacattttctcgccacgg;motif=gatc;coverage=99;ipdratio=3.71;id=GATC;identificationQv=174

Table 5: Fields included in motif_summary.csv (generated only when motif analysis is active).

Column: Description
motifstring: Detected motif sequence for this site, such as GATC.
centerpos: Position in motif of modification (0-based).
fraction: The percent of time this motif is detected as modified in the genome. (Fraction of instances of this motif with modification QV or identification QV above the QV threshold.)
ndetected: Number of instances of this motif that are detected as modified. (Number of instances of this motif with modification QV or identification QV above threshold.)
ngenome: Number of occurrences of this motif in the reference sequence genome.
grouptag: A name identifying the complete double-strand recognition motif. For paired motifs this is <motifstring1>/<motifstring2>, for example GAGA/TCTC. For palindromic or unpaired motifs this is the same as motifstring.
partnermotifstring: motifstring of paired motif (motif with reverse-complementary motifstring).
meanscore: Mean Modification QV of instances of this motif that are detected as modified.
meanipdratio: Mean IPD ratio of instances of this motif that are detected as modified.
meancoverage: Mean coverage of instances of this motif that are detected as modified.
objectivescore: Score of this motif in the motif finder algorithm. The algorithm considers higher objective scores to be more confidently identified motifs in the genome based on several factors.
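The Attributes example from Table 4 and the grouptag convention from Table 5 can be explored with a few lines of code. This is a minimal sketch with hypothetical helper names; it lower-cases attribute keys because the capitalization in real files may differ from the example shown here, and its reverse-complement helper only handles plain A/C/G/T motifs, not degenerate bases.

```python
def parse_attributes(attribute_field):
    """Split a 'key=value;key=value' attribute string into a dict.

    Keys are lower-cased so that lookups do not depend on capitalization.
    """
    attrs = {}
    for item in attribute_field.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            attrs[key.strip().lower()] = value.strip()
    return attrs


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a plain A/C/G/T sequence (no degenerate bases)."""
    return seq.translate(_COMPLEMENT)[::-1]


def group_tag(motif):
    """Name the double-strand motif: the motif itself if palindromic, else motif/partner."""
    partner = reverse_complement(motif)
    return motif if partner == motif else f"{motif}/{partner}"


# Attributes example from Table 4
example = ("context=atacgccggccataatggcgatcgacattttctcgccacgg;motif=gatc;"
           "coverage=99;ipdratio=3.71;id=GATC;identificationQv=174")
attrs = parse_attributes(example)
modified_base = attrs["context"][20]  # the 21st character is the modified position

print(modified_base, attrs["motif"], attrs["coverage"])
print(group_tag("GATC"), group_tag("GAGA"))
```

In this example the 21st context character is an adenine sitting inside a GATC site, which is consistent with the motif and id attributes on the same row.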

Visualizing Sites of Base Modification with SMRT View

Figure 5. Example of a SMRT View Screen.

The motifs.gff output file, along with modification summary data, can be viewed in SMRT View by clicking on the Tachometer icon in the top tool bar of SMRT View.

Figure 6. Tachometer Icon in SMRT View Tool Bar

There are several visualization tracks:

1. modifications (regions): This track displays a bar chart for both the plus and minus strands, denoting the number of modification events within a 10 kb window. It is a stacked graph, which also displays the type of modification in each 10 kb window.

Figure 7. Modification Regions Showing Number of Modification Events

2. motifs: For each motif detected in the analysis, a sub-track is created to visualize the sites in the genome that are methylated within that motif. A marker is placed on this track denoting the strand on which the modification was detected. If a modification falls outside of one of the discovered motifs, it is placed in a track called others with a label indicating the 7-base context. This context is always represented in 5' to 3' orientation on the template strand. The content of this track is located in the motifs.gff file.

Figure 8. Marker Showing the Strand Where Event Was Detected

3. Kinetogram: This is a reference sequence view where both strands are shown, with a bar chart displaying the IPD ratio at each position for each strand (see Figure 9). It is visible only when the view is zoomed. It is important to note that the strands are displayed in template space, not in read space. An adenine (A) shown in the Kinetogram track and in the subread alignment below it is the adenine in the sequencing template. This means that if a higher IPD ratio value is shown above an A position, it indicates a potentially modified adenine in the original template; it will generally correspond to an identified 6-mA in a motif in the modification events track above the Kinetogram. In read space, the actual measured IPD corresponds to an incorporation of the complement base T, which is the base and kinetic information stored in the FASTA file and bas.h5 file.

The data in the Kinetogram is a ratio of the mean IPD of the genome of interest to the mean IPD of the control at the given position. Hence the IPD ratio values have a baseline at 1. An IPD ratio greater than 1 means that the sequencing polymerase slowed down (relative to the control) at this base position, while an IPD ratio less than 1 indicates an increase in polymerase speed. The data for this track is pulled directly from the cmp.h5 alignment file generated during the analysis job.

Figure 9. Kinetogram Displaying the IPD Ratio at Each Position for Each Strand

4. Aligned reads: By default, this area is hidden. Pressing the View Reads icon in the Details Panel title bar will show the subread alignment. This area displays Variants and Base QV by default. Coverage is displayed on the left, separated by strand (see Figure 10). Using the Preferences menu option, it is possible to switch to other subread data in this panel, including the raw IPD and raw Pulse Width.

Figure 10. Heat Map of Raw IPDs Separated by Strand
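The template-space versus read-space distinction, and the reading of IPD ratios around the baseline of 1, can be summarized in a short sketch. The helper names and return strings below are hypothetical and exist only to illustrate the two conventions described for the Kinetogram.

```python
# Watson-Crick pairs used to move between template space and read space.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def read_space_base(template_base):
    """Return the base incorporated by the polymerase opposite a template base."""
    return _COMPLEMENT[template_base.upper()]


def describe_ipd_ratio(ipd_ratio):
    """Interpret an IPD ratio relative to the baseline of 1."""
    if ipd_ratio > 1:
        return "slower than control"
    if ipd_ratio < 1:
        return "faster than control"
    return "same as control"


# A template adenine is observed in read space as an incorporated T.
print(read_space_base("A"))
print(describe_ipd_ratio(3.71))
```

An elevated IPD ratio drawn above a template A therefore reflects a slow incorporation of T in the read.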

SMRT View visualization sessions can be saved to a file using the File/Save Project menu command. A project file can be shared with other users (see endnote i).

Importing Relevant Annotations

Visualizing base modification data in the context of relevant annotation tracks, such as genes, CpG islands, etc., can help relate putative modification events to biological function. SMRT View can import many types of data files. For more information, see the Pacific Biosciences SMRT View Help menu.

Using the Table Browser

The Table Browser can be used to look at the highest confidence modification events. By default the table is sorted by QV (score).

Figure 11. Table Browser View

For modifications that occur at a discovered motif, the Feature label will be the motif. Modifications that pass the confidence threshold will be labeled either by the type of modification (m6a, m4c, m5c) or as modified_base when an identification was not possible. Unmodified instances of the motif have the Type unknown. Double-clicking one of the events listed in the Table Browser will zoom in on that region in the graphical view.

By entering a feature label or type in the search box, you can quickly find modification events of interest. By pressing the eye icon next to the search box, the events displayed in the main window will also be filtered. For example, try typing "unknown" in the search box without the quote marks. This will list only motif sites that were not detected as modified. Now press the eye icon in the main screen; you will notice that the motif tracks hide most of the motif sites and show only those with an unknown Type, in other words, those that did not meet a confidence threshold to be declared as modified. These may correlate with other interesting genomic regions.

Figure 12. Closer View of a Region with Motif

Performing Motif Analysis

Since methylation in bacteria generally occurs at specific sequence motifs that are recognized by methyltransferases, a genome-wide analysis of the modified motifs is critical to understanding the bacteria being studied.

A default SMRT Portal analysis job will provide several forms of motif analysis. The first is the Motif Report in SMRT Portal. The second is a file download called motif_summary.csv (see Table 5) that contains similar information and can be easily opened in a spreadsheet program. The third is motifs.gff (see Table 4), which shows all of the sites in the genome that are methylated, all of the sites in the genome with one of the discovered motifs, and the overlap between the methylation and the motifs. The fourth is SMRT View, which allows easy visualization of the information contained in these files.
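The overlap between methylation calls and motif sites can also be summarized outside of SMRT Portal. The following Python sketch assumes standard nine-column, tab-separated GFF records in which the type column holds m6a, m4c, m5c, modified_base, or unknown as described above. The attribute key motif and the example records are assumptions made for illustration; check them against your own motifs.gff before relying on this.

```python
from collections import defaultdict


def parse_gff_line(line):
    """Split one tab-separated GFF record into the fields used below."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    # Attributes are assumed to be key=value pairs separated by semicolons.
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
    return {"type": cols[2], "strand": cols[6], "attrs": attrs}


def motif_fractions(lines):
    """Fraction of sites per motif whose type is anything other than 'unknown'."""
    totals = defaultdict(int)
    modified = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF header/comment lines
        record = parse_gff_line(line)
        motif = record["attrs"].get("motif")  # hypothetical attribute key
        if motif is None:
            continue  # event outside every discovered motif
        totals[motif] += 1
        if record["type"] != "unknown":
            modified[motif] += 1
    return {motif: modified[motif] / totals[motif] for motif in totals}


# Made-up records for illustration only.
EXAMPLE_LINES = [
    "##gff-version 3",
    "ref1\texample\tm6a\t100\t100\t85\t+\t.\tmotif=GATC;coverage=60",
    "ref1\texample\tm6a\t102\t102\t90\t-\t.\tmotif=GATC;coverage=58",
    "ref1\texample\tunknown\t500\t500\t.\t+\t.\tmotif=GATC",
    "ref1\texample\tunknown\t700\t700\t.\t-\t.\tmotif=GATC",
    "ref1\texample\tmodified_base\t900\t900\t45\t+\t.\tcoverage=40",
]

print(motif_fractions(EXAMPLE_LINES))
```

Counting sites with Type unknown as unmodified mirrors the Table Browser behavior described above.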

Figure 13. The Motif Report in SMRT Portal

The motif report in SMRT Portal, which is similar in content to the motif_summary.csv file, is a high-level summary of the motifs discovered in this genome. For each motif detected, statistics are shown describing how often it is detected as methylated in this genome. If you have very low coverage, you may be able to detect the motifs but you will also see a lower value for percentage detected. In cases where a non-palindromic double-stranded methyltransferase recognition site is detected, a Partner Motif is indicated, which is the complementary (reverse-complement) motif.

Example: In our sample E. coli genome, two motifs are modified almost all of the time: GATC and AACNNNNNNGTGC/GCACNNNNNNGTT. GATC is detected as modified over 99% of the time. The second motif is non-palindromic, so it is split into two rows in the motif report, AAC and GCA, which are modified over 98% of the time. As expected, these paired motifs both occur in the genome the same number of times. If the paired motifs are considered together, they are modified ( )/( ) = 98.5% of the time. We also saw three additional motifs modified about 50% of the time. These motifs occur due to a very low detection rate of 5-mC in the Dcm methyltransferase recognition motif, CCWGG. In this case, the motifs have been expanded so that the fraction methylated is high enough for the motif analysis software to call a valid motif. For this genome, the presence of these motifs suggests that an additional sequencing run using Tet-converted DNA should be performed in order to also discover the recognition sites for methyltransferases that create 5-mC modifications.

If the list of motifs is different from what was expected, or if further refinement of the list is desired, consider re-analyzing the data using a different modification QV threshold for the motif analysis. By default, any modification with a modification QV (confidence score) of 40 or higher is included.
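To get a feel for how the modification QV threshold changes which events enter the motif analysis, the selection step can be sketched in Python. This is an illustration with made-up scores, not the motif analysis software itself; the function name and the event list are hypothetical.

```python
def events_passing(events, min_qv=40):
    """Keep (base, qv) events whose modification QV is at or above min_qv.

    The default of 40 matches the default threshold stated in the text.
    """
    return [event for event in events if event[1] >= min_qv]


# Hypothetical (base, modification QV) pairs.
EVENTS = [
    ("A", 120), ("A", 75), ("A", 62),
    ("C", 41), ("G", 33), ("T", 27), ("C", 22),
]

for threshold in (25, 40, 60):
    kept = events_passing(EVENTS, threshold)
    bases = sorted({base for base, _ in kept})
    print(f"QV >= {threshold}: {len(kept)} events, bases {bases}")
```

With these made-up numbers, a lower threshold admits more events from every base, while a higher threshold leaves only the strongly scoring adenines.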
By looking at the Modifications report (Figure 4), it is possible to choose a different threshold to include more data in the motif analysis. There are two options for re-analyzing motifs with different thresholds. The first is to re-run the job in SMRT Portal. The setting is found in the Protocol Settings ( button) when making a new analysis job (at the bottom of the Postprocessing tab). This requires re-running the entire analysis, including sequence alignment. The alternative is to download the motifmaker command-line Java program from GitHub. This is an open-source, unsupported version of the same software that analyzes data in SMRT Portal. It is quicker to run because it uses the output of the SMRT Portal modification analysis as an input (i.e., it starts after the slowest part of the analysis has already been performed). However, it may be more complicated to run since it is a command-line program. Additional advanced tools for motif analysis are also available on DevNet.

Example: In our sample E. coli data, we see that the Modifications report has strong methylation only on adenines, which is consistent with the Motif report. In the Modification QV Histogram, the red line representing adenine does not trend with the other bases, and it separates at a Modification QV of approximately 60. One option is to re-run the analysis with a new threshold of 60. It is interesting to see how the motif report changes with the new threshold. Try a lower setting, such as 25, to examine the resulting effect.

Another component of motif analysis is the association between methyltransferase recognition motifs and the methyltransferase gene that modifies that site. For example, creating this association is important for introducing a particular modification into another bacterium or knocking it out of the one just sequenced. A common place to begin this analysis, before proceeding to lab validation, is to visit the organism

database at REBASE. This site lists a variety of bacterial genomes and has information on predicted and verified RM genes in each genome, plus predicted recognition domains for each methyltransferase or restriction endonuclease found in those genomes. By comparing the list generated by SMRT Portal to the predicted list in REBASE, it is often possible to determine which genes are active and responsible for the modifications you have discovered in your bacterial strain.

Conclusion

SMRT Sequencing allows genome-wide, single-base-resolution study of bacterial methylation, including 6-mA, 5-mC and 4-mC. The analysis of these modifications, along with motif analysis, is possible today using SMRT Analysis, and will continue to become more accessible and automated.

References

1. Geoffrey G. Wilson. Organization of restriction-modification systems. Nucleic Acids Research (1991) 19(10).
2. Zweiger et al. A Caulobacter DNA methyltransferase that functions only in the predivisional cell. J Mol Biol, Jan 1994; 235(2).
3. Srikhanta et al. The phasevarion: phase variation of type III DNA methyltransferases controls coordinated switching in multiple genes. Nature Reviews Microbiology, March 2010.
4. Murray et al. The Methylomes of Six Bacteria. Nucleic Acids Research, Sept.
5. Ryan KJ, Ray CG (editors). Sherris Medical Microbiology (4th ed.). McGraw Hill (2004).
6. Russi et al. "Molecular Machinery for DNA Translocation in Bacterial Conjugation". Plasmids: Current Research and Future Trends. Caister Academic Press (2008).
7. Flusberg et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nature Methods 7.
8. Clark et al. Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing. Nucleic Acids Research (2011).
9. Yu et al. Base-Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome. Cell, Volume 149, Issue 6, 17 May.

Endnotes

i Adjusting the Region view in SMRT View: To change the regional modification density histogram to a heat map, press the x key. This will make it easier to see the less frequent types of modification. Press x again to change it back.

Adjusting the Details view in SMRT View: There are several keyboard commands that can be used to change the size of letters displayed in the heat map view, which may make it easier to see at various zoom levels. To zoom to base resolution, click the Zoom to Base toolbar icon (a magnifying glass over an A). The keyboard commands are:

R or r: make the IPD ratio bar chart larger or smaller, respectively.
D or d: make the nucleotide letters larger or smaller.
T or t: make the annotation tracks larger or smaller.
Left arrow or right arrow: After clicking in the top or bottom pane to select that pane, move the view to the left or right.
Up arrow or down arrow: After clicking on a base modification event in the Details view, move to the next or previous event in that track.

Additionally, moving the mouse over an event will show additional details from the GFF file, such as the IPD ratio, score, coverage, and sequence context.

For Research Use Only. Not for use in diagnostic procedures. Copyright 2012, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT and SMRTbell are trademarks of Pacific Biosciences in the United States and/or certain other countries. All other trademarks are the sole property of their respective owners.