MiSeq: Imaging and Base Calling



Similar documents
Introduction to NGS data analysis

Introduction to next-generation sequencing data

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

How Sequencing Experiments Fail

Illumina Sequencing Technology

Copyright 2007 Casa Software Ltd. ToF Mass Calibration

PreciseTM Whitepaper

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

Structural Health Monitoring Tools (SHMTools)

Sequencing Analysis Software User Guide

Bioruptor NGS: Unbiased DNA shearing for Next-Generation Sequencing

NGS data analysis. Bernardo J. Clavijo

Next Generation Sequencing

TruSeq Custom Amplicon v1.5

G E N OM I C S S E RV I C ES

Core Facility Genomics

10g versions followed on separate paths due to different approaches, but mainly due to differences in technology that were known to be huge.

July 7th 2009 DNA sequencing

Sanger Sequencing and Quality Assurance. Zbigniew Rudzki Department of Pathology University of Melbourne

Real-Time PCR Vs. Traditional PCR

agucacaaacgcu agugcuaguuua uaugcagucuua

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

4. GS Reporter Application 65

The Effect of Network Cabling on Bit Error Rate Performance. By Paul Kish NORDX/CDT

Data Analysis for Ion Torrent Sequencing

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Products - Microarray Scanners - SpotLight Two-Color Microarray Fluorescence Scanners New! SpotLight 2 now available!

GETTING STARTED WITH LABVIEW POINT-BY-POINT VIS

Alignment and Preprocessing for Data Analysis

Analysis of ChIP-seq data in Galaxy

MiSeq Reporter Generate FASTQ Workflow Guide

Technical Note. Roche Applied Science. No. LC 19/2004. Color Compensation

Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka.

Lecture 10: Regression Trees

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

COMPENSATION MIT Flow Cytometry Core Facility

MI Software. Innovation with Integrity. High Performance Image Analysis and Publication Tools. Preclinical Imaging

DNA Sequence Analysis

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

History of DNA Sequencing & Current Applications

Searching Nucleotide Databases

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Analyzing A DNA Sequence Chromatogram

Current Probes, More Useful Than You Think

Nonlinear Iterative Partial Least Squares Method

NHA. User Guide, Version 1.0. Production Tool

Blind Deconvolution of Barcodes via Dictionary Analysis and Wiener Filter of Barcode Subsections

Background Information

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

Proton Nuclear Magnetic Resonance Spectroscopy

Cluster Generation. Module 2: Overview

Introduction. What Can an Offline Desktop Processing Tool Provide for a Chemist?

ONYX. Preflight Training. Navigation Workflow Issues Preflight Tabs

Imaging and Bioinformatics Software

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

DNA Detection. Chapter 13

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS

A Brief Guide to Interpreting the DNA Sequencing Electropherogram Version 3.0

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

SSO Transmission Grating Spectrograph (TGS) User s Guide

The Power of Next-Generation Sequencing in Your Hands On the Path towards Diagnostics

White Paper. "See" what is important

Computational Genomics. Next generation sequencing (NGS)

Module 13 : Measurements on Fiber Optic Systems

PCM Encoding and Decoding:

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

1 st day Basic Training Course

Advances in RainDance Sequence Enrichment Technology and Applications in Cancer Research. March 17, 2011 Rendez-Vous Séquençage

UPS System Capacity Management Configuration Utility

SEQUENCING. From Sample to Sequence-Ready

Using NIST Search with Agilent MassHunter Qualitative Analysis Software James Little, Eastman Chemical Company Sept 20, 2012.

Biological Neurons and Neural Networks, Artificial Neurons

How is genome sequencing done?

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions

Real-time PCR: Understanding C t

SRA File Formats Guide

Automated Lab Management for Illumina SeqLab

A Method Using ArcMap to Create a Hydrologically conditioned Digital Elevation Model

Quick Start Guide Chromeleon 7.2

Optimizing IP3 and ACPR Measurements

A bachelor of science degree in electrical engineering with a cumulative undergraduate GPA of at least 3.0 on a 4.0 scale

Notice. DNA Sequencing Module User Guide

NGS Technologies for Genomics and Transcriptomics

Spectrophotometry and the Beer-Lambert Law: An Important Analytical Technique in Chemistry

NORTH PACIFIC RESEARCH BOARD SEMIANNUAL PROGRESS REPORT

BD FACSComp Software Tutorial

A Proposal for OpenEXR Color Management

Sampling Theorem Notes. Recall: That a time sampled signal is like taking a snap shot or picture of signal periodically.

Understanding Mixers Terms Defined, and Measuring Performance

Understanding the DriveRack PA. The diagram below shows the DriveRack PA in the signal chain of a typical sound system.

Artisan Scientific is You~ Source for: Quality New and Certified-Used/Pre:-awned ECJuiflment

Transcription:

MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please take a moment to familiarize yourself with the navigation for this course. You can access these instructions by clicking Help in the top right corner of the page. This course is presented by Jeremy Peirce, Staff Field Applications Scientist for Illumina. Sequencing on the MiSeq, as with other Illumina Sequencers, begins with sample preparation. A DNA library is prepared with adapters A and B, one on either end, of the insert to be sequenced. Following this, clusters are formed and a flow cell is placed on the MiSeq. Images are then taken using the MiSeq image control software and images are processed to extract intensities and then base calls. Following extraction of base calls, the MiSeq provides additional data analysis tools using the MiSeq Data Reporter software. MiSeq Sequencing Workflow Again, the MiSeq Sequencing workflow begins with image processing and intensity extraction from those images followed by base calling based on the intensities. The latter two pieces of this process are controlled by RTA (Real Time Analysis) software. Which is a program launched automatically by the MiSeq Control Software. RTA has two functions analysis of images and calling of bases. The input to image analysis is of course the tiff images and the outputs are intensities and cluster positions encoded in cluster intensity files and compressed location files. Base callings input are the compressed intensity files and the outputs are base call files which are a binary representation of base calls and quality scores. Image Analysis Images Generated on the Instrument Image Analysis To begin with images are generated on the instrument. MCS the MiSeq Control Software controls image generation for the MiSeq and images are generated with four images per tile per cycle. One cycle includes the chemical addition and imaging of one base for each cluster on the flow cell. For imaging purposes the MiSeq flow cell is broken up into areas or tiles. For each tile in each cycle an image is taken for every base.

Image Analysis Input and Output Clusters are Bright Spots in the Image Image analysis inputs are the tiff images and the RunInfo.xml, which contains information about the run. These are processed through image analysis to extract intensities and locations of clusters. Clusters are bright spots in the image. They represent approximately a thousand copies of the same DNA strand in a one micron spot. Each image then represents the florescence in one of four channels representing the four possible nucleotides; G, A, T or C. Spots that produce signal in the C channel, for instance, which represents the C nucleotides in a given cycle will also have some signal in the A channel due to a phenomenon known as spectral overlap. As can be seen in the diagram at right, the A and the C channels emissions spectral are somewhat overlapping. So that an image taken in the A channel will have some signal from the C channel and vice versa. Multi Cycle Detection of Cluster Positions We use multi cycle detection to differentiate cluster positions over a series of 4 cycles, the first four cycles imaged on the MiSeq. Multi cycle cluster detection improves our positioning and resolution of clusters. In the example here, it is difficult to resolve the cluster pictured in one cycle only because when there are overlapping clusters with the same base the cluster will simply look more like an oval cluster rather than two clusters with the same base. However, if we use multiple cycles it is easy to see the difference between the two clusters when they do not have the same base in the same cycle. By detecting clusters positions in multiple cycles then, there is a better chance of resolving overlapping clusters. Since they are less likely to be of the same base over the four cycles used to detect the cluster positions. Cluster Detection Algorithm The cluster detection algorithm looks for the maximum intensity area of the cluster and also takes a measure of the threshold background noise level on the flow cell. When there a two clusters close to each other integrating over the entire area of the cluster, it would make it difficult to distinguish between the two clusters. So we take a thinner slice from the center of the maximum intensity region of the cluster. We use sub pixel interpolation to generate the position of the best signal for the cluster. In this case, the interpolated position for best signal is between 77 and 78 pixels. If we were to choose 77 or 78 pixels, rather than interpolating between them we might not get the best intensity for that cluster. Accounting for sub pixel interpolation results in more detected clusters and more accurate error rates. Cluster Detection Algorithm After all clusters are detected in this way the cluster intensities are extracted and stored.

Intensity Extraction Algorithm Process of Extracting Intensity Input and Output Viewed another way the intensity extraction algorithm defines an integration area. As a top view, you can see this integration area is essentially at the center of a given cluster in the x and y. The intensity is then extracted from this area for all four bases from the four noted images. In this case A has a much higher intensity than the three other possibilities. Note that C is a little higher than G or T because the A and C emissions spectra are somewhat closer so there is some overlap in the emission spectra. When the real base is A, there will be some additional background in C. The process of extracting intensities, first computes the background for each cluster and signal for each cluster. Then subtracts the background from the signal for each cluster. As noted, we are collecting background from a distance outside the cluster and signal is collected from the very center; the most intense region of the cluster. The inputs to base calling are the compressed intensity files and the compressed location files. After intensity is extracted, it is put through intensity normalization and cross talk estimation. To account for differences in dye deficiencies and the differences in the amount of cross talk between bases. These will be discussed in a moment. Bases are then called and quality scores are determined and the results are stored in fastq files as the output of base calling. Important Caveat There is an important caveat to base calling on the MiSeq; the intensity correction process requires a reasonable base composition to work well. This balance does not have to be perfect and indeed we have sequenced both Rhodobacter at 69% GC and B. Cereus at 36% GC and both give good results. Intensity correction is calculated across the first 12 base pairs and since the MiSeq does not have a separate control lane samples without base composition balance in the first 12 bases may be challenging. These samples could be run on the HiSeq or GA which have the ability to use control lanes to make up for this requirement. However, it is important to notice that we have demonstrated between 36 and 69% GC balanced given good results and it probable that the good results are available outside this range. Intensity Normalization Intensity normalization takes the raw peak intensities and normalizes them for all four bases; G, A, T, and C. So that they have a common mean to reduce the effects from different dye channels efficiencies. At the left, there is a simplified example for two nucleotides that initially have different intensities per molecule. Post normalization these intensities would be the same.

Cross Talk Estimation In addition, we estimate cross talk values. Cross talk values are the values for signal overlap between the emissions spectra between the dyes used. The MiSeq uses two wavelengths to excite the fluors but reads four emission spectra, one for each base. For overlapping emissions spectra, it is possible to define and remove overlaps in signal to achieve to purer read. Graphical View of Cross Talk Correction Background Noise Correction As a graphical view of cross talk correction, on the left, you can see that signal in one channel pre cross talk correction effects signal in the other channel. On the right hand side, this cross talk has been eliminated and signal in one channel is independent of signal in the second channel. So the channels themselves are parallel to one or the other axes. We also improved signal purity by correcting for the small number of strands at each cycle that do not add the base pair expected. In this case, within a single cluster of about a thousand strands the vast majority will have added C as expected. However, a very small number of cycle of strands will be phase that is say they will be one base pair behind the current base pair addition. And an additional small number will be prephased that is to say they will be at least one base pair ahead of the expected position. Correction for phasing and prephasing requires unbiased base content of template that is used for the calculation as discussed previously. It is worth noting that the last base of a read cannot be corrected for prephasing and so typically has a slightly increased amount of noise relative to previous cycle. After phasing, prephasing, normalization and cross talk correction are taken into account, base calling is performed by looking at the highest intensity signal. The base then with the highest intensity is the one called except for base positions where all bases are very low intensity where we may call no base. In this case, the C intensity is much higher than the A, G, T and the base called is C. Quality Filtering of Clusters Once the first 25 bases have been called, the CHASTITY filter is calculated for each cluster over the first 25 bases of sequence. CHASTITY is a rough measure of signal purity over the 25 bases of sequence. It is the ratio of the highest intensity to the sum of the highest and second highest intensities. As shown in the diagram, the highest intensities is C and the second highest is intensity is G. These are noted as I a and I b. CHASTITY, C, is the ratio of I a over the sum of I a plus I b.

Quality Filtering of Clusters For perfect signal, the second highest signal is zero, CHASTITY is one because the formula reduces to I a over I a. For very poor signal where the highest and second highest signals are equal, I a equal I b, the equation reduces to zero point 5, one over two, or one half. To pass filter all but one of the first 25 bases of sequence for each cluster must have a CHASTITY of at least point six. This quality filter removes overlapping and low intensity clusters. Base Calls Quality Scores Base Call Quality Scores Important Quality Score Caveats Quality scoring is then performed on base calls. Quality scoring predicts the probability of an error in base calling and is calculated using a model that matches observed values of quality predictors to a pre calculated table. This method is very similar to Phred quality scores for Sanger sequencing. Although since the chemistry is different, the predictors of course are also different. Base call quality scores are calculated after quality filtering when CHASTITY is performed. They are generated at the beginning of cycle 25. Quality scores are usually expressed as 1 error in 10 to the q over the 10 base calls of Q quality. For instance, if were to take a base quality score of 30 which is typically accounted as good sequence quality. This would represent one error in 10 to the 30 over 10 or 10 to the 3 or a thousand base calls. This would equate to a base call accuracy of 99.9% and is often referred to as bases of Q 30 quality. There are a number of important caveats, however to the quality score. Quality scoring is based on a model and can be very accurate when conditions used to build the quality table are similar to the data whose quality is to be predicted. At Illumina, our Q tables are created at standard cluster densities and very well characterized DNA samples and are accurate for: Standard library types similar to genomic samples that we train our data on They are also accurate, or most accurate at standard cluster densities and In supported read lengths Quality score specifications for the percent Q30 apply when the above conditions are true. That is to say the library types are similar to genomic samples and the sequencing run is at standard cluster densities and at supported read lengths.

*_fastq After : MiSeq Reporter Course Complete After generation, data is saved in a fastq format, which is an industry standard. This fastq format contains: Information about the read The sequence of the read As well as the Q scores for each base in the read sequence all in a convenient file format. After base calling is completed and fastq files are generated, the MiSeq Reporter or third party software can be used to perform additional data analysis. The MiSeq Reporter is an integrated data analysis package that facilitates a number of workflows that are common to many next generation sequence analyses. These include: Resequencing Amplicon Sequencing Library QC smrna Analysis De Novo assembly 16S Metagenomics analyses MiSeq Reporter operations will be covered in a separate module. This concludes the presentation. Congratulations! You have completed this course.