Statistics Output Guide



Similar documents
Software Getting Started Guide

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

Alignment and Preprocessing for Data Analysis

ECDL / ICDL Project Planning Syllabus Version 1.0

Welcome to Pacific Biosciences' Introduction to SMRTbell Template Preparation.

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

How to Run the PacBio RS II Instrument. 1. Run Instrument. 1.1 How to Run the RS II Instrument

Structural Health Monitoring Tools (SHMTools)

Confidence Intervals for the Difference Between Two Means

Polynomial Neural Network Discovery Client User Guide

Exiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays

Data exploration with Microsoft Excel: univariate analysis

ECDL / ICDL Project Planning Project Management Software Level 2. Syllabus Version 1.0 (UK)

How To Use A High Definition Oscilloscope

A GPS Digital Phased Array Antenna and Receiver

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

WebSphere Business Monitor

HD Radio FM Transmission System Specifications Rev. F August 24, 2011

Exercise 1.12 (Pg )

FREE TV AUSTRALIA OPERATIONAL PRACTICE OP- 59 Measurement and Management of Loudness in Soundtracks for Television Broadcasting

Considerations for Using STAR Data with Educator Evaluation: Ohio

Measurement with Ratios

Northumberland Knowledge

Open Cloud Computing Interface - Monitoring Extension

Electronic Communications Committee (ECC) within the European Conference of Postal and Telecommunications Administrations (CEPT)

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Sequencing Analysis Software Version 5.1

Organizing Your Approach to a Data Analysis

Normality Testing in Excel

Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm

Descriptive Statistics

FileNet System Manager Dashboard Help

Grade 6 Mathematics Performance Level Descriptors

PRODUCT DATA. PULSE Data Manager Types 7767-A, -B and -C. Uses and Features

Realtime FFT processing in Rohde & Schwarz receivers

Analyses: Statistical Measures

Scope and Sequence KA KB 1A 1B 2A 2B 3A 3B 4A 4B 5A 5B 6A 6B

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Grade 6 Mathematics Assessment. Eligible Texas Essential Knowledge and Skills

Activity 3.7 Statistical Analysis with Excel

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Lesson 4 Measures of Central Tendency

IBM SPSS Data Preparation 22

Evaluating System Suitability CE, GC, LC and A/D ChemStation Revisions: A.03.0x- A.08.0x

Measurement Information Model

Foundation of Quantitative Data Analysis

Project Proposal: SAP Big Data Analytics on Mobile Usage Inferring age and gender of a person through his/her phone habits

SUBJECT: New Features in Version 5.3

The Scientific Data Mining Process

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Statistical Data analysis With Excel For HSMG.632 students

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

GETTING STARTED WITH LABVIEW POINT-BY-POINT VIS

Instructions for SPSS 21

Next Generation Sequencing

IBM SPSS Direct Marketing 23

IN current film media, the increase in areal density has

Evaluating the results of a car crash study using Statistical Analysis System. Kennesaw State University

5.2 Drawing Information Consultants Procedure Manual. 5.2 Drawing Information: Large Format Document Standards: Page 1 of 5

Data Preprocessing. Week 2

Prentice Hall Mathematics: Course Correlated to: Arizona Academic Standards for Mathematics (Grades 6)

Video Codec Requirements and Evaluation Methodology

IBM SPSS Direct Marketing 22

Interpreting Data in Normal Distributions

USB 3.0* Radio Frequency Interference Impact on 2.4 GHz Wireless Devices

ProteinPilot Report for ProteinPilot Software

Descriptive statistics parameters: Measures of centrality

Big Ideas in Mathematics

NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS

AN Application Note: FCC Regulations for ISM Band Devices: MHz. FCC Regulations for ISM Band Devices: MHz

NEW MEXICO Grade 6 MATHEMATICS STANDARDS

Statistics Chapter 2

Wave Analytics Data Integration

Framing Business Problems as Data Mining Problems

Summarizing and Displaying Categorical Data

EcOS (Economic Outlook Suite)

2030 Districts Performance Metrics Toolkit

KM Metering Inc. EKM Dash User Manual. EKM Metering Inc. (831)

Packet TDEV, MTIE, and MATIE - for Estimating the Frequency and Phase Stability of a Packet Slave Clock. Antti Pietiläinen

Bar Graphs and Dot Plots

Preprocessing, Management, and Analysis of Mass Spectrometry Proteomics Data

Data exploration with Microsoft Excel: analysing more than one variable

SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

2.500 Threshold e Threshold. Exponential phase. Cycle Number

Characterization and Modeling of Packet Loss of a VoIP Communication

Determine whether the data are qualitative or quantitative. 8) the colors of automobiles on a used car lot Answer: qualitative

NIS-Elements: Using Regions of Interest (ROIs) & ROI Statistics

Adding Pneumatic Preset Counter. Type 497. Continuously visible preset Integrated pneumatic reset 3 or 5-digit display Convenient button setting

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Statistical Distributions in Astronomy

Instruction Manual Service Program ULTRA-PROG-IR

Procedure for Writing Technical Procedures

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS

The Effect of Network Cabling on Bit Error Rate Performance. By Paul Kish NORDX/CDT

Visual Structure Analysis of Flow Charts in Patent Images

Exporting Client Information

Transcription:

Statistics Output Guide Introduction This document describes the file stats.xml, which is produced by the primary analysis pipeline. This file packages summary statistics from a single movie acquisition. The latest version of this documentation is available from the PacBio Developer s Network, at http://www.pacbiodevnet.com/tech_lib. The PipelineStats primary analysis component computes and packages summary statistics from a single movie acquisition on the instrument. It gathers the ZMW metrics defined in the table below, computes all summary statistics, and writes them out to the XML summary statistics file stats.xml. This information is easily consumed by other parts of the system, including instrument management or LIMS systems, without the need for more complex or low-level data APIs. Metrics are computed by the appropriate pipeline stage: PulseToBase is the pipeline stage that implements the core basecalling algorithm: Takes the input pulse stream for each ZMW. Classifies pulses as bases (or not). Assigns quality values to bases. Produces the <>.bas.h5 file. TraceToPulse is the pipeline stage that implements trace signal processing, pulse detection and pulse classification: Takes an input DWS trace and trace metadata for each ZMW. Estimates and subtracts baseline. Detects and classifies pulses. Produces the <>.pls.h5 file. A single PipelineStats component reads metrics through the API for the full movie, computes and writes summary statistics to the XML file, and writes a CSV (comma-separated value) file as an auxiliary result. Page 1

ZMW Metrics in stats.xml Note: = PulseToBase, = TraceToPulse ZMW Metric Name Definition Source HQRegion After basecalling, an algorithm determines the contiguous region of the trace that contains high quality sequence data. This region is named HQRegion. There will always be 0 or 1 HQRegions found in each trace. The ZMW metrics listed below that are generated by PulseToBase () are defined only on the bases falling within the HQRegion. The extent of the HQRegion is defined in bases. For example, the HQRegionStart and HQRegionEnd fields in the sts.csv file give the indices of the first and last basecall in the HQRegion. Base Fraction The fraction of total basecalls made in each base channel. Base Rate The mean global base rate, in bases per seconds. The time window for rate calculation is [ time(end of last base) time(start of first base) ]. Base Width The mean pulse width of pulses called as bases, in seconds. Baseline Sigma A vector-valued hole metric that is an estimate of the mean value of the background (characterized by trace regions with no pulse events present), over the duration of the movie, in each of the DWS trace channels. The units used are photo-electrons. (DWS is dye-weighted sum, referring to the canonical trace representation.) A vector-valued hole metric that is an estimate of the standard deviation of the background (after some locally smooth mean-background subtraction), over the duration of the movie, in each of the DWS trace channels. The units used are photo-electrons. Number of Pulses The total number of pulses identified and input to PulseToBase. Productivity Pulse Rate A prediction that the ZMW was in one of three possible states for the duration of the movie: 0: Empty - No enzyme complex. 1: Productive - Sequencing, single enzyme complex. 2: Other - Multiple occupation or other condition resulting in low-quality data. The mean global pulse rate, in pulses per seconds. The time window for rate calculation is [ time(end of last base) time(start of first base) ]. Pulse Width The mean pulse width, in seconds. Read Length The number of called bases in a ZMW read Read Score SNR HQRegionSnr CmBasQv CmDelQv A scalar-valued metric on a ZMW that predicts accuracy. The metric takes on values in the range 0-1, where 0.95 predicts an overall accuracy of 95% as measured by an alignment of the read to its true template sequence. Signal-to-Noise Ratio: (median called-base pkmid) / (channel baseline sigma), computed for each channel. For holes and/or channels where no bases are called, or for which all called bases have no pkmid defined, the SNR is defined as 0. The Signal-to-Noise ratio computed as above, but limited to only the inside the HQRegion. For holes without an HQRegion, the SNR is defined as 0. Channel-mean base quality value A vector-valued metric defined as the mean QualityValue over bases called in each of (A,C,G,T) channels. A vector-valued metric defined as the mean base Deletion QV over bases called in each of (A,C,G,T) channels. Page 2

ZMW Metric Name Definition Source CmInsQv CmSubQv RmBasQv RmDelQv RmInsQv RmSubQv A vector-valued metric defined as the mean base Insertion QV over bases called in each of (A,C,G,T) channels. A vector-valued metric defined as the mean base Substitution QV over bases called in each of (A,C,G,T) channels. Read-mean base quality value A scalar-valued metric defined as the mean (total) Quality Value over all called bases in the read. A scalar-valued metric defined as the mean base Deletion QV over all called bases in the read. A scalar-valued metric defined as the mean base Insertion QV over all called bases in the read. A scalar-valued metric defined as the mean base Substitution QV over all called bases in the read. Movie Metrics in stats.xml The following definitions specify movie metrics. These comprise the reduced or summary values (statistics) computed and reported by PipelineStats. In most cases, summary metrics (such as the mean, median, and so on) are packaged in a representation of the distribution of the full sample or a filtered sample; for example, only productive holes. The representation of distributions of continuous metrics includes: A sample histogram. The total number of counts in the sample. A computation of the mean, the median and the standard deviation of the sample. Relevant information about how the histogram is formed: bin intervals, outliers, and so on. A presentation-ready description of the metric or classification; for example, to label a plot. The representation of distributions of discrete metrics (hole classifications) includes: A name for each outcome. The total number of counts in the sample. The counts for each outcome over the sample. Page 3

Movie Metric Name Total Base Fraction Baseline Sigma Movie Read Quality Movie Score Productivity Pulse Rate Pulse Width Read Length Read Quality RmBasQv SNR High Quality Base PkMid Sequencing ZMWs Antiholes Antimirrors Fiducial ZMWs Definition The total fraction of A, C, G, and T basecalls for all productive reads. The calculation is made from the input data by computing the number of each base type called by a ZMW (such as BaseFraction_A * ReadLength) and then summing over the ZMWs that pass the Productivity = 1 filter. The distribution of for all holes, in all channels (independently). The distribution of Baseline Sigma for all holes, in all channels (independently). The distribution of Read Quality for all holes in the movie. A score in the range 0-1 that can be used to form a quantitative assessment of overall movie quality. The distribution of the productivity classification. The distribution of Pulse Rate for productive holes. The distribution of Pulse Width for productive holes. The distribution of Read Length for productive holes. The distribution of Read Quality for productive holes. The distribution of the RmBasQv (read-mean base quality value) metric over productive holes. The distribution of the SNR (signal-to-noise-ratio) over productive holes for channels A,C,G,T (independently). The distribution of PkMid from bases in HQRegions (over any holes containing HQRegions possibly non-productive holes) for channels A,C,G,T (independently). The distribution of the Baseline levels over holes designated as sequencing ZMWs for channels A,C,G,T (independently). The distribution of the Baseline levels over antiholes for channels A,C,G,T (independently). An antihole is a grid position containing a micro mirror without a ZMW. This is useful for measuring certain components of background. The distribution of the Baseline levels over antimirrors for channels A,C,G,T (independently). An antimirror is a grid position with no micro mirror and no ZMW. This is useful for measuring certain components of background, and is also used for gridding. The distribution of the Baseline levels over fiducial ZMWs for channels A,C,G,T (independently). A fiducial is a non-standard ZMW; possibly larger than normal. Page 4

Abbreviations Used in stats.xml The following abbreviations and conventions are used in code variables and text file headings to convert the full names defined in the previous table to a short form. Abbreviation Definition Dist Len Max Med Min Num Prod Qual Std Length Maximum Median Minimum Number of Productive Quality Standard Deviation All quantities that are acronyms, such as SNR, are modified to standard sentence-case form for labels. Example: Snr. For sample statistics, the identity of the statistic is appended, not prepended, to the metric name. Example: ReadLenMed, not MedReadLen. All quantities that are vector-valued, where the vector is a 4-vector corresponding to base channels, have component quantities identified by appending _ followed by the corresponding base label. Examples: Abbreviation ReadLenMed ReadLenDist ProductivityDist SnrMean_A BaselineSigmaStd_T Definition The median read length of productive holes. The distribution of the read length of productive holes. The distribution of the productivity classification. The mean SNR of the A channel for all holes. The standard deviation of the Baseline Sigma metric, for all holes in the T channel. Page 5

For Research Use Only. Not for use in diagnostic procedures. Copyright 2010-2011, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions and the applicable license terms at http://www.pacificbiosciences.com/licenses.html. Pacific Biosciences, the Pacific Biosciences logo, SMRT and SMRTbell are trademarks of Pacific Biosciences in the US and/or certain other countries. All other trademarks are the sole property of their respective owners. P/N 001-137-969-01 Page 6