Focusing on results not data: comprehensive data analysis for targeted next generation sequencing



Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

Abstract

Oxford Gene Technology (OGT) has developed a fully integrated, targeted sequencing and analysis service, which incorporates primary analysis with advanced variant detection and annotation. The next generation sequencing (NGS) data analysis pipeline provides complete sequence data processing, including annotation, filtering and ranking of single nucleotide variants (SNVs) and insertions and deletions (indels). The results are presented in an easy-to-navigate, interactive web-based report. The pipeline is ideal for biomedical and translational researchers who do not have ready access to bioinformatics expertise, and is designed to make it easy to understand the results obtained from targeted sequencing and whole exome studies. In addition, expert and bespoke analysis services are available for more complex sample interrogation, for example loss of heterozygosity analysis, multi-genome comparisons, advanced filtering and trio analysis.

Introduction

With the rapid introduction of NGS into the research pipeline, many scientists are keen to reap the benefits this new technology offers. As with any new technique, NGS workflows are complex and change rapidly, and the complexity of data analysis is one of the most frequently cited problems. Some of the reasons for this include:

- Limited access to experienced computational biologists
- Poor experimental design
- Data complexity and size
- Incomplete or inaccurate data analysis
- Insufficient monitoring of QC steps

OGT has developed a complete workflow for targeted sequencing, including exome analysis and custom bait hybridisation, which combines project design, target capture and sequencing with customisable data analysis. This approach allows the researcher to focus on the results, specifically the high-quality, relevant, filtered and ranked variant information, rather than the intermediate steps involved and the large volumes of data produced. To ensure the quality of the results, each experimental and computational step is subject to strict quality checks.

OGT's NGS data analysis pipeline

To enable the rapid generation of high-quality results, OGT has developed a single, integrated and automated NGS analysis pipeline comprising scalable and rigorously tested bioinformatics tools (Figure 1). Although automated, the pipeline retains the flexibility to accommodate bespoke analysis for individual projects. The results are presented in a user-friendly web-based report that is easily navigated. Variants are presented as lists that can be filtered and ranked according to user-defined criteria, making it easy to focus on their biological significance.

The ideal data analysis pipeline should tell you:

- How well did the sequencing run perform?
- What variants are present in the samples?
- Which variants are biologically relevant?

Free data analysis offer (see back page)

Figure 1: OGT's NGS data analysis pipeline

How well did the sequencing run perform?

Sequence quality
All sequencing runs contain errors, and these need to be identified before any data analysis is performed. During sequencing, every base is assigned a Phred quality score. A Phred score Q encodes the estimated probability P that a base call is wrong as Q = -10 log10(P), so a score of 30 corresponds to a 1 in 1,000 chance of error. This score is used to ensure that poor-quality reads and low-quality base calls do not compromise the sequence mapping and the subsequent variant and indel detection.

What variants are present in my samples?

Mapping, assembly and alignment
The millions of reads generated from each run are aligned to a reference genome. The pipeline uses a fast short-read aligner based on the Burrows-Wheeler Transform, which is capable of aligning reads to reference genomes with gaps, a capability that is essential for indel detection. A huge number of short-read alignment packages are available, and OGT has evaluated them to ensure that the data analysis pipeline produces optimum sequence alignment and validation-ready variant calls.

Alignment quality
An important but easily overlooked step is the refinement of an alignment once it has been generated. Many people assume that the output of a short-read aligner is the best possible set of read alignments to the reference genome, but in practice it is a best guess based on the heuristics of the aligner. It is important to refine the initial alignment for accuracy, especially around problematic sites such as indels. Indels can cause misalignment of reads, generating false positive calls (Figure 2). The OGT NGS pipeline scans the draft alignment to identify problematic areas, and local realignment is then carried out on these regions to minimise the number of mismatching bases across all reads. This creates a clean consensus and allows reliable variant detection.

Figure 2: The importance of re-alignment around indels. (A) The initial alignment of reads suggests that the sample is heterozygous for two SNVs and one deletion. (B) After local re-alignment, it is clear that the SNV calls are artefacts and that the sample is homozygous for the 3 bp deletion.

Duplicate identification
During sample preparation, the DNA is fragmented and amplified into clusters of randomly overlapping fragments, which are then sequenced from both ends (paired-end sequencing). Ideally each fragment should be sequenced only once. The OGT NGS pipeline verifies the 5′ coordinates and mapping orientations of each pair of sequences. Where multiple sequences map to the same location, only the best pair of sequences is processed further. Failure to control for these duplicates, which arise from PCR bias, can easily introduce many false-positive variant calls into an experiment and make subsequent analysis much more time consuming.

Base quality re-calibration
Reliable quality scores are a prerequisite for the statistical detection of variants. To improve the base quality scores, the pipeline includes an automated step to re-calibrate the initial Phred scores. This step, which many NGS pipelines omit, takes into account a number of covariates, including the position of the base within the read, the quality of neighbouring bases and the machine cycle, before adjusting the initial scores. This enables identification of high-quality bases and increases confidence in the base calls made.

Which variants are biologically relevant?

The accurate quality scores and high-quality alignments generated by the data analysis pipeline allow reliable detection of SNPs and indels. At this point a tremendous amount of data analysis and QC has been performed on the samples, but the data is still effectively raw, consisting of a heavily optimised sequence alignment. Only after this extensive processing is it possible to call variants confidently. OGT includes automated, sensitive filters for both SNPs and indels so that only the highest quality variant calls are processed. Once the list of variants has been generated, the variants need to be annotated to identify the most likely candidate mutations in the context of the study. A typical exome will contain approximately 12,500 coding variants, including ~700 coding indels and ~10,400 nonsynonymous single nucleotide variants. Identification and characterisation of the biologically relevant variants is the most important part of the data analysis pipeline.

Analysis of variants
To enable meaningful filtering and ranking of variants it is essential to determine whether they are functionally relevant.
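The duplicate identification step described above can be illustrated with a minimal Python sketch. It is not OGT's implementation: the `ReadPair` fields and the function names are hypothetical, and the rule used to choose the "best" pair (the highest summed Phred base quality, decoded from a Phred+33 quality string) is an assumption, since the text does not say how the best pair is chosen.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    """A mapped read pair; all field names here are hypothetical."""
    name: str
    chrom: str
    start1: int    # 5' coordinate of the first read
    strand1: str   # mapping orientation of the first read, '+' or '-'
    start2: int    # 5' coordinate of the mate
    strand2: str   # mapping orientation of the mate
    quals: str     # base qualities of both reads, Phred+33 encoded


def phred_sum(quals: str, offset: int = 33) -> int:
    """Sum the Phred scores decoded from an ASCII quality string."""
    return sum(ord(c) - offset for c in quals)


def pick_best_pairs(pairs):
    """Keep one pair per group sharing 5' coordinates and orientations."""
    groups = defaultdict(list)
    for p in pairs:
        key = (p.chrom, p.start1, p.strand1, p.start2, p.strand2)
        groups[key].append(p)
    # Assumption: "best" means the highest summed base quality.
    return [max(g, key=lambda p: phred_sum(p.quals)) for g in groups.values()]


pairs = [
    ReadPair("r1", "chr1", 100, "+", 350, "-", "IIIIIIII"),
    ReadPair("r2", "chr1", 100, "+", 350, "-", "IIII####"),  # duplicate of r1
    ReadPair("r3", "chr1", 120, "+", 350, "-", "IIIIIIII"),
]
kept = pick_best_pairs(pairs)
print([p.name for p in kept])  # ['r1', 'r3']
```

Here r1 and r2 share the same 5′ coordinates and orientations, so only r1, which has the higher total base quality, is kept. Production tools typically flag duplicates in the alignment file rather than discarding them.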
Every SNV and indel is checked for previous characterisation in dbSNP (to assign allele frequencies) and then checked against all human transcripts in Ensembl to add biological annotation and determine whether it affects a promoter, untranslated region (UTR), regulatory region, splice site, intronic region or coding region. If it affects a coding region, the change is assessed to determine whether it is a synonymous or a potentially harmful nonsynonymous variant. Additionally, the likely effect on protein function is predicted using SIFT (Sorts Intolerant from Tolerant) and PolyPhen (Polymorphism Phenotyping). SIFT uses sequence homology to predict whether the variant has caused an amino acid substitution that will affect the function of the protein. It gives a score from 0 to 1, where 0 is damaging and 1 is neutral. PolyPhen also predicts the impact of an amino acid substitution on the structure and function of a human protein. The PolyPhen prediction is based on a series of empirical rules and considers sequence conservation and the structure of the protein to determine the effect. Again a score is given, where 0 is neutral and a high positive number is damaging. Condel (Consensus deleteriousness score of missense SNVs) is used to aggregate the scores from SIFT and PolyPhen to make an additional weighted prediction.

The ideal data analysis pipeline should deliver data that is:

- Straightforward and simple to understand
- Flexible and relevant to your project
- Rapidly accessible, so time is spent on interpretation not handling raw data
- Easy to share with colleagues
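To make the annotation-based filtering and ranking described under "Analysis of variants" concrete, here is a small Python sketch. It is only an illustration: the `Variant` fields, the 1% allele frequency cut-off and the simple sort order are assumptions, and it does not reproduce Condel's weighted aggregation. The score directions follow the text, with low SIFT scores and high PolyPhen scores indicating damaging changes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    """An annotated variant; all field names here are hypothetical."""
    variant_id: str
    consequence: str               # e.g. "nonsynonymous", "synonymous", "intronic"
    allele_freq: Optional[float]   # dbSNP allele frequency, None if not in dbSNP
    sift: Optional[float]          # 0 = damaging, 1 = neutral
    polyphen: Optional[float]      # 0 = neutral, higher = more damaging


def is_candidate(v: Variant, max_freq: float = 0.01) -> bool:
    """Keep rare or novel nonsynonymous variants (the cut-off is an assumption)."""
    rare = v.allele_freq is None or v.allele_freq <= max_freq
    return rare and v.consequence == "nonsynonymous"


def rank_key(v: Variant):
    """Sort so that low SIFT and high PolyPhen scores come first."""
    sift = v.sift if v.sift is not None else 1.0            # unscored: treat as neutral
    polyphen = v.polyphen if v.polyphen is not None else 0.0
    return (sift, -polyphen)


variants = [
    Variant("var1", "nonsynonymous", None, 0.01, 0.95),
    Variant("var2", "nonsynonymous", 0.30, 0.00, 0.99),  # common, filtered out
    Variant("var3", "synonymous", None, None, None),     # synonymous, filtered out
    Variant("var4", "nonsynonymous", 0.002, 0.40, 0.10),
]
ranked = sorted(filter(is_candidate, variants), key=rank_key)
print([v.variant_id for v in ranked])  # ['var1', 'var4']
```

In an interactive report, the cut-offs and sort order would be user-defined, in line with the filtering and ranking criteria mentioned earlier.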

Figure 3: The analysis report. All of the sequencing results and quality control metrics are provided in an easy-to-navigate web-based interface. This interface can be used on any desktop PC and shared with colleagues. The user can see (A) the project summary and (B) information on all variations found, listed by sample, which can then be (C) searched and filtered based upon user criteria, with additional interpretation using the extensive links to external data sources (e.g. Ensembl).

Flexible: get the data most relevant to your project

Although the data analysis pipeline delivers interesting variants from raw data, we understand that every project is different. In addition to our advanced pipeline analysis, OGT can therefore provide expert analysis tailored to individual project requirements. This could be anything from performing multi-genome analysis on matched pairs or trios, to focusing on specific candidate loci for custom targeted resequencing projects, to analysing the genome for regions where a loss of heterozygosity (LOH) or copy number variation has occurred.

Rapid: spend time on biological interpretation not analysing raw data

OGT's Genefficiency sequencing service takes approximately 8 weeks from sample receipt to result delivery, depending on the number of samples being processed. Bespoke analysis takes a little longer, depending upon the individual requirements of the project, and is performed in close collaboration with the researcher. Data is delivered in a format that is simple to interrogate and share. As soon as you receive the analysis report you can start answering your research question, rather than spending valuable time trying to verify the quality of the underlying data or assessing and running appropriate analysis tools.

For more information about Genefficiency Sequencing Services, visit www.ogt.co.uk/genefficiency or contact us on +44 (0) 1865 856826

Free data analysis on all targeted sequencing projects

Get free and comprehensive data analysis on all targeted sequencing services, including whole exome or custom panel sequencing, until 10 August 2012.* This offer includes the following analyses:

- Filtering and annotation of variants with dbSNP and consequence prediction with SIFT and PolyPhen
- Multi-genome comparison (e.g. trio analysis)
- Cancer exome analysis (i.e. identification of somatic mutations, CNV analysis, and LOH analysis in paired tumour/normal samples)

For more information or to request a quote, contact OGT at contact@ogt.co.uk or call +44 (0)1865 856826.

* Offer applicable to all first orders of OGT's Genefficiency Targeted Sequencing Service including both capture and sequencing. Offer expires on 10 August 2012. Please enquire for full details and pricing.

Oxford Gene Technology
T: +44 (0)1865 856826
E: services@ogt.co.uk
W: www.ogt.co.uk

This document and its contents are © Oxford Gene Technology IP Limited 2012. All rights reserved. OGT, Genefficiency and Oxford Gene Technology are trademarks of Oxford Gene Technology IP Limited.