Focusing on results, not data: comprehensive data analysis for targeted next generation sequencing

Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

Abstract

Oxford Gene Technology (OGT) has developed a fully integrated, targeted sequencing and analysis service, which incorporates primary analysis with advanced variant detection and annotation. The next generation sequencing (NGS) data analysis pipeline provides complete sequence data processing, including annotation, filtering and ranking of single nucleotide variants (SNVs) and insertions and deletions (indels). The results are presented in an easy-to-navigate, interactive web-based report. The pipeline is ideal for biomedical and translational researchers who do not have ready access to bioinformatics expertise, and is designed to make it easy to understand the results obtained from targeted sequencing and whole exome studies. In addition, expert and bespoke analysis services are available for more complex sample interrogation, for example loss of heterozygosity analysis, multi-genome comparisons, advanced filtering and trio analysis.

Introduction

With the rapid introduction of NGS into the research pipeline, many scientists are keen to reap the benefits this new technology offers. As with any new technique, the workflows are complex and change rapidly, and data analysis is one of the most frequently cited problems. Some of the reasons for this include:

- Limited access to experienced computational biologists
- Poor experimental design
- Data complexity and size
- Incomplete or inaccurate data analysis
- Insufficient monitoring of QC steps

OGT has developed a complete workflow for targeted sequencing, including exome analysis and custom bait hybridisation, which combines project design, target capture and sequencing with customisable data analysis.
This approach allows the researcher to focus on the results, specifically the high-quality, relevant, filtered and ranked variant information, rather than on the intermediate steps involved and the large volumes of data produced. To ensure the quality of the results, each experimental and computational step is subject to strict quality checks.

OGT's NGS data analysis pipeline

To enable the rapid generation of high-quality results, OGT has developed a single, integrated and automated NGS analysis pipeline comprising scalable and rigorously tested bioinformatics tools (Figure 1). Although automated, the pipeline retains the flexibility to accommodate bespoke analysis for individual projects. The results are presented in a user-friendly, easily navigated web-based report. Variants are presented as lists that can be filtered and ranked according to user-defined criteria, making it easy to focus on their biological significance.

The ideal data analysis pipeline should tell you:

- How well did the sequencing run perform?
- What variants are present in the samples?
- Which variants are biologically relevant?

Free data analysis offer (see back page)
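The question of how well a sequencing run performed is usually answered from per-base Phred quality scores, where a score Q corresponds to an estimated error probability P = 10^(-Q/10). As a minimal sketch (not OGT's implementation), assuming the common Phred+33 ASCII encoding used in FASTQ files, the Python below decodes a quality string and flags reads whose mean quality falls below a threshold. The function names and the threshold of 20 are illustrative assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The Phred+33 offset is an assumption; older files may use other offsets.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_quality(quality_string, min_mean_q=20):
    """Return True if the read's mean Phred score meets the assumed threshold."""
    scores = phred_scores(quality_string)
    if not scores:
        return False  # an empty read carries no usable base calls
    return sum(scores) / len(scores) >= min_mean_q


if __name__ == "__main__":
    good = "IIIIIIIIII"  # 'I' encodes Q40 in Phred+33
    poor = "##########"  # '#' encodes Q2 in Phred+33
    print(passes_quality(good), passes_quality(poor))
```

A mean-quality cut-off is only one of several possible read filters; trimming low-quality read ends or filtering on the proportion of low-quality bases are common alternatives.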
Figure 1: OGT's NGS data analysis pipeline

How well did the sequencing run perform?

Sequence quality

All sequencing runs contain errors, and these need to be identified before any data analysis is performed. During sequencing, every base is assigned a Phred quality score. This score is used to ensure that poor-quality reads and low-quality base calls do not compromise the sequence mapping and subsequent variant and indel detection.

What variants are present in my samples?

Mapping, assembly and alignment

The millions of reads that are generated from each run are aligned to a reference genome. The pipeline uses a fast short-read aligner based on the Burrows-Wheeler Transform method, which is capable of aligning reads to reference genomes with gaps, an essential feature for indel detection. A huge number of short-read alignment packages are available, and OGT has evaluated them to ensure the data analysis pipeline produces optimum sequence alignment and validation-ready variant calls.

Alignment quality

An important but easily overlooked step is the refinement of an alignment once it has been generated. Many people assume that the output of a short-read aligner gives them the best possible set of read alignments to the reference genome, but in practice what is generated is a best guess based on the heuristics of the aligner. It is important to refine the initial alignment for accuracy, especially around problematic sites such as indels. Indels can cause misalignment of reads, generating false-positive calls (Figure 2). The OGT NGS pipeline scans the draft alignment to identify problematic areas, and local realignment is then carried out on these regions to minimise the number of mismatching bases across all reads. This creates a clean consensus, allowing reliable variant detection.

Figure 2: The importance of re-alignment around indels. (A) The initial alignment of reads suggests that the sample is heterozygous for two SNVs and one deletion. (B) After local re-alignment, it is clear that the SNV calls are artefacts and that the sample is homozygous for the 3 bp deletion.

Duplicate identification

During sample preparation, the DNA is fragmented and amplified into clusters of randomly overlapping fragments, which are then sequenced from both ends (paired-end sequencing). Ideally, each fragment should be sequenced only once. The OGT NGS pipeline verifies the 5′ coordinates and mapping orientations of each pair of sequences. Where multiple sequences map to the same location, only the best pair of sequences is processed further. Failure to control for these kinds of duplicates, which arise from PCR bias, can easily introduce many false-positive variant calls into an experiment, making subsequent analysis much more time consuming.

Base quality re-calibration

Reliable quality scores are a prerequisite for the statistical detection of variants. To improve the base quality scores, the pipeline includes an automated step to re-calibrate the initial Phred scores. This step, which is omitted from many NGS pipelines, takes into account a number of covariates, including the position of the base within the read, the quality of neighbouring bases and the machine cycle, before adjusting the initial calls. This enables identification of high-quality bases and increases confidence in the base calls made.

Which variants are biologically relevant?

The accurate quality scores and high-quality alignments generated by the data analysis pipeline allow reliable detection of SNPs and indels. At this point a tremendous amount of data analysis and QC has been performed on the samples, but the data is still effectively raw, consisting of a heavily optimised sequence alignment. Only after this extensive processing is it possible to call variants confidently. OGT includes automated, sensitive filters for both SNPs and indels so that only the highest quality variant calls are processed. Once the list of variants has been generated, these too need to be annotated to identify the most likely candidate mutations in the context of the study. A typical exome will contain approximately 12,500 coding variants, including ~700 coding indels and ~10,400 nonsynonymous single nucleotide variants. Identification and characterisation of the biologically relevant variants is the most important part of the data analysis pipeline.

Analysis of variants

To enable meaningful filtering and ranking of variants, it is essential to determine whether they are functionally relevant.
Every SNV and indel is checked for previous characterisation in dbSNP (to assign allele frequencies), then checked against all human transcripts in Ensembl to add biological annotation and determine whether it affects a promoter, untranslated region (UTR), regulatory region, splice site, intronic region or coding region. If it affects a coding region, the change is assessed to determine whether it is a synonymous or a potentially harmful nonsynonymous variant. Additionally, the likely effect on protein function is predicted using SIFT (Sorting Intolerant From Tolerant) and PolyPhen (Polymorphism Phenotyping). SIFT uses sequence homology to predict whether the variant has caused an amino acid substitution that will affect the function of the protein; a score from 0 to 1 is given, where 0 is damaging and 1 is neutral. PolyPhen also predicts the impact of an amino acid substitution on the structure and function of a human protein. The PolyPhen prediction is based on a series of empirical rules and looks at sequence conservation and the structure of the protein to determine the effect; again a score is given, where 0 is neutral and a high positive number is damaging. Condel (Consensus deleteriousness score of missense SNVs) is used to aggregate the scores from SIFT and PolyPhen to make an additional weighted prediction.

The ideal data analysis pipeline should deliver data that is:

- Straightforward and simple to understand
- Flexible and relevant to your project
- Rapidly accessible, so time is spent on interpretation, not on handling raw data
- Easy to share with colleagues
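The SIFT and PolyPhen scoring conventions described above can be turned into a simple ranking. The sketch below is illustrative only and is not the pipeline's method: the `Variant` fields are hypothetical, the SIFT cut-off of 0.05 is a commonly used threshold rather than one stated here, the PolyPhen cut-off is a placeholder, and the combination rule is a plain vote rather than the weighted Condel score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative, not the report's schema.
    name: str
    sift: Optional[float]      # 0 = damaging ... 1 = neutral
    polyphen: Optional[float]  # 0 = neutral ... higher = more damaging


def damaging_votes(v, sift_cutoff=0.05, polyphen_cutoff=0.85):
    """Count how many predictors flag the variant as damaging.

    Both cut-offs are assumptions chosen for illustration.
    """
    votes = 0
    if v.sift is not None and v.sift <= sift_cutoff:
        votes += 1
    if v.polyphen is not None and v.polyphen >= polyphen_cutoff:
        votes += 1
    return votes


def rank_variants(variants):
    """Sort variants so those flagged by more predictors come first.

    Ties are broken by the lower (more damaging) SIFT score; a missing
    SIFT score is treated as neutral (1.0).
    """
    return sorted(
        variants,
        key=lambda v: (-damaging_votes(v), v.sift if v.sift is not None else 1.0),
    )


if __name__ == "__main__":
    calls = [
        Variant("var_A", sift=0.40, polyphen=0.10),
        Variant("var_B", sift=0.01, polyphen=0.95),
        Variant("var_C", sift=0.03, polyphen=None),
    ]
    for v in rank_variants(calls):
        print(v.name, damaging_votes(v))
```

A vote count keeps the example transparent; a weighted consensus such as Condel would replace `damaging_votes` with a single combined score.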
Figure 3: The analysis report. All of the sequencing results and quality control metrics are provided in an easy-to-navigate web-based interface. This interface can be used on any desktop PC and shared with colleagues. The user can see (A) the project summary and (B) information on all variations found, listed by sample, which can then be (C) searched and filtered based on user criteria, with additional interpretation using the extensive links to external data sources (e.g. Ensembl).
Flexible: get the data most relevant to your project

Although the data analysis pipeline delivers interesting variants from raw data, we understand that every project is different. In addition to our advanced pipeline analysis, OGT can therefore provide expert analysis tailored to individual project requirements. This could be anything from performing multi-genome analysis on matched pairs or trios, to focusing on specific candidate loci for custom targeted resequencing projects, to analysing the genome for regions where a loss of heterozygosity (LOH) or copy number variation has occurred.

Rapid: spend time on biological interpretation, not analysing raw data

OGT's Genefficiency sequencing service takes approximately 8 weeks from sample receipt to result delivery, depending on the number of samples being processed. Bespoke analysis takes a little longer, depending on the individual requirements of the project, and is performed in close collaboration with the researcher. Data is delivered in a format that is simple to interrogate and share. As soon as you receive the analysis report you can start answering your research question, rather than spending valuable time trying to verify the quality of the underlying data or assessing and running appropriate analysis tools.

For more information about Genefficiency Sequencing Services, visit www.ogt.co.uk/genefficiency or contact us on +44 (0)1865 856826.

Free data analysis on all targeted sequencing projects

Get free and comprehensive data analysis on all targeted sequencing services, including whole exome or custom panel sequencing, until 10 August 2012.* This offer includes the following analyses:

- Filtering and annotation of variants with dbSNP, and consequence prediction with SIFT and PolyPhen
- Multi-genome comparison (e.g. trio analysis)
- Cancer exome analysis (i.e. identification of somatic mutations, CNV analysis and LOH analysis in paired tumour/normal samples)

For more information or to request a quote, contact OGT at contact@ogt.co.uk or call +44 (0)1865 856826.

* Offer applicable to all first orders of OGT's Genefficiency Targeted Sequencing Service including both capture and sequencing. Offer expires on 10 August 2012. Please enquire for full details and pricing.

Oxford Gene Technology
T: +44 (0)1865 856826
E: services@ogt.co.uk
W: www.ogt.co.uk

This document and its contents are © Oxford Gene Technology IP Limited 2012. All rights reserved. OGT, Genefficiency and Oxford Gene Technology are trademarks of Oxford Gene Technology IP Limited.