Handling next generation sequence data

Similar documents
Overview sequence projects

Genetic Analysis. Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

How is genome sequencing done?

Bio-Informatics Lectures. A Short Introduction

Next Generation Sequencing

Automated DNA sequencing 20/12/2009. Next Generation Sequencing

Universidade Estadual de Maringá

Bioinformatics Resources at a Glance

History of DNA Sequencing & Current Applications

July 7th 2009 DNA sequencing

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

Overview of Next Generation Sequencing platform technologies

Introduction to next-generation sequencing data

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

14/12/2012. HLA typing - problem #1. Applications for NGS. HLA typing - problem #1 HLA typing - problem #2

Next generation DNA sequencing technologies. theory & prac-ce

Next Generation Sequencing; Technologies, applications and data analysis

The Power of Next-Generation Sequencing in Your Hands On the Path towards Diagnostics

BRCA1 / 2 testing by massive sequencing highlights, shadows or pitfalls?

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Managing and Conducting Biomedical Research on the Cloud Prasad Patil

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Distributed Data Mining in Discovery Net. Dr. Moustafa Ghanem Department of Computing Imperial College London

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

NGS data analysis. Bernardo J. Clavijo

Illumina Sequencing Technology

Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks

Running a Bioinformatics Help Desk. Solved and Unsolved Problems

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Next Generation Sequencing; Technologies, applications and data analysis

A Tutorial in Genetic Sequence Classification Tools and Techniques

Bioinformatics Grid - Enabled Tools For Biologists.

Writing & Running Pipelines on the Open Grid Engine using QMake. Wibowo Arindrarto DTLS Focus Meeting

escience and Post-Genome Biomedical Research

-> Integration of MAPHiTS in Galaxy

Next Generation Sequencing for DUMMIES

How many of you have checked out the web site on protein-dna interactions?

Computational Genomics. Next generation sequencing (NGS)

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

Hadoopizer : a cloud environment for bioinformatics data analysis

Concepts and methods in sequencing and genome assembly

Sequencing the Human Genome

CONDOR as Job Queue Management for Teamcenter 8.x

Sanger Sequencing and Quality Assurance. Zbigniew Rudzki Department of Pathology University of Melbourne

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

An Overview of DNA Sequencing

DNA Sequencing Overview

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

Appendix 2 Molecular Biology Core Curriculum. Websites and Other Resources

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

Cloud Ready for Bioinformatics?

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

Genome Sequencer System. Amplicon Sequencing. Application Note No. 5 / February

Genomics GENterprise

Simon Miles King s College London Architecture Tutorial

Whole genome sequencing of foodborne pathogens: experiences from the Reference Laboratory. Kathie Grant Gastrointestinal Bacteria Reference Unit

Putting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable

Algorithms for Next Generation Sequencing Data Analysis

A W orkflow Management System for Bioinformatics Grid

Development of Bio-Cloud Service for Genomic Analysis Based on Virtual

DNA Sequencing & The Human Genome Project

Molecular and Cell Biology Laboratory (BIOL-UA 223) Instructor: Ignatius Tan Phone: Office: 764 Brown

DNA Sequence Analysis

CLC Sequence Viewer USER MANUAL

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Biological Sequence Data Formats

Impact Analysis for Software changes in OpenLAB CDS A.01.04

Debian Med. Integrated software environment for all medical purposes based on Debian GNU/Linux. Andreas Tille. OSWC, Malaga Debian.

An agent-based layered middleware as tool integration

Reading DNA Sequences:

Bioinformatics I, WS 09-10, D. Huson, January 27,

Large-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of

Institutional Partnership Program

Intro to Bioinformatics

Data deluge (and it s applications) Gianluigi Zanetti. Data deluge. (and its applications) Gianluigi Zanetti

Transcription:

Handling next generation sequence data a pilot to run data analysis on the Dutch Life Sciences Grid Barbera van Schaik Bioinformatics Laboratory - KEBB Academic Medical Center Amsterdam

Very short intro on high throughput sequencing Sanger sequencing High throughput sequencing 23-01-2009 2

DNA building blocks >chr1 taaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta accctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac cctaacccaaccctaaccctaaccctaaccctaaccctaaccctaacccc taaccctaaccctaaccctaaccctaacctaaccctaaccctaaccctaa ccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaa ccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaac cccaaccccaaccccaaccccaaccctaacccctaaccctaaccctaacc ctaccctaaccctaaccctaaccctaaccctaaccctaacccctaacccc taaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccct aaccctaaccctaaccctcgcggtaccctcagccggcccgcccgcccggg tctgacctgaggagaactgtgctccgccttcagagtaccaccgaaatctg tgcagaggacaacgcagctccgccctcgcggtgctctccgggtctgtgct gaggagaacgcaactccgccggcgcaggcgcagagaggcgcgccgcgccg gcgcaggcgcagacacatgctagcgcgtcggggtggaggcgtggcgcagg cgcagagaggcgcgccgcgccggcgcaggcgcagagacacatgctaccgc gtccaggggtggaggcgtggcgcaggcgcagagaggcgcaccgcgccggc gcaggcgcagagacacatgctagcgcgtccaggggtggaggcgtggcgca ggcgcagagacgcaagcctacgggcgggggttgggggggcgtgtgttgca ggagcaaagtcgcacggcgccgggctggggcggggggagggtggcgccgt gcacgcgcagaaactcacgtcacggtggcgcggcgcagagacgggtagaa cctcagtaatccgaaaagccgggatcgaccgccccttgcttgcagccggg cactacaggacccgcttgctcacggtgctgtgccagggcgccccctgctg gcgactagggcaactgcagggctctcttgcttagagtggtggccagcgcc ccctgctggcgccggggcactgcagggccctcttgcttactgtatagtgg tggcacgccgcctgctggcagctagggacattgcagggtcctcttgctca aggtgtagtggcagcacgcccacctgctggcagctggggacactgccggg ccctcttgctccaacagtactggcggattatagggaaacacccggagcat http://en.wikipedia.org/wiki/dna ATGCTGTTTGGTCTCAgtagactcctaaatatgggattcctgggtttaaa agtaaaaaataaatatgtttaatttgtgaactgattaccatcagaattgt actgttctgtatcccaccagcaatgtctaggaatgcctgtttctccacaa 23-01-2009 agtgtttacttttggatttttgccagtctaacaggtgaagccctggagat 3 tcttattagtgatttgggctggggcctggccatgtgtatttttttaaatt tccactgatgattttgctgcatggccggtgttgagaatgactgcgcaaat

Sanger sequencing Mix of "standard" nucleotides and labelled dideoxynucleotides (chain-terminating nucleotides) Gel or capillary sequencing One sequence at the time Robots: up to 384 samples in one run Bentley 2006, Curr opinion in genetics http://en.wikipedia.org/wiki/dna_sequencing 23-01-2009 4

Overview sequencing methods Synthetic chain-terminator chemistry (Sanger sequencing) Sequencing by hybridisation Pyrosequencing Base-by-base sequencing by synthesis Sequencing by ligation Nanopore technology Single-molecule sequencing by synthesis in real time 23-02-2009 5

People and labs Sequence facility Marja Jakobs Ted Bradley Frank Baas Laboratory Division Bioinformatics laboratory - KEBB Angela Luyf, Silvia D Olabarriaga Tristan Glatard Barbera van Schaik Antoine van Kampen Roche (454) sequencer 23-01-2009 6

Pyrosequencing - Roche FLX (454) Sample preparation Nature Methods 2007 advertisement 454.com 23-01-2009 7

Pyrosequencing - Roche FLX (454) Sequencing process Nature Methods 2008 23-01-2009 8

Pyrosequencing - Roche FLX (454) Summary One sequence per bead Amplification in oil/water emulsion Fix bead in container (picotiterplate) Put plate with containers in machine Wash one nucleotide at the time over plate -> light emission Take picture Wash next nucleotide over plate Nature Methods 2008 and 454.com 23-01-2009 9

Data pre-processing 10011 00101 01010 Binary to Fasta CACTC TGCGT >seqa CACTC CAGGA AACAG CGACA TGCGT >E9108QN01BVB2T length=238 xy=0649_3411 region=1 run=r_2008_05_06_17_51_52_ CACTCCAGGAAACAGCTATGACCTCTGCCTGGAAAGCCAGGTGCCTGTGGGCAGAGCCCA GGACCACAGGGCCAGGGGTATCTCGTGTTCCTGTCCTGGCCGCGGATCTTCTTCTCCATC TCAGCGTCTGTCAGAGTCTCCAGCAGTGGGCACCACTGGTCCGCATCGCCCGTGTTCCGG ATGGCAATCTCCACTGTGGGCAGAGGGTTCTCGCTACGAGGAGGGAGGCAGTGAGAGG 23-01-2009 10

Further analysis Mutation detection: BLAST against reference Virus discovery: BLAST against virus database Gene expression: BLA(S)T against gene reference set Chip-on-sequencing: BLA(S)T against genome sequence Pre processing BLAST Quality check BLAT Feature count 23-01-2009 11

High throughput sequence data explosion One sequence run: 2 GB (>400,000 reads) Per day: 6 GB (1,200,000 reads) Per week (5d): 30 GB (6,000,000 reads) Per year: 1500 GB (312,000,000 reads) This becomes worse when the Roche system is upgraded! 23-01-2009 12

Pilot: run bioinformatics tools on the Grid Experience with earlier projects Many computation intensive tasks This pilot: BLAST as (small) test case Advantages of Grid Sharing of data storage and computing power Parallel computing (multiple jobs at same time) Disadvantage of Grid Complex system to work with Currently bioinformatician friendly systems are available End-user interface for Grid usage Workbench for building workflows 23-01-2009 13 System to run workflows on the Grid

Outline Components Dutch Life Sciences Grid VBrowser Workflows Taverna Moteur GASW webservices Interaction between the components Discussion Experiences so far and considerations Wish list for Life Sciences Grid Current status and future work 23-01-2009 14

Interaction between the components Taverna workflow management system export import Scufl file (XML) VBrowser lsgrid 23-01-2009 15

Virtual Laboratory for escience Bioinformatics 23-01-2009 16 http://www.vl-e.nl/

Dutch Life Sciences Grid Roll out GRID infrastructure in the Netherlands Sharing of data storage and computer power 23-01-2009 17 http://www.biggrid.nl/

VBrowser SARA AMC 23-01-2009 18 http://www.vl-e.nl/vbrowser/

Workflows We want to create a bioinformatics pipeline for sequence analysis Modular building blocks that perform a single task Connect blocks to create a program Sequence files Pre-processing BLAST BLAST files 23-01-2009 19

Taverna http://www.ebi.ac.uk/tools/webservices/tutorials/workflow/taverna 23-01-2009 20 http://taverna.sourceforge.net/

Workflow management systems Difference between Taverna and Moteur Sequence files Sequence files AMC Pre-processing Pre-processing Job on Grid node BLAST BLAST NCBI BLAST files BLAST files Job on Grid node 23-01-2009 21

Generic Application Service Wrapper (GASW) Configuration files (XML) GASW services 23-01-2009 22 http://rainbow.i3s.unice.fr/wiki/dokuwiki/doku.php?id=public_namespace:moteur

Example config file for wrapping a perl script with GASW <description> <executable name="sff2fasta.pl"> <access type="lfn"/> <value value="/grid/lsgrid/angela/sequence_wf/michel_28_10_2008/perlscripts/sff2 fasta.pl"/> <input name="tarfile" option="no0"> <access type="lfn"/> </input> <input name="sfffile" option="no1"> <access type="lfn"/> </input> <output name="out_sff2fasta.txt" option = "no2"> <template value="/grid/lsgrid/angela/sequence_wf/michel_28_10_2008/sff2fasta_out/%s _fasta.fna"/> <access type="lfn"/> </output> <output name="out_sff2fasta.txt" option = "no3"> <template value="/grid/lsgrid/angela/sequence_wf/michel_28_10_2008/sff2fasta_out/%s

Interaction between the components Taverna workflow management system export import Scufl file (XML) VBrowser Grid certificate: I am me lsgrid 23-01-2009 24

Screenshot VBrowser/Moteur (1) 23-01-2009 25 http://rainbow.i3s.unice.fr/wiki/dokuwiki/doku.php?id=public_namespace:moteur

Screenshot VBrowser/Moteur (2) 26

Outline Components Dutch Life Sciences Grid VBrowser Workflows Taverna Moteur GASW webservices Interaction between the components Discussion Experiences so far and considerations Wish list for Life Sciences Grid Current status and future work 23-01-2009 27

Experiences so far and considerations Request certificate for the Life Sciences Grid Learn how all components work Wrap our applications for use in Grid workflows Ship databases and blast executables to the Grid 23-01-2009 28

Wish list for Life Sciences Grid related to sequence analysis Public databases GenBank Bioinformatics tools BLAST EMBOSS BioPerl 23-01-2009 29

Outline Components Dutch Life Sciences Grid VBrowser Workflows Taverna Moteur GASW webservices Interaction between the components Discussion Experiences so far and considerations Wish list for Life Sciences Grid Current status and future work 23-01-2009 30

Status Current status Wrapped with GASW: Perl scripts for pre-processing of Roche sequence data BLAST and BLAT Build a workflow of these components in Taverna Ran workflow successfully on the Life Sciences Grid with Moteur Future work Submit workflows for multiple sequence runs Real (computation intensive) application Examine how we can build a system for end-users NBIC Bioassist - sequencing platform 23-01-2009 31

Angela Luijf Bioinformatics Laboratory a.c.luyf@amc.nl Tristan Glatard Creatis-LRMN Lyon France glatard@creatis.insa-lyon.fr Barbera van Schaik Bioinformatics Laboratory b.d.vanschaik@amc.nl Silvia D Olabarriaga Bioinformatics Laboratory s.d.olabarriaga@amc.nl Frank Baas Neurogenetics Sequencing facility f.baas@amc.nl Antoine van Kampen Bioinformatics Laboratory a.h.vankampen@amc.nl