Next Generation Sequencing Technologies in Microbial Ecology. Frank Oliver Glöckner
Max Planck Institute for Marine Microbiology: investigation of the role, diversity and features of microorganisms; interactions with physical and chemical processes in marine and other aquatic habitats. Founded 1992 in Bremen, Germany.
Marine Microbiology at MPI, a Holistic Approach: Who is out there, and how much of which kind? What are they doing, and under which conditions are they doing what?
Promise of NGS: Much Denser Network of Data. Organisms: phylogenetic diversity (qualitative data, quantitative data). Genes: functional diversity (functional inventory, operon structures, expression profiles). Environment: x, y, z, t; environmental descriptors. -> Integrated datasets
Data Integration: www.megx.net (Kottmann et al., NAR, submitted)
Ecological Genomics, The Vision: Statistics; Key parameters; Ecosystems Biology; Modelling; Predictions
Ribosomal RNA as a universal marker gene. Full-cycle rRNA approach (Amann, 1995): sample; extracted nucleic acids; DNA; rRNA; nucleic acid probe; rDNA clones; pyrosequencing; rDNA sequences; comparative analysis; rDNA dataset; hybridization; phylogeny
Diversity Analysis: Sample; PCR; Clone library: 100-500, 2-3 months; High diversity. Pedrós-Alió, Trends in Microbiology, 2006, vol. 14, issue 6, page 257
Diversity Analysis: Sample; PCR; Clone library: 100-500, 2-3 months; Tags: 10,000-50,000, 1 week; High diversity
Problems: Processing the data; Accuracy/Quantitative?; DNA/RNA extraction; Multiple operons; Technical replicates; Noise (sequencing errors)
SILVA Databases, Specifications. Comprehensive & aligned: Bacteria, Archaea, Eukarya; SSU, LSU; regularly updated. Quality first: quality management; transparent process documentation. Integrative: nomenclature; taxonomy; cultured, type strains; habitat (r100)
Growth of SSU ribosomal RNA databases (RDP II & SILVA). [Figure: bar chart of rRNA sequences per year, 1992 to 2008 and SILVA 100; plotted values range from 473 to 995,747 sequences.] Pruesse et al., NAR 2007, vol. 35, no. 21, page 7188. www.arb-silva.de
SILVA SSURef 100: Fully classified guide tree (www.arb-silva.de)
ARB Software Suite (www.arb-home.de): A Software Environment for Sequence Data. Ludwig et al., NAR, 2004. ARB 5.0, 64-bit version, released on 4 September 2009
Problems: Processing the data; Accuracy/Quantitative?; DNA/RNA extraction; Multiple operons; Noise (sequencing errors); Technical replicates
Accuracy: "The rare biosphere: a reality check", Reeder and Knight, 2009, Nature Methods, vol. 6, no. 9, p. 636
SILVA SSUParc 100, Sequence Length Distribution. [Figure: histogram of rRNA sequences (y-axis, 0 to 100,000) by length in bases (x-axis, 0 to 2000).] www.arb-silva.de
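A sequence length distribution like the one on this slide can be tabulated with a short Python sketch. This is an illustration only, not the procedure used for SILVA: the function name `length_histogram`, the 100-base bin width (chosen to match the axis ticks) and the toy lengths are all assumptions.

```python
from collections import Counter


def length_histogram(lengths, bin_width=100):
    """Count sequences per length bin; keys are the lower bin edges in bases."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # Integer division maps each length to the lower edge of its bin.
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(bins.items()))


# Hypothetical toy lengths, not taken from SILVA SSUParc 100.
toy_lengths = [95, 250, 280, 1450, 1499, 1500]
hist = length_histogram(toy_lengths)
for lower_edge, count in hist.items():
    print(f"{lower_edge}-{lower_edge + 99} bases: {count}")
```

In practice the lengths would come from the sequence records themselves, for example by taking `len()` of each sequence parsed from a FASTA file.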
Technical Replicates I: Helgoland sample, 11.02.2009, 454 Ti, 1/2 PTP. Gomez-Alvarez et al., 2009, ISME Journal, pages 1-4
Technical Replicates II: Helgoland sample, 14.04.2009, 454 Ti, 1/2 PTP
Technical Replicates III - Dereplication. [Figure: bar chart of relative abundance per taxonomic group (y-axis, 0.0% to 25.0%), series "SSU" and "Dereplicated"; groups run from Acidobacteria to Viridiplantae, plus Unclassified.]
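Dereplication, as named on this slide, can be illustrated with a minimal Python sketch. It assumes dereplication means collapsing exact duplicate reads into unique sequences with abundance counts; the slides do not state which tool or identity criterion was used. The function name `dereplicate` and the toy reads are hypothetical.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse exact duplicate reads.

    Returns (sequence, count) pairs, most abundant first; ties are broken
    alphabetically so the order is deterministic.
    """
    # Normalise case and whitespace so 'acgt' and 'ACGT' count as one sequence.
    counts = Counter(r.strip().upper() for r in reads if r.strip())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy reads, not data from the Helgoland samples.
toy_reads = ["ACGT", "acgt", "ACGA", "ACGT", "TTGA"]
unique = dereplicate(toy_reads)
for seq, count in unique:
    print(seq, count)
```

Comparing taxonomic profiles before and after this step indicates how much of a group's apparent abundance comes from identical reads.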
A Bioinformatic Workbench for Ecological Genomics: Comprehensive ribosomal RNA databases; A Software Environment for Sequence Data; Organisms, Environment, Genes
Functional Diversity Analysis: Sample; DNA; Fosmids: 40,000, 2-3 months; Random sequencing: 10-100, ~400-4000 ORFs; NGS: 1000, 40,000 ORFs; End sequencing: 20,000 ORFs; High diversity
Fosmid Sequencing: Fosmids: 42; PTP: 1/4, 454 Ti; min.: 101 bp; max.: 48,265 bp; mean: 8,085 bp; Costs: 3000 Euro; 48 contigs > 10 kb
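Summary figures of the kind listed on this slide (minimum, maximum and mean length, number of contigs above 10 kb) can be computed from a list of contig lengths with a short Python sketch. The function name `contig_summary` and the toy lengths are hypothetical; the slide's own values come from the actual 454 assembly, which is not reproduced here.

```python
def contig_summary(lengths, long_cutoff=10_000):
    """Summarise contig lengths in bp: count, min, max, mean, contigs above the cutoff."""
    if not lengths:
        raise ValueError("need at least one contig length")
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        # Strictly greater than the cutoff, matching "contigs > 10 kb".
        "n_long": sum(1 for n in lengths if n > long_cutoff),
    }


# Hypothetical toy lengths in bp, not the contigs from the slide.
toy_contigs = [150, 5_000, 12_000, 30_000]
summary = contig_summary(toy_contigs)
print(summary)
```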
Functional Diversity Analysis: Sample; DNA; Tags: 500,000-1 million, 1 week; Statistics; BLAST; COGs; Classify; Assembly?; High diversity
Assembly
Are we prepared for the Data Flood?
Computing Infrastructure. Cooling: liquid cooling with 5,000 L/h at 8 °C. Power: 25-30 kW at peak; installed: 40 kVA. Storage: 8 TByte RAID file server; 4 TByte RAID database server. Computers: 43 cluster nodes; several larger servers; 400 CPU cores
Moore's Law - Outcompeted
Take Home Message for the Next Generation Biologists. Three languages: mother tongue; English; Perl, Python. Data management: garbage in -> garbage out! Standardisation: www.gensc.org
The Group (http://www.microbial-genomics.de). Thanks for your attention