All your base(s) are belong to us

Similar documents
2. True or False? The sequence of nucleotides in the human genome is 90.9% identical from one person to the next. False (it s 99.

Forensic DNA Testing Terminology

Lecture 13: DNA Technology. DNA Sequencing. DNA Sequencing Genetic Markers - RFLPs polymerase chain reaction (PCR) products of biotechnology

Next Generation Sequencing

Automated DNA sequencing 20/12/2009. Next Generation Sequencing

Genomes and SNPs in Malaria and Sickle Cell Anemia

The Techniques of Molecular Biology: Forensic DNA Fingerprinting

- In , Allan Maxam and walter Gilbert devised the first method for sequencing DNA fragments containing up to ~ 500 nucleotides.

CCR Biology - Chapter 9 Practice Test - Summer 2012

Mitochondrial DNA Analysis

Genetics Test Biology I

DNA and Forensic Science

A Primer of Genome Science THIRD

Gene Mapping Techniques

Biology Behind the Crime Scene Week 4: Lab #4 Genetics Exercise (Meiosis) and RFLP Analysis of DNA

Genetics Module B, Anchor 3

Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company

Teacher s Guide - Feature Showcase. Forensic Science. Grades: 6-8 Content Area: Science

SNP Essentials The same SNP story

DNA Sequencing & The Human Genome Project

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Recombinant DNA and Biotechnology

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Structure and Function of DNA

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Single Nucleotide Polymorphisms (SNPs)

How many of you have checked out the web site on protein-dna interactions?

Crime Scenes and Genes

14.3 Studying the Human Genome

Basic Concepts Recombinant DNA Use with Chapter 13, Section 13.2

Human Genome and Human Genome Project. Louxin Zhang

IIID 14. Biotechnology in Fish Disease Diagnostics: Application of the Polymerase Chain Reaction (PCR)

DNA Technology Mapping a plasmid digesting How do restriction enzymes work?

Gene mutation and molecular medicine Chapter 15

Genetic Technology. Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

Next Generation Sequencing: Technology, Mapping, and Analysis

Beginner s Guide to Real-Time PCR

Answer: 2. Uracil. Answer: 2. hydrogen bonds. Adenine, Cytosine and Guanine are found in both RNA and DNA.

The Human Genome Project

DNA Fingerprinting. Unless they are identical twins, individuals have unique DNA

CHAPTER 6: RECOMBINANT DNA TECHNOLOGY YEAR III PHARM.D DR. V. CHITRA

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

July 7th 2009 DNA sequencing

Biological Sciences Initiative. Human Genome

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

Basic Concepts of DNA, Proteins, Genes and Genomes

Introduction to next-generation sequencing data

restriction enzymes 350 Home R. Ward: Spring 2001

Commonly Used STR Markers

Appendix 2 Molecular Biology Core Curriculum. Websites and Other Resources

CURRICULUM GUIDE. When this Forensics course has been completed successfully, students should be able to:

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

DNA, RNA, Protein synthesis, and Mutations. Chapters

Recombinant DNA & Genetic Engineering. Tools for Genetic Manipulation

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

The Human Genome Project. From genome to health From human genome to other genomes and to gene function Structural Genomics initiative

Chapter 11: Molecular Structure of DNA and RNA

A Genomic Timeline Tim Shank 2003

Big data in cancer research : DNA sequencing and personalised medicine

Introduction to NGS data analysis

Genetic Analysis. Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs)

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

How DNA Evidence Works

escience and Post-Genome Biomedical Research

Illumina Sequencing Technology

Biotechnology and Recombinant DNA (Chapter 9) Lecture Materials for Amy Warenda Czura, Ph.D. Suffolk County Community College

MAKING AN EVOLUTIONARY TREE

Protein Synthesis. Page 41 Page 44 Page 47 Page 42 Page 45 Page 48 Page 43 Page 46 Page 49. Page 41. DNA RNA Protein. Vocabulary

Transfection-Transfer of non-viral genetic material into eukaryotic cells. Infection/ Transduction- Transfer of viral genetic material into cells.

G E N OM I C S S E RV I C ES

Willmar Public Schools Curriculum Map

Next Generation Sequencing

DNA PROFILING IN FORENSIC SCIENCE

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

Human Genome Organization: An Update. Genome Organization: An Update

PRACTICE TEST QUESTIONS

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Sanger Sequencing and Quality Assurance. Zbigniew Rudzki Department of Pathology University of Melbourne

DNA. Discovery of the DNA double helix

History of DNA Sequencing & Current Applications

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

Genomics GENterprise

To be able to describe polypeptide synthesis including transcription and splicing

Chapter 6 DNA Replication

Lab 5: DNA Fingerprinting

Putting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable

Intro to Bioinformatics

GenBank: A Database of Genetic Sequence Data

HIV NOMOGRAM USING BIG DATA ANALYTICS

Ten years ago marked the official end of the Human Genome Project, but really it was just the beginning

Use of the Agilent 2100 Bioanalyzer and the DNA 500 LabChip in the Analysis of PCR Amplified Mitochondrial DNA Application

Core Facility Genomics

Bob Jesberg. Boston, MA April 3, 2014

Next generation DNA sequencing technologies. theory & prac-ce

1. Molecular computation uses molecules to represent information and molecular processes to implement information processing.

12.1 The Role of DNA in Heredity

Transcription:

All your base(s) are belong to us The dawn of the high-throughput DNA sequencing era 25C3 Magnus Manske

The place Sanger Center, Cambridge, UK

Basic biology Level of complexity Genome Single (all chromosomes chromosome of an organism) Public domain images from Wikimedia Commons unless noted otherwise DNA Image by Madeleine Price Ball [CC-BY-SA/GFDL]

Genomes and DNA DNA written as sequence of first letters of Adenine, Cytosine, Guanine, Thymine e.g....aggttcgagagccgcatgagac... Each letter called nucleotide or base Genome size does not necessarily correspond to organism size/complexity Organism Genome size Order of magnitude Bacteriophage MS2 3569 bases 103 Yeast 12 million 107 Human 3.2 billion 109 Amoeba dubia 670 billion 1011

Sequencing Episode 1 First sequenced genome: Bacteriophage MS2 (1970s by Walter Fiers) 3569 bases, using several man-decades RNA sequencing (RNA genome) Early DNA sequencing methods e.g. wandering spot (Gilbert and Maxam) yielding 24 base pairs per experiment Chain-termination sequencing method developed by Frederick Sanger in 1975 first rapid DNA sequencing method

Chain termination sequencing Primer DNA Radioactive gels John Schmidt [GFDL] Fluorescent dyes Sequences 300-1.000 bases en block Needs primer to define a starting point

Human Genome Project (1990-2003) Whole Genome Shotgun Sequencing Craig Venter PLoS [CC-BY-2.5] * Sequencing starts at random position * Little need for scientific supervision * Scale up by simply adding more machines and operators * Use computers to reassemble the genome

Solexa new technology sequencing Chain termination method : Few, long reads Solexa : Short reads, but buckets of them (shotgun) DNA fragmentation DNA fragments are affixed on a glass chip Single fragments are amplified in place to clusters Sequencing through amplification using fluorescent dyes Detection through digital photography

Some pictures Pictures courtesy of Harold Swerdlow

Data processing Each run produces ~4TByte of image data Processed into sequence data and quality estimate Long lists of this: @IL4_1106:1:1:539:515 GTACTATATATATAATATAAATATATATATAATAACA GCATTATAGAAAATATAAATATAAACATTTTTATGTT + >>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<;; <<<<<<<<<<<<<<<<<<<9<<<<<<<<<<<<<<<<< ca. 200 bases missing

New tech? So what? All DNA sequences so far stored in GeneBank 200 gigabases (billion bases) at the end of 2007 the sum of 35 years of world-wide sequencing In the first three months of 2008, the Sanger Center produced more than 200 gigabases of new sequencing data With 28 Solexa machines and 20 people Today: Capacity at Sanger is 7.5 terabases/year ~20 gigabases per run per machine Will probably double during next year

A paradigm shift Until this year : I need my gene sequenced! This year : WTF should I do with all the sequencing data? Reassembly is not fully automated New genomes need annotation Automatic annotation does not work well Manual annotators don't scale well

Fun with data storage and processing Output of all sequencing machines ~300 TByte/month Current storage capacity at Sanger ~3 PByte Backup is a problem it will take more time to restore backups from tape than to rerun the sequencing Currently ~4000 CPU cores in computing farm for image analysis / (re)assembly / etc.

What we use this for couple'o worms Fun stuff: zebrafish Mouse Human Gorilla Pig Neanderthal Woolly mammoth yet oh so many bacteria some plants trypanosomes malaria Platypus (alive!)

OK, how much for my genome? Human Genome Project: 1 human genome = $500M Current chain termination sequencing: 1 human genome = $10M Current Solexa sequencing: 1 human genome (20x) : $100K Cheap enough e.g. for the 1000 genomes project

That still sounds expensive! But company XYZ says they'll do it for $400! They don't do sequencing they do genotyping Checking for Single Nucleotide Polymorphisms A chip contains ~1 million known DNA variants e.g. for genetic diseases, phenotypic markers Cheaper, faster, and easier than sequencing Finds only known variants High error rate when used on individuals

GATTACA sooner than you think? Pacific Biotech Single enzyme sequencing 20 nm hole in a chip Multiplexed One human genome in a few minutes for ~ $100 Announced for 2010 Helicos Single molecule sequencing Nanopore Detects DNA as it passes through a nm-sized pore

Some UK news in context Forensic scientists could use DNA retrieved from a crime scene to predict the surname of the suspect, according to a new British study. (BBC, 2006) UK Police now hold DNA 'fingerprints' of 4.5m Britons [~7.5% of the population] taken from individuals who were not charged with any offence, and have no criminal record (Mail, 2007) Police may be given power to take DNA samples in the street [from people for] littering, speeding or not wearing a seat belt (The Guardian, 2007) UK DNA Database Found To Violate Human Rights (European Court of Human Rights, 2008)

Like any technology... it can be used for good and evil Malaria : 300-500 million infections per year 1-2 million deaths, mostly children under the age of 5 Sequencing Plasmodium falciparum : 95 individual genomes now

Malaria sample distribution 3(UK) 1 (China) 3 (S.E. Asia) Honduras (1) El Salvador (1) 17 (Thailand) 20 (Cambodia) 1 (Vietnam) Brazil (1) Columbia (1) Gambia (3) Senegal (1) 16 (Mali) 2 (Ghana) 1 (Sierra Leon) 10 2 (Kenya) (Sudan) 8 (Tanzania) 10 Papua New Guinea

Variation clustering African Papua New Guinea South East Asia 3D7 reference

Geographic visualizations Frequency PCA plots 22

Population networks

And much, much more... Multi-clonal infections Parasite travelling times and routes Insertions, deletions, inversions Copy number variation Detection of completely new variation (new genes etc.)

UAG (the end)