DNA Sequencing & The Human Genome Project




DNA Sequencing & The Human Genome Project: An Endeavor Revolutionizing Modern Biology
Jutta Marzillier, Ph.D., Lehigh University, Biological Sciences
November 13th, 2013


Guess who turned 60 earlier this year!
The discovery of the structure of the DNA double helix, published in 1953 by Watson and Crick in the journal Nature, thereby launching the Genomic Age.

Guess who turned 10 earlier this year!
The Human Genome Project
- 2001: Draft Human Genome Sequence
- 2003: Finished Human Genome (50 years after the DNA structure was solved)
Two sequencing techniques were published in 1977:
- Sanger et al.: DNA sequencing by chain termination (dideoxy sequencing)
- Maxam & Gilbert: DNA sequencing by chemical modification
The original Sanger method, together with multiple improvements in chemistry and computation, led to the complete sequencing of the human genome with its 3 billion base pairs (and of many other genomes).
Sanger sequencing is expensive, and Next-Generation Sequencing (NGS) technology took its place.

Genome Trivia
- How many base pairs (bp) are there in a human genome? ~3 billion bp (haploid)
- How much did it cost to sequence the first human genome? ~$3 billion
- How long did it take to sequence the first human genome? ~13 years
- When was the first human genome sequence complete? ~2000-2013
- Why is the Human Genome information important?

Outline
- Dogma of Biology
- DNA and its Sequencing
- Biological Dogma and Principle of DNA Synthesis
- Sanger Sequencing and its Improvements
- Next-Generation Sequencing (NGS) Technologies
- 3rd Generation Technologies
- What do we do with sequencing information?

Why is the Knowledge about the Human Genome interesting?
- The human body has about 100 trillion cells with more than 200 different cell types.
- Each cell harbors the same genetic information in its nucleus in the form of DNA-containing chromosomes.
- Depending on the cellular, developmental, and functional stage of a cell, only a subset of genes is expressed.
Sources: http://www.genomenewsnetwork.org/articles/06_00/sequence_primer.shtml, http://www.pharmainfo.net/files/images/stories/article_images/

Why is Genome Sequencing Important?
To obtain a blueprint:
- DNA directs all the instructions needed for cell development and function
- DNA underlies almost every aspect of human health, both in function and dysfunction
- To study gene expression in a specific tissue, organ or tumor
- To study human variation
- To study how humans relate to other organisms
- To find correlations between genome information and the development of cancer, susceptibility to certain diseases, and drug metabolism (pharmacogenomics)
Outlook: Personalized Genomics (George Church, Harvard)

The establishment of various sequencing technologies gave rise to many new fields in biology, medicine and engineering:
- identification of genes contributing to disease
- synthetic biology
- personalized medicine
- genomic screening
- ...
http://jenniferdayan.girlshopes.com/brca1europe/#

What is DNA?
- DNA constitutes the heritable genetic information that forms the basis for the developmental programs of all living organisms.
- A genome is an organism's complete set of DNA.
- DNA is made up of four building blocks called nucleotides.
- The DNA sequence is the particular side-by-side arrangement of bases along a DNA strand.
- DNA sequencing is a biochemical method to determine the sequence of the nucleotide bases that make up the DNA.
Watson & Crick, 1953

Principle of DNA Synthesis
Arthur Kornberg demonstrated DNA replication in a cell-free (in vitro) bacterial extract (Nobel Prize, 1959)
- Discovered DNA polymerase (Pol I), which facilitates DNA synthesis
- Unraveled the mechanism of DNA synthesis:
  - nucleotide building blocks are required
  - a single DNA strand serves as a template
  - the polymerase can only extend a pre-existing chain (primer)
  - a free 3' hydroxyl end is required
Watson et al., MBG, 2008

Reaction Mechanism for DNA Synthesis
The 3' hydroxyl group of the primer attacks the α-phosphoryl group of the incoming nucleotide, thereby forming a phosphodiester bond and releasing pyrophosphate and H+ (SN2 reaction).
Watson et al., MBG, 2008

Dideoxy Sequencing according to Sanger
Frederick Sanger, Nobel Prize (1980)
- Both nucleotide types (deoxy- and dideoxynucleotides) can be incorporated into the growing DNA chain.
- The presence of a dideoxycytidine in the growing chain blocks further addition of incoming nucleotides, because it lacks the 3' hydroxyl group required for extension.
Watson et al., MBG, 2008

Dideoxy Method of Sequencing (Sanger, 1977)
- DNA synthesis is carried out in the presence of limited amounts of fluorescently labeled dideoxyribonucleoside triphosphates, which results in chain termination.
- Chain termination generates fragments of distinct sizes that can be separated by gel electrophoresis.
fasciaworld.com.23/dideoxy-chain-termination
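The chain-termination idea can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: every position of the new strand yields exactly one terminated fragment, each fragment carries a label for its final (dideoxy) base, and sorting by length stands in for electrophoresis. The template and all function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesized_strand(template_3_to_5):
    """Strand the polymerase builds (5'->3') on a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def terminated_fragments(strand):
    """One fragment per position: (length, label of the terminating base)."""
    return [(length, strand[length - 1]) for length in range(1, len(strand) + 1)]


def read_sequence(fragments):
    """Order fragments by size, as a gel or capillary does, and read the labels."""
    return "".join(label for _, label in sorted(fragments))


template = "TACGGCTA"                      # hypothetical template, 3'->5'
strand = synthesized_strand(template)      # newly synthesized strand, 5'->3'
fragments = terminated_fragments(strand)
random.Random(0).shuffle(fragments)        # the reaction tube holds an unordered mixture
print(read_sequence(fragments))
```

Reading the labels from the shortest to the longest fragment recovers the newly synthesized strand, which is the complement of the template.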

Separation of Sequencing Fragments by Capillary Gel Electrophoresis
- Samples passing a detection window are excited by a laser, and the emitted fluorescence is read by a CCD camera.
- Fluorescent signals are converted into base calls.
- High resolution
- Read length up to 1,000 nucleotides
www.mun.ca/biology/scarr/

ABI 310 Genetic Analyzer
- Capillary and electrode: negatively charged DNA migrates towards the anode
- DNA fragments labeled with different fluorescent dyes migrate according to their size past a laser

http://www.youtube.com/watch?v=srwvn1munma

High-Throughput Whole Genome Sequencing
Analysis of 384 sequencing reactions in parallel
George Church, Scientific American, January 2006, pp. 47-54

JGI Sequencing Facility (Joint Genome Institute, US Department of Energy)
Assume more than 20 384-capillary sequencers running simultaneously:
- approx. 700 bp per capillary run
- approx. 3 hours per run
- approx. 40 million bases per day in one facility
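The daily throughput estimate can be checked with a short calculation. This is a back-of-the-envelope sketch that assumes exactly 20 instruments running around the clock with no downtime between runs; the function name is hypothetical.

```python
def bases_per_day(sequencers, capillaries, bp_per_capillary_run, hours_per_run):
    """Raw bases produced per day if all instruments run back to back."""
    runs_per_day = 24 / hours_per_run
    return sequencers * capillaries * bp_per_capillary_run * runs_per_day


# Figures from the slide: 20 sequencers, 384 capillaries, 700 bp, 3 h per run
total = bases_per_day(20, 384, 700, 3)
print(f"{total:,.0f} bases per day")
```

Under these assumptions the result is about 43 million bases per day, in line with the slide's figure of approximately 40 million.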

How was Sanger Genome Sequencing Done?
Shotgun sequencing:
- Break the genome into random fragments, insert the fragments into a vector, sequence each of the fragments, and assemble the fragments based on sequence overlaps
Clone by clone:
- Create a crude physical map of the whole genome by restriction mapping before sequencing
- Break the genome into overlapping fragments, insert them into BACs, and introduce them into E. coli
Note: a known flanking region is needed to anneal the primer
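Assembling fragments "based on sequence overlaps" can be sketched as a greedy merge of the best-overlapping pair of reads. This is a toy illustration, not how production assemblers work: it assumes error-free reads from a single strand, ignores repeats and reverse complements, and all names and the example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three hypothetical overlapping reads, given in arbitrary order
contigs = greedy_assemble(["CCTGAAGT", "ATGGCGTA", "GCGTACCTG"])
print(contigs)
```

When no overlaps remain, the function returns several contigs instead of one, which mirrors the unordered contigs a draft assembly produces.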

Finishing Draft Sequences using Assembly Software
- Primary sequencing reads: align sequences for homologies
- Fragment assembly into contigs
- Color code: green = 3-fold coverage, yellow = 2-fold coverage
Goal for genome sequencing:
- Sanger: 8-fold coverage
- 454: 30-fold coverage
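Fold coverage is the total number of sequenced bases divided by the genome size. The sketch below works out how many reads an 8-fold Sanger target would take, assuming a 3 billion bp genome and 700 bp reads (both figures appear elsewhere in this deck), uniform read length, and no overlap losses. The function names are hypothetical.

```python
import math


def fold_coverage(n_reads, read_length, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return n_reads * read_length / genome_size


def reads_needed(genome_size, read_length, target_fold):
    """Smallest number of reads that reaches the target fold coverage."""
    return math.ceil(target_fold * genome_size / read_length)


genome_size = 3_000_000_000   # haploid human genome, bp
sanger_reads = reads_needed(genome_size, 700, 8)
print(f"{sanger_reads:,} reads of 700 bp for 8-fold coverage")
```

Roughly 34 million reads would be needed under these assumptions, which explains why facilities ran hundreds of capillaries in parallel.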

De Novo Shotgun Sequencing and Assembly
Genomic DNA is randomly fragmented into 300-800 bp fragments, size-selected, and sequenced. The generated sequences are then assembled into a number of unordered and unoriented contigs by the Newbler software.
http://454.com/applications/eukaryote-whole-genome.asp

Consensus Sequence TGCGGCTGCCAGATTTTGTACGGGTTTGGAAGTCGACGGAGAGAACAGCG CGGGCCTAGAAGGCCCCGTAATGCCCCCTGAGAGCCCCGTAGACGGACG AACGGTGCGGATCGATAGATGGCACCGGAGACAAGCGAAGACGGCCGCA GAGCCGTCGCCGGCTGACGCCCGCGTAGGAAGATATTCGTGTGAAGTGC GTCACATTCTACGGGTGAAACGCGAAAGTGGAAGGTTCCTTACCTATGGAG GGGTAAGGGAGCGAGCTCCAGCGAGCGACCGCACCCCGACATAGGTTCT TGTCGGGGTAGTCGAACGGAGAGAGACTACCCCTTTTAGCGACCTCCGGT CGCCCAGGTAGGTACCGAACGATGAGAGAGGTACCTAGACCGTACAGGCC GGGGTTTATCCCCCGGCCGATACAGCATGGTCATTTTGGGTAGGTACGTTA CGTAAGCATCACTCACCAACAGAACCAGTGGTTACGTAACCGGGTACGTTA CGTACGACTAGATACGTAACAGAACCACTAACCCGTGGCCGCCCGAAGGC GGCCCGCAGCGGGTTACGTTTGTCAGGTATGTCACTAGGGAGGGTGAACT ATGAATCTCAAGGAGCATCAGAAGGAGCTTCTTCGTCTACTGCTCGTGTCT GAGTCGATGACGTATGAGCGTGCCTACTCGAAGGTGGAGACCATCCTGCA CCTCATCCAGTCCTTCGACGCGGAGAACTCGATAAGTGAGCTGGGAGTCA TCTGACCGGCGCGATCGTCTGCCGACCGACTGGCCTCGCATCCGCCGCG AGGTTCTGCGGGCGGCTGGCCACCGCTGCCAGATCCGCTACGCGGACAT CTGCACAGGGATGGCTACCGAGGTTGATCACGTCCGCTACCGCGACGAG GCGTCACCTCTGCAGGTGTCGTGCAGACCGTGCCATGCGCGGAAGTCCG CGATGGAAGGCGTTGCTCAGCGTGCGAAGCTGCGCGCGATGAAGAAGCG

Genome Annotation: Searching for Open Reading Frames by Identifying all possible Start and Stop Codons
Start: ATG
Stop: TAA, TAG, TGA
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCA
AGCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGC
GCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGG
GACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCT
GTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGT
GCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCG
CCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGA
GCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCC
CTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATT
GTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACC
Bioinformatics: attach biological information to the sequence
Nirenberg
http://www.brooklyn.cuny.edu/bc/ahp/BioInfo/graphics/GP.GeneticCode.GIF
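The search for open reading frames can be sketched as a scan for an ATG followed in the same frame by one of the three stop codons. This is a minimal illustration: it checks only the three forward reading frames (a real annotation pipeline would also scan the reverse complement and apply length and scoring filters), and the function name and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch in the forward frames.

    min_codons counts the start and stop codons as part of the ORF.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None                   # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGGCCCTGTGGTAAGG")        # hypothetical short sequence
print(orfs)
```

An ATG with no in-frame stop codon downstream is not reported, so an ORF that runs off the end of a contig would be missed by this simple version.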

How to identify protein encoding genes? http://genome.wellcome.ac.uk/assets/gen10000676.jpg

What did we learn from HGP?
- Less than 2% of the human genome codes for protein
- The human genome encodes approx. 21,000 protein-coding genes
- The human genome sequence is almost exactly the same (99%) in all people
- The genome size does not correlate with the number of estimated genes
- Humans have only about twice the number of genes as a fruit fly and barely more genes than a worm!
Organism     Genome Size   Est. Gene #
E. coli      4.6 Mb        4,400
Yeast        12.1 Mb       6,200
Roundworm    97 Mb         19,700
Fruit Fly    180 Mb        13,600
Rice         389 Mb        37,500
Human        3200 Mb       25,000
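The point that genome size does not track gene number can be made concrete by computing gene density from the table values. This is a small sketch using only the slide's estimates; the variable and function names are hypothetical.

```python
# (genome size in Mb, estimated gene number), taken from the slide's table
GENOMES = {
    "E. coli": (4.6, 4_400),
    "Yeast": (12.1, 6_200),
    "Roundworm": (97, 19_700),
    "Fruit Fly": (180, 13_600),
    "Rice": (389, 37_500),
    "Human": (3200, 25_000),
}


def genes_per_mb(size_mb, gene_count):
    """Estimated genes per megabase of genome."""
    return gene_count / size_mb


density = {name: genes_per_mb(*values) for name, values in GENOMES.items()}
for name, value in density.items():
    print(f"{name:10s} {value:8.1f} genes per Mb")
```

By these estimates the human genome carries fewer than 10 genes per Mb, while E. coli carries roughly 950, and the fruit fly has a larger genome than the roundworm but fewer genes.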

The Human Genome Project Sequence Represents a Composite Genome describing Human Variation
- Different sources of DNA were used for the original sequencing (Celera: 5 individuals; HGSC: many)
- The term genome is used as a reference to describe a composite genome
- The many small regions of DNA that vary among individuals are called polymorphisms:
  - mostly single nucleotide polymorphisms (SNPs)
  - insertions/deletions (indels)
  - copy number variation
  - inversions
SNPs:
- the human genome has at least 10 million SNPs
- most of these SNPs contribute to human variation
- some of them may influence the development of diseases and the susceptibility to certain drugs, toxins, and infectious agents
Sources: http://scienceblogs.com/digitalbio/2007/09/genetic_variation_i_what_is_a.php#more, genome.cshlp.org

Causes of Sickle Cell Anemia and Cystic Fibrosis have been pinpointed to specific mutations in their Protein-Coding DNA
Cystic fibrosis:
- Patients have a deletion of three base pairs in the CFTR sequence
- The protein folds incorrectly and is marked for degradation
Sickle cell anemia:
- Red blood cells carrying mutant hemoglobin are deprived of oxygen
Sources: scienceandsociety.emory.edu, http://evolution.berkeley.edu/evolibrary/article/mutations_06, http://learn.genetics.utah.edu/content/disorders/whataregd/sicklecell/

The Race for the $1,000 Genome
- Human Genome Project (2001, initial draft): > $3 billion (includes development of technology); raw expenses estimated at $300 million
- Rhesus macaque (2006): $22 million
- By end of 2007: $1-2 million for a full mammalian genome sequence
Wanted: The $1,000 Genome!
- low cost
- high throughput
- high accuracy

Next Generation Sequencing (NGS) Focuses on Miniaturization and Parallelization
Sanger sequencing compared with cyclic array sequencing (454 Pyrosequencing, Life Technologies SOLiD, Illumina HiSeq 2000)
- Sequence from both ends
- Easier library preparation
- Fragment amplification on solid surfaces
- Nucleotide-to-nucleotide sequence determination
- Generate a dense planar array of DNA features
- Apply cycles of enzyme-driven biochemistry
- Imaging-based data collection

Next generation sequencing: 454 method (Pyrosequencing)
- Shear DNA into 300 to 400 nt fragments
- Ligate to heterobifunctional adapters
- Capture DNA strands on beads and amplify by emulsion PCR
- Place beads with amplified DNA fragments into a picotiter plate (1.7 million wells)
- Perform pyrosequencing reaction
Figure label: 44 µm

Solexa/Illumina Sequencing Technology
Illumina: reversible terminators; Sanger: irreversible terminators
Use a set of deoxynucleotides that each carry
- a fluorescent label that can be cleaved off
- a reversibly terminating moiety at the 3' hydroxyl position

Ion Torrent Sequencing
- Based on the detection of hydrogen ions that are released during the polymerization of DNA
- Ion semiconductor sequencing exploits this by determining whether a hydrogen ion is released upon providing a single species of dNTP to the reaction
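Providing one dNTP species at a time can be illustrated with a toy flow simulation. This is a sketch under stated assumptions: the flow order "TACG" is hypothetical, the signal is assumed to scale with the number of bases incorporated in a flow (so a run of identical bases gives a proportionally larger signal), and there is no noise. The function names are hypothetical.

```python
def flow_signals(new_strand, flow_order="TACG", max_flows=100):
    """Simulate sequential dNTP flows over a strand being synthesized.

    Returns a list of (nucleotide, bases_incorporated) per flow; a count of 0
    means no hydrogen ions were released in that flow.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(new_strand) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        count = 0
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            count += 1      # each incorporation releases a hydrogen ion
            pos += 1
        signals.append((nucleotide, count))
        flow += 1
    return signals


def call_bases(signals):
    """Turn per-flow incorporation counts back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flow_signals("TTACGGA")
print(signals)
print(call_bases(signals))
```

The same flow-by-flow logic applies to pyrosequencing, where light rather than a pH change reports each incorporation.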

BGI: The Sequencing Factory
Purchased 128 HiSeq 2000 sequencers from Illumina in January 2010, each of which can produce 25 billion base pairs of sequence a day

Sudden and Profound Out-Pacing of Moore's Law
- Transition from Sanger sequencing to Next-Generation Sequencing technologies
- Moore's Law describes a long-term trend in the computer hardware industry that involves the doubling of 'compute power' every two years
www.genome.gov/sequencingcosts
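To see what out-pacing Moore's Law means numerically, the sketch below asks how long a doubling-every-two-years trend would take to reach a given fold change. It uses the 100,000-fold drop in sequencing cost quoted in the "Lessons learned" slide of this deck; the function name is hypothetical.

```python
import math


def moore_years_for_fold_change(fold, doubling_period_years=2.0):
    """Years a steady doubling trend needs to accumulate the given fold change."""
    doublings = math.log2(fold)
    return doublings * doubling_period_years


years = moore_years_for_fold_change(100_000)
print(f"{math.log2(100_000):.1f} doublings, about {years:.0f} years at Moore's Law pace")
```

A 100,000-fold change corresponds to roughly 17 doublings, or about 33 years at that pace, far longer than the period covered by this talk.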

Single DNA Strand Sequencing (Pacific Biosciences)
Reportedly generating sequences an average of 1,500 bp long
Nature, Vol 470:155. 2011

DNA Single Molecule Sequencing through Graphene Nanopore
Measuring the electric signature of each type of DNA base as it is pulled through the nanopore within an electrical field
Nature, Vol 467:164-165. 2010

GWAS: Genome-Wide Association Study
- An approach that identifies gene loci associated with a particular trait
- Analyzes genetic differences between people with specific illnesses, such as diabetes or heart disease, and healthy individuals
- The purpose of the studies is to explore the connection between specific genes, known as genotype information, and their outward expression, known as phenotype information
- Open access repository

1000 Genomes Project (Nature, Nov 1st, 2012, Vol 491, pg 56)
- Sequenced the whole genomes of a total of 1092 individuals from a pool of 14 different populations, selected by their ancient migratory history and genetic relationship to each other
- Donors were not known to have any diseases
- The study provides the genomic background to understand which genetic variations are within the normal range
- Genetic variations found in the populations analyzed were categorized by how frequently they appeared in the individuals tested:
  - more than 5%: common variants
  - between 5% and 0.5%: low-frequency variants
  - less than 0.5%: rare variants
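The frequency categories translate directly into a small classifier. This is a sketch with two assumptions that the slide does not state: values exactly at 5% or 0.5% are placed in the low-frequency class, and allele frequency is computed over two chromosome copies per individual. The function names are hypothetical.

```python
def allele_frequency_percent(allele_count, n_individuals):
    """Frequency of an allele in percent, assuming two copies per individual."""
    return 100 * allele_count / (2 * n_individuals)


def classify_variant(frequency_percent):
    """Map an allele frequency (in percent) to the slide's three categories."""
    if frequency_percent > 5:
        return "common"
    if frequency_percent >= 0.5:   # boundary handling is an assumption
        return "low frequency"
    return "rare"


# An allele seen on 10 of the 2 x 1092 chromosomes in the study cohort
freq = allele_frequency_percent(10, 1092)
print(f"{freq:.2f}% -> {classify_variant(freq)}")
```

With 1092 individuals, an allele has to appear on at least 11 chromosomes to leave the rare class under these assumptions.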

1000 Genomes Project
GOAL: identify the SNPs that are present at 1% or greater frequency
- At the genetic level any two people are more than 99% alike
- But rare variants (those that occur with a frequency of 1 percent or less in a population) are thought to contribute to rare diseases as well as common conditions like cancer, heart disease and diabetes
- All participants submitted anonymous samples
- Whole genomes were sequenced at 5x coverage; protein-coding genes were sequenced at 80x coverage
Discovered:
- a total of 38 million SNPs
- 1.4 million short stretches of insertions or deletions
- 14,000 large DNA deletions
The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, October 31, 2012. DOI: 10.1038/nature11632

ENCODE (September 6, 2012)
- Goal: build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control the cells and circumstances in which a gene is active
- More than 80% of the human genome has at least one biochemical activity
- Many regulatory regions are transcribed into RNA with regulatory function

The Cancer Genome Atlas (TCGA)
- Reveal commonalities between cancer types, investigate molecular abnormalities, and define mutations that are confined to specific tumors
- Development of prognostic, diagnostic and therapeutic strategies

Lessons learned
- Data sharing
- 2000: 4 eukaryotic genomes; 2012: > 250 eukaryotic genomes, > 1000 human genomes
- Fewer protein-coding genes than thought
- Many regulatory RNA elements
- Study of genetic variation and evolutionary origins
- Cost of sequencing has fallen 100,000-fold; high-throughput sequencing
- Obtaining a genome blueprint is not sufficient to explain the occurrence of or susceptibility to disease; the combination of many factors needs to be taken into account
- The combination of various approaches (GWAS, 1000 Genomes, ENCODE, TCGA, and others) will provide leads to the origin and treatment of human diseases.

Summary
- Sanger sequencing provided the basis to initiate and complete the Human Genome Project and many other genomes
- Next Generation Sequencing (NGS), with high throughput and lower costs, took its place for genome sequencing
- The 454 pyrosequencing method marketed by Roche in 2005 is able to sequence 400 to 600 million base pairs per run with 400-500 base pair read lengths (compared to Sanger sequencing with 350,000 base pairs and 800 base pair read lengths)
- There are other NGS technologies capturing the market (Applied Biosystems, Illumina, Ion Torrent, Pacific Biosciences)
- The ultimate goal is to sequence single-stranded DNA in real time