escience and Post-Genome Biomedical Research

Similar documents

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Next Generation Sequencing: Technology, Mapping, and Analysis

Delivering the power of the world s most successful genomics platform

Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Single Nucleotide Polymorphisms (SNPs)

Genomes and SNPs in Malaria and Sickle Cell Anemia

Information leaflet. Centrum voor Medische Genetica. Version 1/ Design by Ben Caljon, UZ Brussel. Universitair Ziekenhuis Brussel

IMPLEMENTING BIG DATA IN TODAY S HEALTH CARE PRAXIS: A CONUNDRUM TO PATIENTS, CAREGIVERS AND OTHER STAKEHOLDERS - WHAT IS THE VALUE AND WHO PAYS

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B

The NF1-gene a hotspot for de novo Alu- and L1-insertion?

Dr Alexander Henzing

Introduction to Bioinformatics 3. DNA editing and contig assembly

A Primer of Genome Science THIRD

Genetic diagnostics the gateway to personalized medicine

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE E15

Biological Sciences Initiative. Human Genome

European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute

Human Genome Organization: An Update. Genome Organization: An Update

Bioinformatics Resources at a Glance

Sequencing and microarrays for genome analysis: complementary rather than competing?

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects

CCR Biology - Chapter 9 Practice Test - Summer 2012

How To Understand The Science Of Genomics

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

European Medicines Agency

SOP 3 v2: web-based selection of oligonucleotide primer trios for genotyping of human and mouse polymorphisms

TRACKS GENETIC EPIDEMIOLOGY

A leader in the development and application of information technology to prevent and treat disease.

G E N OM I C S S E RV I C ES

Validation parameters: An introduction to measures of

Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility

AP BIOLOGY 2010 SCORING GUIDELINES (Form B)

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Single-Cell DNA Sequencing with the C 1. Single-Cell Auto Prep System. Reveal hidden populations and genetic diversity within complex samples

Evolution (18%) 11 Items Sample Test Prep Questions

Core Facility Genomics

Automated DNA sequencing 20/12/2009. Next Generation Sequencing

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Next Generation Sequencing

Staphylococcus aureus Protein A (spa) Typing

Lecture 3: Mutations

Umm AL Qura University MUTATIONS. Dr Neda M Bogari

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

Module 1. Sequence Formats and Retrieval. Charles Steward

Genomics and Health Data Standards: Lessons from the Past and Present for a Genome-enabled Future

BioBoot Camp Genetics

Genetics of Rheumatoid Arthritis Markey Lecture Series

Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives

NGS and complex genetics

Introduction to NGS data analysis

Analysis of NGS Data

CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.

ITT Advanced Medical Technologies - A Programmer's Overview

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG)

Arabidopsis. A Practical Approach. Edited by ZOE A. WILSON Plant Science Division, School of Biological Sciences, University of Nottingham

TRANSLATIONAL BIOINFORMATICS 101

Roberto Ciccone, Orsetta Zuffardi Università di Pavia

Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr

Targeted. sequencing solutions. Accurate, scalable, fast TARGETED

Genetic Algorithm. Based on Darwinian Paradigm. Intrinsically a robust search and optimization mechanism. Conceptual Algorithm

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

MUTATION, DNA REPAIR AND CANCER

Teaching Bioinformatics to Undergraduates

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

DNA-Analytik III. Genetische Variabilität

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

TECHNOLOGIES, PRODUCTS & SERVICES for MOLECULAR DIAGNOSTICS, MDx ABA 298

Bioinformatics: course introduction

Human Genome and Human Genome Project. Louxin Zhang

NIH s Genomic Data Sharing Policy

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

GenBank: A Database of Genetic Sequence Data

BI122 Introduction to Human Genetics, Fall 2014

Challenges associated with analysis and storage of NGS data

Enhancing Functionality of EHRs for Genomic Research, Including E- Phenotying, Integrating Genomic Data, Transportable CDS, Privacy Threats

Gene and Chromosome Mutation Worksheet (reference pgs in Modern Biology textbook)

Cancer Genomics: What Does It Mean for You?

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

Innovations in Molecular Epidemiology

Gene mutation and molecular medicine Chapter 15

Integration of genomic data into electronic health records

Site-Directed Nucleases and Cisgenesis Maria Fedorova, Ph.D.

Regulatory Issues in Genetic Testing and Targeted Drug Development

Cloud-Based Big Data Analytics in Bioinformatics

Localised Sex, Contingency and Mutator Genes. Bacterial Genetics as a Metaphor for Computing Systems

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

Lecture 13: DNA Technology. DNA Sequencing. DNA Sequencing Genetic Markers - RFLPs polymerase chain reaction (PCR) products of biotechnology

Genomic CDS: an example of a complex ontology for pharmacogenetics and clinical decision support

Transcription:

escience and Post-Genome Biomedical Research Thomas L. Casavant, Adam P. DeLuca Departments of Biomedical Engineering, Electrical Engineering and Ophthalmology Coordinated Laboratory for Computational Genomics (CLCG) Center for Bioinformatics and Computational Biology (CBCB) http://genome.uiowa.edu 1

Outline 1. General Observations about Postgenomic escience 1. Specific Case Study where Traditional Publication and Archiving Practices are Lacking 1. Impact of Emerging Technology on Scale of the escience Problems 2

Post-Genomic escience The Post-Genome Era 3 Primary Types of Investigation 1. Generation of New High-Throughput Data (new Genome Projects ) 2. Generating New Data in the Context of Existing Results (Published or in Databases) 3. No New Data Exclusive re-use of Published Data 3

Case Study: Genomic Rearrangements or Deletions/Duplications (Dr. Krishna Rani Kalari, Mayo Clinic) 1. Goal: Identification of Human disease causing mutations 2. Observation: Assays exist to identify deletions and duplications time consuming laborious expensive 3. Approach: Develop In-silico procedures to identify and prioritize candidate deletion/duplication sites and accelerate the finding of disease mutation discovery 4

Approach Details 1. Construct case and control data sets for all known cases of disease causing unequal recombinations 2. Identify and obtain informative sequencebased features to create a training set 3. Evaluate machine learning methods on the training set 4. Design and develop a computational system to identify and prioritize candidate intragene deletions and duplications 5

Approach - System Level Obtain list of genes that have deletions/duplications Identify candidates From Published Literature Collect all the breakpoint sequences Case sequences Obtain annotation, melting Temperature and hapmap features for the DNA sequences Control sequences System/Analysis Predict deletion or duplication candidates 6

HGMD statistics 2362 genes have 64251 mutations 7 % (4,500) of the mutations in HGMD are caused by gross deletions and duplications. 7

HGMD mutation classification Mutation type Nucleotide substitutions (missene / nonsense) Total number of mutations Nucleotide substitutions (splicing) 46 Nucleotide substitutions (regulatory) 0 Small deletions 52 Small insertions 12 Small indels 1 Gross deletions 2 Gross insertions and duplications 0 Complex rearrangements (inversions) 1 Repeat variations 0 294

HGMD - Gross deletions Accession Number Description Phenotype Reference CG035110 ex. 18 (described at genomic DNA level) CG994802 36 bp nt. 6543 (described at genomic DNA level) Stargardt disease Yatsenko (2003) Hum Mutat 21, 636 Stargardt disease Lewis (1999) Am J Hum Genet 64, 422

Local Deletion Database 10

Of the 4,500 Possible Training Cases, How Many Did We Get??? Searched for specific break point information for 1463 IDDs described in HGMD Identified 102 fully-characterized rearrangement breakpoints (cases) know exactly where the breakpoint occurs Identified 2338 matching set of breakpoints for each of the positives for which IDDs have not been observed (controls) 11

SPeeDD web-interface 12

Lessons From Deletion Case Study Important results and data are buried in traditional forms of scientific publication and dissemination mechanisms (no surprise here). Fidelity and throughput of legacy results inadequate Necessary data can be requested from investigators In some cases Reference to a changing world of what is assumed to be known Stay tuned the problem will only get worse 13

Impact of Emerging Technology on Scale of the escience Problems 14

~3 genomes/week Genomes Online Database v2.0 www.genomesonline.org 15

How did we get here? Advances in genome sequencing were driven by the Human Genome Project Scale-up started in 1999 Resources concentrated in large genome centers Increase in capacity Reduction in cost Economies of scale Improved technology Sequencing infrastructure available for non-human projects 16

Genome Center Perspective (George Weinstock, Wash-U/Baylor GSCs) Research is Data-driven Produce more data Hypothesis generating > hypothesis testing Community resource projects Rapid data release; prepublication Etiquette in use of prepublication data No intellectual property contraints Production is Technology-enabled Develop or acquire new technologies 17

130:1

Human disease study 500 cases + 500 controls 500 genes, 15 exons/targets per gene 2 reads/target 15 million reads to screen 1,000 subjects 454: 10M rds/d or Solexa: 160M rds/d Conclusion: this is a small experiment 19

Project Jim Whole human genome Proof of Principle What can be learned from a single genome? What biases exist in the data? What analysis issues arise? Not a consensus sequence but need to capture both alleles: 6 GB not 3 GB Data quality vs variation: how do you know a variant base is a mutation and not an error 20

Conclusions: Post-genomic escience 1. Generation of New High-Throughput Data (new Genome Projects ) 2. Generating New Data in the Context of Existing Results (Published or in Databases) 3. No New Data Exclusive re-use of Published Data 21