DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

Similar documents

Innovations in Molecular Epidemiology

Pairwise Sequence Alignment

Chapter 6 DNA Replication

Next Generation Sequencing: Technology, Mapping, and Analysis

Forensic DNA Testing Terminology

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Introduction to Bioinformatics 3. DNA editing and contig assembly

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

European Medicines Agency

Introduction to Phylogenetic Analysis

Genomes and SNPs in Malaria and Sickle Cell Anemia

Genetics Module B, Anchor 3

Bob Jesberg. Boston, MA April 3, 2014

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company

Localised Sex, Contingency and Mutator Genes. Bacterial Genetics as a Metaphor for Computing Systems

escience and Post-Genome Biomedical Research

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

MUTATION, DNA REPAIR AND CANCER

The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:

MATCH Commun. Math. Comput. Chem. 61 (2009)

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Chapter 13: Meiosis and Sexual Life Cycles

CCR Biology - Chapter 9 Practice Test - Summer 2012

Replication Study Guide

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Genetics Test Biology I

Provincial Exam Questions. 9. Give one role of each of the following nucleic acids in the production of an enzyme.

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Human Genome and Human Genome Project. Louxin Zhang

Practice Problems 4. (a) 19. (b) 36. (c) 17

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Becker Muscular Dystrophy

Biology Behind the Crime Scene Week 4: Lab #4 Genetics Exercise (Meiosis) and RFLP Analysis of DNA

DnaSP, DNA polymorphism analyses by the coalescent and other methods.

Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs)

Mechanisms of Evolution

BioBoot Camp Genetics

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

4. DNA replication Pages: Difficulty: 2 Ans: C Which one of the following statements about enzymes that interact with DNA is true?

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

a. Ribosomal RNA rrna a type ofrna that combines with proteins to form Ribosomes on which polypeptide chains of proteins are assembled

Human Genome Organization: An Update. Genome Organization: An Update

AP BIOLOGY 2010 SCORING GUIDELINES (Form B)

Bio-Informatics Lectures. A Short Introduction

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes

LECTURE 6 Gene Mutation (Chapter )

Lecture 2: Mitosis and meiosis

MAKING AN EVOLUTIONARY TREE

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Gene Therapy and Genetic Counseling. Chapter 20

Regents Biology REGENTS REVIEW: PROTEIN SYNTHESIS

GenBank: A Database of Genetic Sequence Data

Biological Sequence Data Formats

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Amino Acids and Their Properties

The Human Genome Project

GenBank, Entrez, & FASTA

Biological Sciences Initiative. Human Genome

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

FREQUENTLY ASKED QUESTIONS ABOUT ADMISSIONS TO THE MS HUMAN GENETICS AND GENETIC COUNSELING PROGRAM Updated September 2015

SNP Essentials The same SNP story

Linear Sequence Analysis. 3-D Structure Analysis

Commonly Used STR Markers

Gene mutation and molecular medicine Chapter 15

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Chapter 5: Organization and Expression of Immunoglobulin Genes

HYPOTHESIS TESTING: POWER OF THE TEST

Bioinformatics Resources at a Glance

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

MORPHEUS. Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

Genetomic Promototypes

Answer Key Problem Set 5

Models of Switching in Biophysical Contexts

Genetics Lecture Notes Lectures 1 2

Structure and Function of DNA

PRACTICE TEST QUESTIONS

Milk protein genetic variation in Butana cattle

Worksheet - COMPARATIVE MAPPING 1

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

1 Mutation and Genetic Change

Protein Prospector and Ways of Calculating Expectation Values

Chapter 13: Meiosis and Sexual Life Cycles

Umm AL Qura University MUTATIONS. Dr Neda M Bogari

On Covert Data Communication Channels Employing DNA Steganography with Application in Massive Data Storage

Fact Sheet 14 EPIGENETICS

Transcription:

DNA Insertions and Deletions in the Human Genome Philipp W. Messer

Genetic Variation CGACAATAGCGCTCTTACTACGTGTATCG : : CGACAATGGCGCT---ACTACGTGCATCG 1. Nucleotide mutations 2. Genomic rearrangements 3. DNA insertions / deletions (indels)

Outline 1. Origin and characteristics of indels 2. Indels in protein-coding regions 3. Duplications and genomic correlations 4. Duplications and alignment scores

Part 1 Origin and Characteristics of Indels in the Human Genome [Messer and Arndt, Mol Biol Evol 2007]

Identifying Insertions/Deletions Time ~5 Myr Human Chimp Rhesus High-quality flanks H:..ATACCTCGTACAATAGCGCTGGTACAGATA.. C:..ATACCTCGTACAAT--CGCTGGTACAGGTA.. R:..ATACCTCGTACAAT--CGCTGCTACAGATA.. Insertions can be distinguished from deletions (parsimony)

Insertion/Deletion Statistics Insertions: ~250 000 events 15% SSR Deletions: ~450 000 events 5% SSR

Molecular Mechanisms 1. Unequal Crossing Over (UCO) nonhomologous recombination 2. Replication Slippage (RS) slipped-strand mispairing

Indel Signatures for UCO and RS Insertions: Tandem duplications of preexisting duplicates Deletions: Remove one copy of preexisting duplicates

trace extension d insertion length l

Indel Trace Extensions UCO, RS (insertion) UCO, RS (deletion) Tandem duplication Random indel

Measured Trace Extensions (l=8 bp)

Chromosomal Rate Differences 800 Rates (bp / Mbp) 600 400 200 Autosomes Chr X Chr Y 0 Insertions Deletions 1. Indels occur preferentially in the male germline 2. Indels are not recombination-mediated

Indel Characteristics 1. The majority of insertions are tandem duplications 2. Long preexisting duplicates are often missing 3. Indels occur preferentially in the male germline 4. Indels are not recombination-mediated

Nonhomologous End Joining DNA break End joining Small or no homology required Filling in of single strands

Part 2 Indels in Protein-coding Regions of the Human Genome [Chaux, Messer, Arndt, BMC Evol Biol (submitted)]

Indel Rates in Coding Regions

Genetic Code

Inserted/Deleted Amino Acids

Physico-chemical Properties

Physico-chemical Properties suppressed intensified

Protein Secondary Structure suppressed

Part 3 Tandem Duplications and Genomic Correlations [Messer, Arndt, Lässig, Phys Rev Lett 2005] [Messer, Lässig, Arndt, J Stat Mech 2005] [Messer, Arndt, Nucleic Acid Res 2006]

Sequence Evolution Model Mutation Segmental Duplication length l A C A A G T C C A rate µ rate δ l A T A A G T C C AGA T C C A Random Insertion Segmental Deletion length l A T T G T C C A rate γ l + rate γ l - A GA G A C T T A length l

The Correlation Function Measures likelihood of finding two G/C base pairs separated by a distance r along the genome ( ) = (, + ) 2 ( ) C r P x x r P x G/C G/C G T A T G A T C G A G A A Position: x x+r

Types of Correlation Behavior Random sequence Local correlations Long-range correlations Cr () 0 ( r r) Cr () exp / 0 Cr () r α Noise fluctuations Characteristic scale Generated e.g by Markov-processes Scale free, fractal Generated by a nontrivial dynamical model

Calculation of C(r) for Our Model Approach: continuous time Master Equation formalism Exact equation for the dynamics of C(r) Stationary solution in a continuum limit: C( r) r α with α = 4µ eff λ

Correlations in Genomic DNA Cr () r α distance r distance r correlation C(r) correlation C(r)

Part 4 Tandem Duplications and Alignment Score Statistics [Messer, Bundschuh, Vingron, Arndt, RECOMB 2006] [Messer, Bundschuh, Vingron, Arndt, JCB 2007]

Alignment Score Statistics Sequence alignment Seq1: ACCTAGTGCTA Significance Seq2: ATCTAGTGATA P-values of scores Requires DNA null model Standard iid model Problem Score distribution in the null model iid model correlated model Correlations in DNA Incorporate into null model P-values change e λs

Gaussian Approximation Analytic approach to calculate alignment score statistics for null models with LRC sequences, i.e. C(r) r -α 2 s λ = σ 2 + c ζ (2 α ) vanishes for iid sequences LRC s increases probability of finding high alignment scores by chance

Numerical Verification Score distribution Decay parameter λ lrc (α=0.5) lrc (α=1.0) iid Gaussian approximation captures qualitative behavior

Biological Significance Alignment of random sequences with same correlation parameters as human chr. 22 λ iid 1.37 λ lrc 1.21 P lrc 3x10-5 P iid 2x10-6 Difference in λ is approx.16 %, p-value increase > 10 x

Summary 1. The majority of short DNA insertions are tandem duplications 2. Amino acid insertions/deletions are less deleterious than substitutions 3. Tandem duplications cause long-range correlations in genomic base composition 4. These correlations have profound impact on the statistics of sequence alignment scores

Acknowledgments Peter Arndt Nicole de la Chaux Paz Polak Federico Squartini Ralf Bundschuh Michael Lässig Martin Vingron