Contents. List of contributors. Preface. Basic concepts of molecular evolution

Nucleotide substitutions are considered a homogeneous process?
What is used to determine phylogenetic inference?


List of contributors
Foreword
Preface

Section I: Introduction

1 Basic concepts of molecular evolution
Anne-Mieke Vandamme
    1.1 Genetic information
    Population dynamics
    Evolution and speciation
    Data used for molecular phylogenetics
    What is a phylogenetic tree?
    Methods for inferring phylogenetic trees
    Is evolution always tree-like?

Section II: Data preparation

2 Sequence databases and database searching
Theory
Guy Bottu
    2.1 Introduction
    Sequence databases
    General nucleic acid sequence databases
    General protein sequence databases
    Specialized sequence databases, reference databases, and genome databases
    Composite databases, database mirroring, and search tools
    Entrez

The Phylogenetic Handbook, 2009. Digitized by: IDS Basel Bern

    Sequence Retrieval System (SRS)
    Some general considerations about database searching by keyword
    Database searching by sequence similarity
    Optimal alignment
    Basic Local Alignment Search Tool (BLAST)
    FASTA
    Other tools and some general considerations
Practice
Marc Van Ranst and Philippe Lemey
    2.5 Database searching using ENTREZ
    BLAST
    FASTA

3 Multiple sequence alignment
Theory
Des Higgins and Philippe Lemey
    Introduction
    The problem of repeats
    The problem of substitutions
    The problem of gaps
    Pairwise sequence alignment
    Dot-matrix sequence comparison
    Dynamic programming
    Multiple alignment algorithms
    Progressive alignment
    Consistency-based scoring
    Iterative refinement methods
    Genetic algorithms
    Hidden Markov models
    Other algorithms
    Testing multiple alignment methods
    Which program to choose?
    Nucleotide sequences vs. amino acid sequences
    3.10 Visualizing alignments and manual editing
Practice
Des Higgins and Philippe Lemey
    3.11 CLUSTAL alignment
    File formats and availability
    Aligning the primate Trim5α amino acid sequences

    3.12 T-COFFEE alignment
    MUSCLE alignment
    Comparing alignments using the ALTAVIST web tool
    From protein to nucleotide alignment
    Editing and viewing multiple alignments
    Databases of alignments

Section III: Phylogenetic inference

4 Genetic distances and nucleotide substitution models
Theory
Korbinian Strimmer and Arndt von Haeseler
    4.1 Introduction
    Observed and expected distances
    Number of mutations in a given time interval * (optional)
    Nucleotide substitutions as a homogeneous Markov process
    The Jukes and Cantor (JC69) model
    Derivation of Markov Process * (optional)
    Inferring the expected distances
    Nucleotide substitution models
    Rate heterogeneity among sites
Practice
Marco Salemi
    4.7 Software packages
    Observed vs. estimated genetic distances: the JC69 model
    Kimura 2-parameters (K80) and F84 genetic distances
    More complex models
    Modeling rate heterogeneity among sites
    Estimating standard errors using MEGA
    The problem of substitution saturation
    Choosing among different evolutionary models

5 Phylogenetic inference based on distance methods
Theory
Yves Van de Peer
    5.1 Introduction
    Tree-inference methods based on genetic distances
    Cluster analysis (UPGMA and WPGMA)
    Minimum evolution and neighbor-joining
    Other distance methods

    5.3 Evaluating the reliability of inferred trees
    Bootstrap analysis
    Jackknifing
    Conclusions
Practice
Marco Salemi
    Programs to display and manipulate phylogenetic trees
    Distance-based phylogenetic inference in PHYLIP
    Inferring a Neighbor-Joining tree for the primates data set
    Outgroup rooting
    Inferring a Fitch-Margoliash tree for the mtDNA data set
    Bootstrap analysis using PHYLIP
    Impact of genetic distances on tree topology: an example using MEGA
    Other programs

6 Phylogenetic inference using maximum likelihood methods
Theory
Heiko A. Schmidt and Arndt von Haeseler
    Introduction
    The formal framework
    The simple case: maximum-likelihood tree for two sequences
    The complex case
    Computing the probability of an alignment for a fixed tree
    Felsenstein's pruning algorithm
    Finding a maximum-likelihood tree
    Early heuristics
    Full-tree rearrangement
    DNAML and FASTDNAML
    PHYML and PHYML-SPR
    IQPNNI
    RAxML
    Simulated annealing
    Genetic algorithms
    Branch support
    The quartet puzzling algorithm
    Parameter estimation
    ML step
    Puzzling step
    Consensus step
    Likelihood-mapping analysis

Practice
Heiko A. Schmidt and Arndt von Haeseler
    6.8 Software packages
    An illustrative example of an ML tree reconstruction
    Reconstructing an ML tree with IQPNNI
    Getting a tree with branch support values using quartet puzzling
    Likelihood-mapping analysis of the HIV data set
    Conclusions

7 Bayesian phylogenetic analysis using MRBAYES
Theory
Fredrik Ronquist, Paul van der Mark, and John P. Huelsenbeck
    7.1 Introduction
    Bayesian phylogenetic inference
    Markov chain Monte Carlo sampling
    Burn-in, mixing and convergence
    Metropolis coupling
    Summarizing the results
    An introduction to phylogenetic models
    Bayesian model choice and model averaging
    Prior probability distributions
Practice
Fredrik Ronquist, Paul van der Mark, and John P. Huelsenbeck
    7.10 Introduction to MRBAYES
    Acquiring and installing the program
    Getting started
    Changing the size of the MRBAYES window
    Getting help
    A simple analysis
    Quick start version
    Getting data into MRBAYES
    Specifying a model
    Setting the priors
    Checking the model
    Setting up the analysis
    Running the analysis
    When to stop the analysis
    Summarizing samples of substitution model parameters
    Summarizing samples of trees and branch lengths

    7.12 Analyzing a partitioned data set
    Getting mixed data into MRBAYES
    Dividing the data into partitions
    Specifying a partitioned model
    Running the analysis
    Some practical advice

8 Phylogeny inference based on parsimony and other methods using PAUP*
Theory
David L. Swofford and Jack Sullivan
    8.1 Introduction
    Parsimony analysis - background
    Parsimony analysis - methodology
    Calculating the length of a given tree under the parsimony criterion
    Searching for optimal trees
    Exact methods
    Approximate methods
Practice
David L. Swofford and Jack Sullivan
    8.5 Analyzing data with PAUP* through the command-line interface
    Basic parsimony analysis and tree-searching
    Analysis using distance methods
    Analysis using maximum likelihood methods

9 Phylogenetic analysis using protein sequences
Theory
Fred R. Opperdoes
    9.1 Introduction
    Protein evolution
    Why analyze protein sequences?
    The genetic code and codon bias
    Look-back time
    Nature of sequence divergence in proteins (the PAM unit)
    Introns and non-coding DNA
    Choosing DNA or protein?
    Construction of phylogenetic trees
    Preparation of the data set
    Tree-building

Practice
Fred R. Opperdoes and Philippe Lemey
    9.4 A phylogenetic analysis of the Leishmanial glyceraldehyde-3-phosphate dehydrogenase gene carried out via the Internet
    A phylogenetic analysis of trypanosomatid glyceraldehyde-3-phosphate dehydrogenase protein sequences using Bayesian inference

Section IV: Testing models and trees

10 Selecting models of evolution
Theory
David Posada
    Models of evolution and phylogeny reconstruction
    Model fit
    Hierarchical likelihood ratio tests (hLRTs)
    Potential problems with the hLRTs
    Information criteria
    Bayesian approaches
    Performance-based selection
    Model selection uncertainty
    Model averaging
Practice
David Posada
    The model selection procedure
    MODELTEST
    PROTTEST
    Selecting the best-fit model in the example data sets
    Vertebrate mtDNA
    HIV-1 envelope gene
    G3PDH protein

11 Molecular clock analysis
Theory
Philippe Lemey and David Posada
    11.1 Introduction
    The relative rate test

    11.3 Likelihood ratio test of the global molecular clock
    Dated tips
    Relaxing the molecular clock
    Discussion and future directions
Practice
Philippe Lemey and David Posada
    11.7 Molecular clock analysis using PAML
    Analysis of the primate sequences
    Analysis of the viral sequences

12 Testing tree topologies
Theory
Heiko A. Schmidt
    12.1 Introduction
    Some definitions for distributions and testing
    Likelihood ratio tests for nested models
    How to get the distribution of likelihood ratios
    Non-parametric bootstrap
    Parametric bootstrap
    Testing tree topologies
    Tree tests - a general structure
    The original Kishino-Hasegawa (KH) test
    One-sided Kishino-Hasegawa test
    Shimodaira-Hasegawa (SH) test
    Weighted test variants
    The approximately unbiased test
    Swofford-Olsen-Waddell-Hillis (SOWH) test
    Confidence sets based on likelihood weights
    Conclusions
Practice
Heiko A. Schmidt
    12.8 Software packages
    Testing a set of trees with TREE-PUZZLE and CONSEL
    Testing and obtaining site-likelihood with TREE-PUZZLE
    Testing with CONSEL
    Conclusions

Section V: Molecular adaptation

13 Natural selection and adaptation of molecular sequences
Oliver G. Pybus and Beth Shapiro
    13.1 Basic concepts
    The molecular footprint of selection
    Summary statistic methods
    dN/dS methods
    Codon volatility
    Conclusion

14 Estimating selection pressures on alignments of coding sequences
Theory
Sergei L. Kosakovsky Pond, Art F. Y. Poon, and Simon D. W. Frost
    14.1 Introduction
    Prerequisites
    Codon substitution models
    Simulated data: how and why?
    Statistical estimation procedures
    Distance-based approaches
    Maximum likelihood approaches
    Estimating dS and dN
    Correcting for nucleotide substitution biases
    Bayesian approaches
    Estimating branch-by-branch variation in rates
    Local vs. global model
    Specifying branches a priori
    Data-driven branch selection
    Estimating site-by-site variation in rates
    Random effects likelihood (REL)
    Fixed effects likelihood (FEL)
    Counting methods
    Which method to use?
    The importance of synonymous rate variation
    Comparing rates at a site in different branches
    Discussion and further directions
Practice
Sergei L. Kosakovsky Pond, Art F. Y. Poon, and Simon D. W. Frost
    Software for estimating selection
    PAML
    ADAPTSITE

    MEGA
    HYPHY
    DATAMONKEY
    Influenza A as a case study
    Prerequisites
    Getting acquainted with HYPHY
    Importing alignments and trees
    Previewing sequences in HYPHY
    Previewing trees in HYPHY
    Making an alignment
    Estimating a tree
    Estimating nucleotide biases
    Detecting recombination
    Estimating global rates
    Fitting a global model in the HYPHY GUI
    Fitting a global model with a HYPHY batch file
    Estimating branch-by-branch variation in rates
    Fitting a local codon model in HYPHY
    Interclade variation in substitution rates
    Comparing internal and terminal branches
    Estimating site-by-site variation in rates
    Preliminary analysis set-up
    Estimating filot Ml
    Single-likelihood ancestor counting (SLAC)
    Fixed effects likelihood (FEL)
    REL methods in HYPHY
    Estimating gene-by-gene variation in rates
    Comparing selection in different populations
    Comparing selection between different genes
    Automating choices for HYPHY analyses
    Simulations
    Summary of standard analyses
    Discussion

Section VI: Recombination

15 Introduction to recombination detection
Philippe Lemey and David Posada
    15.1 Introduction
    Mechanisms of recombination

    15.3 Linkage disequilibrium, substitution patterns, and evolutionary inference
    Evolutionary implications of recombination
    Impact on phylogenetic analyses
    Recombination analysis as a multifaceted discipline
    Detecting recombination
    Recombinant identification and breakpoint detection
    Recombination rate
    Overview of recombination detection tools
    Performance of recombination detection tools

16 Detecting and characterizing individual recombination events
Theory
Mika Salminen and Darren Martin
    16.1 Introduction
    Requirements for detecting recombination
    Theoretical basis for recombination detection methods
    Identifying and characterizing actual recombination events
Practice
Mika Salminen and Darren Martin
    16.5 Existing tools for recombination analysis
    Analyzing example sequences to detect and characterize individual recombination events
    Exercise 1: Working with SIMPLOT
    Exercise 2: Mapping recombination with SIMPLOT
    Exercise 3: Using the "groups" feature of SIMPLOT
    Exercise 4: Setting up RDP3 to do an exploratory analysis
    Exercise 5: Doing a simple exploratory analysis with RDP
    Exercise 6: Using RDP3 to refine a recombination hypothesis

Section VII: Population genetics

17 The coalescent: population genetic inference using genealogies
Allen Rodrigo
    17.1 Introduction
    The Kingman coalescent
    Effective population size

    17.4 The mutation clock
    Demographic history and the coalescent
    Coalescent-based inference
    The serial coalescent
    Advanced topics

18 Bayesian evolutionary analysis by sampling trees
Theory
Alexei J. Drummond and Andrew Rambaut
    18.1 Background
    Bayesian MCMC for genealogy-based population genetics
    Implementation
    Input format
    Output and results
    Computational performance
    Results and discussion
    Substitution models and rate models among sites
    Rate models among branches, divergence time estimation, and time-stamped data
    Tree priors
    Multiple data partitions and linking and unlinking parameters
    Definitions and units of the standard parameters and variables
    Model comparison
    Conclusions
Practice
Alexei J. Drummond and Andrew Rambaut
    18.4 The BEAST software package
    Running BEAUTI
    Loading the NEXUS file
    Setting the dates of the taxa
    Translating the data in amino acid sequences
    Setting the evolutionary model
    Setting up the operators
    Setting the MCMC options
    Running BEAST
    Analyzing the BEAST output
    Summarizing the trees
    Viewing the annotated tree
    Conclusion and resources

19 LAMARC: Estimating population genetic parameters from molecular data
Theory
Mary K. Kuhner
    19.1 Introduction
    Basis of the Metropolis-Hastings MCMC sampler
    Bayesian vs. likelihood sampling
    Random sample
    Stability
    No other forces
    Evolutionary model
    Large population relative to sample
    Adequate run time
Practice
Mary K. Kuhner
    19.3 The LAMARC software package
    FLUCTUATE (COALESCE)
    MIGRATE-N
    RECOMBINE
    LAMARC
    Starting values
    Space and time
    Sample size considerations
    Virus-specific issues
    Multiple loci
    Rapid growth rates
    Sequential samples
    An exercise with LAMARC
    Converting data using the LAMARC file converter
    Estimating the population parameters
    Analyzing the output
    Conclusions

Section VIII: Additional topics

20 Assessing substitution saturation with DAMBE
Theory
Xuhua Xia
    20.1 The problem of substitution saturation
    Steel's method: potential problem, limitation, and implementation in DAMBE

    20.3 Xia's method: its problem, limitation, and implementation in DAMBE
Practice
Xuhua Xia and Philippe Lemey
    20.4 Working with the VertebrateMtCOI.FAS file
    Working with the InvertebrateEF1a.fas file
    Working with the SIV.FAS file

21 Split networks. A tool for exploring complex evolutionary relationships in molecular data
Theory
Vincent Moulton and Katharina T. Huber
    21.1 Understanding evolutionary relationships through networks
    An introduction to split decomposition theory
    The Buneman tree
    Split decomposition
    From weakly compatible splits to networks
    Alternative ways to compute split networks
    NeighborNet
    Median networks
    Consensus networks and supernetworks
Practice
Vincent Moulton and Katharina T. Huber
    21.5 The SPLITSTREE program
    Introduction
    Downloading SPLITSTREE
    Using SPLITSTREE on the mtDNA data set
    Getting started
    The fit index
    Laying out split networks
    Recomputing split networks
    Computing trees
    Computing different networks
    Bootstrapping
    Printing
    Using SPLITSTREE on other data sets

Glossary
References
Index
