An Introduction to Phylogenetics


1 An Introduction to Phylogenetics Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison February 4, / 70

2 Phylogenetics and Darwin
A phylogeny is a tree diagram that shows the evolutionary relationships among a group of species. The first phylogeny is due to Charles Darwin. In 1837, shortly after his famous five-year voyage as naturalist on the Beagle, Darwin sketched a tree diagram in one of his notebooks. This simple sketch is remarkably similar to modern diagrams of phylogenies. In addition, the sole figure in The Origin of Species is a phylogeny.

3 Darwin's Trees
[Figures: Darwin's 1837 sketch; the figure from The Origin of Species.]

4 Tree of Life
In The Origin of Species, Darwin describes a Tree of Life that represents the true evolutionary history of life.
"The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. ... The limbs divided into great branches, and these into lesser and lesser branches, were themselves once, when the tree was small, budding twigs; and this connexion of the former and present buds by ramifying branches may well represent the classification of all extinct and living species in groups subordinate to groups. ... As buds give rise by growth to fresh buds, and these, if vigorous, branch out and overtop on all sides many a feebler branch, so by generation I believe it has been with the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications."

5 Early Phylogenetics
Shortly after the 1859 publication of The Origin of Species, many biologists came to accept the truth of a universal Tree of Life. Ernst Haeckel and many others created highly stylized trees that were based on expert opinion. A century passed before development of formal scientific methods for estimating phylogenies began.

6 Haeckel's Stylized Trees
[Figure: stylized trees drawn by Haeckel.]

7 Modern Phylogenetics
Phylogenies are usually estimated from aligned DNA sequence data. Phylogenetics is the primary tool for systematics. Phylogenetics is used for studying viruses such as HIV. Phylogenetics has been used in court for forensic purposes. Phylogenetics is being used increasingly in comparative genomics and the study of gene function.

8 Phylogenetics and Systematics
Phylogenetic methods, particularly for molecular sequence data, have become the primary tool for systematists to determine evolutionary relationships. These tools have been used to confirm expected relationships (for example, that chimpanzees are the closest living relatives of humans) and have also been key in revealing several more surprising findings, including: birds are descended from dinosaurs; polar bears form a monophyletic group within brown bears; the land mammal most closely related to whales is the hippopotamus.

9 Phylogenetic Tree of Whales
[Figure: phylogenetic tree of whales; image not preserved in this transcription.]

10 Phylogenetics and Forensics
Phylogenetic trees have been used in several instances in the courts to provide evidence about the likely transmission of HIV. Examples include: confirming that a nurse contracted HIV from a mishap with a broken glass blood collection tube from an infected patient and not from an alternative source; providing evidence of deliberate infection in a criminal case; indicating that an infected friend was likely not the direct source of infection in a case.

11 Forensic Phylogenetic Tree
[Figure: neighbor-joining tree reproduced from a review of HIV forensics; tip labels and image not preserved in this transcription.]
Figure 2. Neighbor-joining phylogram representing the reconstruction of the phylogenetic relationships between the env (C2-V5) sequences obtained from the index case (A31-44), the alleged recipient (B22-29), three local controls (LC45 and LC48; LC46 and LC47; and LC49 and LC50) and 48 sequences chosen from GenBank. Ten iterations of random sequence addition were used. Scale bar represents 10% genetic distance. Bootstrap values are shown at nodes with greater than 70% support.

12 DNA Data from a Sample of Birds
First 24 bases of 1558 from Cox I gene.
Alligator   GTG AAC TTC CAC --- CGT TGA CTC ...
Emu         GTG ACA TTC ATT ACT CGA TGA TTT ...
Kiwi        GTG ACC TTT ACT ACT CGA TGA CTC ...
Ostrich     GTG ACC TTC ATT ACT CGA TGA CTT ...
Swan        GTG ACC TTC ATC AAC CGA TGA CTA ...
Goose       GTG ACC TTC ATC AAC CGA TGA CTA ...
Chicken     GTG ACC TTC ATC AAC CGA TGA TTA ...
Woodpecker  GTG ACC TTC ATC AAC CGA TGA TTA ...
Finch       ATG ACA TAC ATT AAC CGA TGA TTA ...
Ibis        GTG ACC TTC ATC AAC CGA TGA CTA ...
Stork       GTG ACC TTC ATT ACC CGA TGA CTA ...
Osprey      ATG ACA TTC ATC AAC CGA TGA CTA ...
Falcon      GTG ACC TTC ATC AAC CGA TGA CTA ...
Vulture     ATG ACA TTC ATC AAT CGA TGA CTA ...
Penguin     GTG ACC TTC ATT AAC CGA TGA CTA ...

13 An Estimated Phylogeny
[Figure: estimated tree; tip labels in the order drawn: Penguin, Vulture, Stork, Ibis, Woodpecker, Osprey, Finch, Falcon, Chicken, Goose, Swan, Ostrich, Kiwi, Emu, Alligator.]

14 Activity 1: Example Tree
How many descendant taxa does the common ancestor of taxa A and C have? Which taxon is sister to A? Which taxa are more closely related, A and C or C and D? Which taxa are more closely related, A and E or D and E?
[Figure: example tree with tips F, E, D, C, B, A.]

15 Activity 2: Compare Trees
Which trees have the same tree topology?
[Figure: six trees, with tips drawn in the orders F E D C B A; E F D A B C; E F D A B C; F E D C B A; B A C D E F; E F D A B C.]

16 Activity 3: Unrooted Trees
Some methods estimate unrooted trees. If C is the outgroup, what is the rooted tree topology? If taxon C is the outgroup, which node is sister to B? If taxon A is the outgroup, which node is sister to B? How many rooted tree topologies are consistent with this unrooted tree topology?
[Figure: unrooted tree with tips A, B, C, D, E.]

17 How Many Trees?
[Table with columns "# of Taxa", "# Unrooted Trees", "# Rooted Trees"; entries not preserved in this transcription.]

18 Formula for Counting Trees
The number of rooted tree topologies with $n$ taxa is $1 \times 3 \times \cdots \times (2n-3) = (2n-3)!!$ for $n \ge 3$. There are more rooted trees with 51 species than the estimated number of hydrogen atoms in the universe. Biologists often estimate trees with more than 100 species.
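The counting formula is easy to evaluate directly. The sketch below computes the rooted count $(2n-3)!!$ from the slide; the unrooted count $(2n-5)!!$ is the standard companion formula and is an addition here, as are the function names.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * ... down to 1 or 2 (1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_rooted(n: int) -> int:
    """Number of rooted binary tree topologies on n labeled taxa: (2n - 3)!!."""
    return double_factorial(2 * n - 3)


def n_unrooted(n: int) -> int:
    """Number of unrooted binary tree topologies on n >= 3 labeled taxa: (2n - 5)!!."""
    return double_factorial(2 * n - 5)


# Print a small table: taxa, unrooted trees, rooted trees.
for n in (3, 4, 5, 10, 51):
    print(n, n_unrooted(n), n_rooted(n))
```

Python integers have arbitrary precision, so the count for 51 taxa (a 79-digit number) is exact.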

19 Probabilistic Framework
"Essentially, all models are wrong, but some are useful." (George Box)
Commonly used models of molecular evolution treat sites as independent. These common models just need to describe the substitutions among four bases A, C, G, and T at a single site over time. The substitution process is modeled as a continuous-time Markov chain.

20 Markov Property
Use the notation $X(t)$ to represent the base at time $t$, with $X(t) \in \{A, C, G, T\}$ for DNA. Formal statement:
$$P\{X(s+t) = j \mid X(s) = i,\ X(u) = x(u) \text{ for } u < s\} = P\{X(s+t) = j \mid X(s) = i\}$$
Informal understanding: given the present, the past is independent of the future. If the expression does not depend on the time $s$, the Markov process is called homogeneous.

21 Rate Matrix
A stationary, homogeneous, continuous-time, finite-state-space Markov chain is parameterized by a rate matrix where: off-diagonal rates are nonnegative; diagonal terms are negative row sums of off-diagonal elements; consequently, row sums are zero.
Example: $Q = \{q_{ij}\}$ = [matrix entries not preserved in this transcription]

22 Alarm Clock Interpretation
How to simulate a continuous-time Markov chain beginning in state $i$: the time to the next transition is distributed $\mathrm{Exp}(q_i)$ where $q_i = -q_{ii}$; the transition is to state $j$ with probability $q_{ij} / \sum_{k \ne i} q_{ik}$.
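The alarm-clock recipe translates almost line for line into a simulation. In the sketch below, `Q` is a hypothetical rate matrix (rows and columns ordered A, C, G, T) chosen to agree with the rates and eigenvalues quoted elsewhere in these slides; the slides' own example matrix is not preserved in this transcription. The function name `simulate_path` is also an assumption.

```python
import random

STATES = "ACGT"
# Hypothetical rate matrix (order A, C, G, T); each row sums to zero.
Q = [
    [-1.1, 0.3, 0.6, 0.2],
    [0.2, -1.1, 0.3, 0.6],
    [0.4, 0.3, -0.9, 0.2],
    [0.2, 0.9, 0.3, -1.4],
]


def simulate_path(start, t_end, rng):
    """Simulate one path on [0, t_end]; return the list of (time, state) entries."""
    i = STATES.index(start)
    t = 0.0
    path = [(0.0, start)]
    while True:
        rate_out = -Q[i][i]                 # q_i = -q_ii
        t += rng.expovariate(rate_out)      # waiting time ~ Exp(q_i)
        if t >= t_end:
            return path
        # Jump to j != i with probability q_ij / sum_{k != i} q_ik.
        others = [j for j in range(4) if j != i]
        weights = [Q[i][j] for j in others]
        i = rng.choices(others, weights=weights)[0]
        path.append((t, STATES[i]))


print(simulate_path("A", 1.0, random.Random(1)))
```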

23 Path Probability Density Calculation
Example: Begin at A, change to G at time 0.3, change to C at time 0.8, and then no more changes before time $t = 1$.
$$P\{\text{path}\} = P\{\text{begin at A}\} \times \left(1.1\, e^{-(1.1)(0.3)} \times \frac{0.6}{1.1}\right) \times \left(0.9\, e^{-(0.9)(0.5)} \times \frac{0.3}{0.9}\right) \times e^{-(1.1)(0.2)}$$
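The same calculation can be checked numerically. This sketch evaluates the density conditional on the starting state (the factor $P\{\text{begin at A}\}$ is left out). The matrix `Q` is a hypothetical one, ordered A, C, G, T; only the entries that appear in the slide's expression ($q_A = 1.1$, $q_{AG} = 0.6$, $q_G = 0.9$, $q_{GC} = 0.3$, $q_C = 1.1$) are actually used.

```python
import math

STATES = "ACGT"
# Hypothetical rate matrix (order A, C, G, T), consistent with the rates in the example.
Q = [
    [-1.1, 0.3, 0.6, 0.2],
    [0.2, -1.1, 0.3, 0.6],
    [0.4, 0.3, -0.9, 0.2],
    [0.2, 0.9, 0.3, -1.4],
]


def path_density(states, jump_times, t_end):
    """Density of visiting `states` in order with jumps at `jump_times`,
    conditional on the first state, with no further jump before t_end."""
    density = 1.0
    t_prev = 0.0
    for k, t_jump in enumerate(jump_times):
        i = STATES.index(states[k])
        j = STATES.index(states[k + 1])
        q_i = -Q[i][i]
        hold = t_jump - t_prev
        # Exponential waiting-time density times the jump probability q_ij / q_i.
        density *= q_i * math.exp(-q_i * hold) * (Q[i][j] / q_i)
        t_prev = t_jump
    last = STATES.index(states[-1])
    # Probability of no jump during the final interval.
    density *= math.exp(Q[last][last] * (t_end - t_prev))
    return density


print(path_density("AGC", [0.3, 0.8], 1.0))
```

The exponents sum to 0.33 + 0.45 + 0.22 = 1, so the result simplifies to $0.6 \times 0.3 \times e^{-1} \approx 0.0662$.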

24 Probability Transition Matrices
The transition matrix is $P(t) = e^{Qt}$ where
$$e^{A} = \sum_{k=0}^{\infty} \frac{A^k}{k!} = I + A + \frac{A^2}{2} + \frac{A^3}{6} + \cdots$$
A probability transition matrix has non-negative values and each row sums to one. Each row contains the probabilities from a probability distribution on the possible states of the Markov process.
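A minimal sketch of the matrix exponential, using only the standard library: the series is summed for a scaled-down matrix and the result is squared repeatedly, which keeps the truncated series accurate. The rate matrix `Q` (order A, C, G, T) is a hypothetical example, not the one from the slides, and the function names are assumptions. Production code would normally call a tested routine from a numerical library instead.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(a, terms=30, squarings=10):
    """exp(A): truncated power series for A / 2**squarings, then repeated squaring."""
    n = len(a)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]  # small^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(q, t):
    """P(t) = exp(Q t)."""
    return expm([[x * t for x in row] for row in q])


# Hypothetical rate matrix (order A, C, G, T).
Q = [
    [-1.1, 0.3, 0.6, 0.2],
    [0.2, -1.1, 0.3, 0.6],
    [0.4, 0.3, -0.9, 0.2],
    [0.2, 0.9, 0.3, -1.4],
]

for t in (0.1, 0.5, 1.0, 10.0):
    print(t, [[round(x, 4) for x in row] for row in transition_matrix(Q, t)])
```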

25 Examples
$P(0.1)$, $P(0.5)$, $P(1)$, $P(10)$ [numerical matrices not preserved in this transcription]

26 Spectral Decomposition
The matrix $Q$ can be factored as $Q = V \Lambda V^{-1}$ where $\Lambda$ is a diagonal matrix of the eigenvalues and $V$ is the matrix whose columns are corresponding eigenvectors. All rate matrices $Q$ will have an eigenvalue 0 with an eigenvector of all 1s, as the rows sum to 0 by construction. Our example rate matrix $Q$ has eigenvalues $0$, $-1$, $-1.5$, and $-2$. The probability transition matrix is of the form $P(t) = V e^{\Lambda t} V^{-1}$. This means that each probability can be written as a linear combination of exponential functions of the product of the time $t$ and an eigenvalue: each entry of $P(t)$ has the form $\sum_i w_i e^{\lambda_i t}$.
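One way to check a claimed set of eigenvalues without a linear algebra library is to confirm that $\det(Q - \lambda I) = 0$ for each of them. The matrix `Q` below (order A, C, G, T) is a hypothetical rate matrix chosen to have the eigenvalues quoted above; the slides' own matrix is not preserved, and the helper names are assumptions.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-14:
            return 0.0                      # numerically singular
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d                          # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def char_poly(q, lam):
    """det(Q - lam * I), which is zero exactly when lam is an eigenvalue of Q."""
    n = len(q)
    return det([[q[i][j] - (lam if i == j else 0.0) for j in range(n)]
                for i in range(n)])


# Hypothetical rate matrix (order A, C, G, T).
Q = [
    [-1.1, 0.3, 0.6, 0.2],
    [0.2, -1.1, 0.3, 0.6],
    [0.4, 0.3, -0.9, 0.2],
    [0.2, 0.9, 0.3, -1.4],
]

for lam in (0.0, -1.0, -1.5, -2.0):
    print(lam, char_poly(Q, lam))
```

Because the nonzero eigenvalues are all negative, every exponential term except the constant one decays, which is why $P(t)$ settles down as $t$ grows.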

27 Numerical Example
$Q = V \Lambda V^{-1}$ [numerical matrices not preserved in this transcription]

28 Stationary Distribution
Well-behaved continuous-time Markov chains have a stationary distribution $\pi$. (For finite-state-space chains, irreducibility is sufficient.) When the time $t$ is large enough, the probability $P_{ij}(t)$ will be close to $\pi_j$ for each $i$. (See P(10) from earlier.) The stationary distribution can be thought of as a long-run average: the proportion of time the chain spends in state $i$ converges to $\pi_i$. The stationary distribution satisfies $\pi Q = 0$. Also, $\pi P(t) = \pi$ for any time $t$.
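The stationary distribution can be found numerically by iterating a discrete-time chain that shares it. The sketch below uses the stochastic matrix $M = I + Q/c$ with $c$ at least the largest exit rate (a standard device called uniformization, which is an addition here and not part of the slides), then confirms $\pi Q = 0$. `Q` is a hypothetical rate matrix ordered A, C, G, T.

```python
def stationary(q, iters=200):
    """Approximate pi with pi Q = 0 by iterating pi <- pi (I + Q / c)."""
    n = len(q)
    c = 2.0 * max(-q[i][i] for i in range(n))     # any c >= max exit rate works
    m = [[(1.0 if i == j else 0.0) + q[i][j] / c for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n                            # start from the uniform distribution
    for _ in range(iters):
        pi = [sum(pi[i] * m[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical rate matrix (order A, C, G, T).
Q = [
    [-1.1, 0.3, 0.6, 0.2],
    [0.2, -1.1, 0.3, 0.6],
    [0.4, 0.3, -0.9, 0.2],
    [0.2, 0.9, 0.3, -1.4],
]

pi = stationary(Q)
pi_q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]
print(pi)     # stationary distribution
print(pi_q)   # should be numerically zero in every position
```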

29 Numerical Example
$\pi Q = 0$ [numerical vector and matrix not preserved in this transcription]

30 Usual Parameterization
The matrix $Q = \{q_{ij}\}$ is typically scaled and parameterized as
$$q_{ij} = r_{ij}\pi_j/\mu \quad \text{for } i \ne j, \qquad \text{where } \mu = \sum_i \pi_i \sum_{j \ne i} r_{ij}\pi_j,$$
which guarantees that $\pi$ will be the stationary distribution when $r_{ij} = r_{ji}$. With this scaling, there is one expected transition per unit time.

31 Time-reversibility
A continuous-time Markov chain is time-reversible if the probability of a sequence of events is the same going forward as it is going backwards. The matrix $Q$ is the matrix for a time-reversible Markov chain when $\pi_i q_{ij} = \pi_j q_{ji}$ for all $i$ and $j$. That is, the overall rate of substitutions from $i$ to $j$ equals the overall rate of substitutions from $j$ to $i$ for every pair of states $i$ and $j$. The matrix equivalent is $\Pi Q = Q^{\top} \Pi$ where $\Pi = \mathrm{diag}(\pi)$.
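The detailed-balance condition $\pi_i q_{ij} = \pi_j q_{ji}$ is simple to test for a given matrix. In this sketch, `Q` and `PI` are a hypothetical reversible example (order A, C, G, T), and `Q_CYCLE` is a made-up three-state chain that moves only in one direction around a cycle, included as a counterexample.

```python
def is_time_reversible(q, pi, tol=1e-12):
    """True when pi_i * q_ij == pi_j * q_ji (within tol) for every pair of states."""
    n = len(q)
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
               for i in range(n) for j in range(n))


# Hypothetical reversible rate matrix and its stationary distribution.
Q = [
    [-1.1, 0.3, 0.6, 0.2],
    [0.2, -1.1, 0.3, 0.6],
    [0.4, 0.3, -0.9, 0.2],
    [0.2, 0.9, 0.3, -1.4],
]
PI = [0.2, 0.3, 0.3, 0.2]

# A one-way cycle 0 -> 1 -> 2 -> 0: stationary (uniform) but not reversible.
Q_CYCLE = [
    [-1.0, 1.0, 0.0],
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
]

print(is_time_reversible(Q, PI))
print(is_time_reversible(Q_CYCLE, [1 / 3, 1 / 3, 1 / 3]))
```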

32 General Time-Reversible Model
The GTR model is the most general basic time-reversible continuous-time Markov model for nucleotide substitution. The model is typically parameterized with 8 free parameters, where
$$q_{ij} = \begin{cases} r_{ij}\pi_j/\mu & \text{for } i \ne j \\ -\sum_{k \ne i} q_{ik} & \text{for } i = j \end{cases}$$
with $\mu = \sum_i \pi_i \sum_{j \ne i} r_{ij}\pi_j$. The stationary distribution $\pi$ has three free parameters, as $\pi$ sums to one. The vector $r = (r_{AC}, r_{AG}, \ldots, r_{GT})$ is usually constrained to five degrees of freedom (either by setting $r_{GT} = 1$ or by constraining the sum). Many other popular models are special cases. These models are often named by the initials of the authors and the year in which they were published.
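The parameterization above can be turned into a small constructor for $Q$. This is a sketch under stated assumptions: the function name, the dictionary-based inputs, and the numerical values of `r` and `pi` (an HKY85-style choice with $\kappa = 2$) are all illustrative.

```python
from itertools import combinations

BASES = "ACGT"


def gtr_rate_matrix(r, pi):
    """Build the scaled GTR matrix.
    r:  exchangeabilities keyed by unordered pair, e.g. r["AG"]
    pi: stationary frequencies keyed by base, e.g. pi["A"]
    Returns a 4x4 list of rows ordered A, C, G, T."""
    raw = {a: {b: 0.0 for b in BASES} for a in BASES}
    for a, b in combinations(BASES, 2):
        raw[a][b] = r[a + b] * pi[b]          # r_ij * pi_j
        raw[b][a] = r[a + b] * pi[a]          # r_ji * pi_i with r_ji = r_ij
    # mu = sum_i pi_i sum_{j != i} r_ij pi_j, so the expected rate becomes 1.
    mu = sum(pi[a] * sum(raw[a][b] for b in BASES if b != a) for a in BASES)
    q = [[raw[a][b] / mu for b in BASES] for a in BASES]
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


# Illustrative inputs: transitions (AG, CT) twice as exchangeable as transversions.
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
r = {"AC": 1.0, "AG": 2.0, "AT": 1.0, "CG": 1.0, "CT": 2.0, "GT": 1.0}
q = gtr_rate_matrix(r, pi)
for row in q:
    print([round(x, 4) for x in row])
```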

33 Other Common Models
Long Name        Short Name  π        r
Jukes-Cantor     JC69        uniform  $r_{AC} = r_{AG} = r_{AT} = r_{CG} = r_{CT} = r_{GT}$
Kimura 80        K80         uniform  $r_{AG} = r_{CT}$; $r_{AC} = r_{AT} = r_{CG} = r_{GT}$
Felsenstein 81   F81         free     $r_{AC} = r_{AG} = r_{AT} = r_{CG} = r_{CT} = r_{GT}$
Felsenstein 84   F84         free     $r_{AC} = r_{AT} = r_{CG} = r_{GT}$; $r_{AG} = (1 + \kappa/(\pi_A + \pi_G))\, r_{AC}$; $r_{CT} = (1 + \kappa/(\pi_C + \pi_T))\, r_{AC}$
Hasegawa et al.  HKY85       free     $r_{AC} = r_{AT} = r_{CG} = r_{GT}$; $r_{AG} = r_{CT} = \kappa\, r_{AC}$
Tamura-Nei 93    TN93        free     $r_{AC} = r_{AT} = r_{CG} = r_{GT}$; $r_{AG} = \kappa_1 r_{AC}$; $r_{CT} = \kappa_2 r_{AC}$

34 Transition Probabilities
There are closed-form solutions for the probability transition matrices of each of the previous models except GTR. All but GTR are special cases of Tamura-Nei.
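As the simplest instance of such a closed form, the Jukes-Cantor model (scaled to one expected substitution per unit time) has only two distinct transition probabilities: $P_{ii}(t) = \tfrac14 + \tfrac34 e^{-4t/3}$ and $P_{ij}(t) = \tfrac14 - \tfrac14 e^{-4t/3}$ for $i \ne j$. These formulas are standard results and not taken from the slides; the function name is an assumption.

```python
import math


def jc69_transition(t):
    """Return (p_same, p_diff) for JC69 with one expected substitution per unit time."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


for t in (0.0, 0.1, 1.0, 10.0):
    p_same, p_diff = jc69_transition(t)
    # Each row of P(t) holds one p_same and three p_diff entries.
    print(t, round(p_same, 6), round(p_diff, 6), p_same + 3 * p_diff)
```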

35 Tamura-Nei Model
The rate matrix for TN93 (rows and columns ordered A, C, G, T) is:
$$Q = \mu^{-1} \begin{pmatrix} -(\kappa_R \pi_G + \pi_Y) & \pi_C & \kappa_R \pi_G & \pi_T \\ \pi_A & -(\kappa_Y \pi_T + \pi_R) & \pi_G & \kappa_Y \pi_T \\ \kappa_R \pi_A & \pi_C & -(\kappa_R \pi_A + \pi_Y) & \pi_T \\ \pi_A & \kappa_Y \pi_C & \pi_G & -(\kappa_Y \pi_C + \pi_R) \end{pmatrix}$$
where $\pi_R = \pi_A + \pi_G$; $\pi_Y = \pi_C + \pi_T$; $\mu = 2(\kappa_R \pi_A \pi_G + \kappa_Y \pi_C \pi_T + \pi_R \pi_Y)$.

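36 Tamura-Nei Model
The transition probabilities for TN93 (rows and columns ordered A, C, G, T) are
$$P(t) = \begin{pmatrix}
\pi_A + \pi_A \frac{\pi_Y}{\pi_R}\beta_2 + \frac{\pi_G}{\pi_R}\beta_3 & \pi_C(1 - \beta_2) & \pi_G + \pi_G \frac{\pi_Y}{\pi_R}\beta_2 - \frac{\pi_G}{\pi_R}\beta_3 & \pi_T(1 - \beta_2) \\
\pi_A(1 - \beta_2) & \pi_C + \pi_C \frac{\pi_R}{\pi_Y}\beta_2 + \frac{\pi_T}{\pi_Y}\beta_4 & \pi_G(1 - \beta_2) & \pi_T + \pi_T \frac{\pi_R}{\pi_Y}\beta_2 - \frac{\pi_T}{\pi_Y}\beta_4 \\
\pi_A + \pi_A \frac{\pi_Y}{\pi_R}\beta_2 - \frac{\pi_A}{\pi_R}\beta_3 & \pi_C(1 - \beta_2) & \pi_G + \pi_G \frac{\pi_Y}{\pi_R}\beta_2 + \frac{\pi_A}{\pi_R}\beta_3 & \pi_T(1 - \beta_2) \\
\pi_A(1 - \beta_2) & \pi_C + \pi_C \frac{\pi_R}{\pi_Y}\beta_2 - \frac{\pi_C}{\pi_Y}\beta_4 & \pi_G(1 - \beta_2) & \pi_T + \pi_T \frac{\pi_R}{\pi_Y}\beta_2 + \frac{\pi_C}{\pi_Y}\beta_4
\end{pmatrix}$$
where $\beta_2 = \exp(-t/\mu)$; $\beta_3 = \exp(-(\pi_R \kappa_1 + \pi_Y)t/\mu)$; $\beta_4 = \exp(-(\pi_Y \kappa_2 + \pi_R)t/\mu)$. Here $\kappa_1$ and $\kappa_2$ are the purine and pyrimidine transition parameters, written $\kappa_R$ and $\kappa_Y$ in the rate matrix.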

37 Rate Variation Among Sites
A common extension to the standard CTMC models is to assume that there is rate variation among sites: at each site, the $Q$ matrix is multiplied by a site-specific rate. The two most popular extensions are:
Invariant sites: some sites have rate 0.
Gamma-distributed rates: rates are drawn from a mean 1 gamma distribution.
For computational tractability, the Gamma distribution is typically replaced by a mean 1 discrete distribution with four distinct rates based on quantiles of a Gamma distribution.
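To give a feel for the four-rate discretization, the sketch below approximates it by simulation: it draws from a mean 1 gamma distribution, splits the sorted draws into four equal-probability groups, and uses each group's mean as that category's rate. This is a swapped-in Monte Carlo shortcut that stays within the standard library; phylogenetics software computes the category rates from gamma quantiles numerically instead. The function name, the shape value 0.5, and the sample size are assumptions.

```python
import random


def discrete_gamma_rates(alpha, k=4, n=100_000, seed=0):
    """Approximate k equal-probability category rates for a Gamma(alpha) with mean 1."""
    rng = random.Random(seed)
    # gammavariate(shape, scale) has mean shape * scale, so scale = 1 / alpha gives mean 1.
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n))
    size = n // k
    means = [sum(draws[c * size:(c + 1) * size]) / size for c in range(k)]
    overall = sum(means) / k
    return [m / overall for m in means]   # rescale so the category mean is exactly 1


print(discrete_gamma_rates(0.5))
```

With a small shape parameter most of the probability sits near zero, so three of the four categories evolve slowly and one evolves much faster than average.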

38 Other Extensions
There are many other model extensions in common use and under development. It is common to partition sites (by gene, by codon position, by genomic location) and to use different models for each part. The covarion model allows different lineages to have different rates at the same site. This is typically modeled with a hidden Markov model where the site can turn off. There are models for amino acid substitution, models for codons, models for RNA pairs, models that incorporate protein structure information, and so on. Current models still do not capture many of the important biological processes that affect evolution of molecular sequences.

39 Distance Between Pairs of Taxa
In a two-taxon tree, the distance between two taxa can be estimated under any model by maximum likelihood. If the distance is $t$ and at site $j$ one species has base A and the other has base C, the contribution to the likelihood at this site is
$$L_j(t) = \pi_A P_{AC}(t) = \pi_C P_{CA}(t)$$
for a time-reversible model. The overall likelihood is $L(t) = \prod_j L_j(t)$ and the log-likelihood is
$$\ell(t) = \sum_j \log L_j(t) = \sum_j \left( \log \pi_{x[j]} + \log P_{x[j]y[j]}(t) \right)$$

40 Distance Between Pairs of Taxa
For models with free $\pi$, it is common to estimate $\pi$ with observed base frequencies. Other parameters are usually estimated by maximum likelihood. The simplest models have closed-form solutions; others require numerical optimization.
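For the Jukes-Cantor model the maximum likelihood distance has the well-known closed form $\hat t = -\tfrac34 \log(1 - \tfrac43 p)$, where $p$ is the proportion of sites that differ. The sketch below applies it to the first 24 bases of the Emu and Kiwi sequences shown earlier and evaluates the log-likelihood around the estimate. The JC69 formulas are standard results added here, and the function names are assumptions; 24 sites is far too few for a meaningful estimate, so this only illustrates the mechanics.

```python
import math


def jc69_log_likelihood(t, n_sites, n_diff):
    """Two-taxon JC69 log-likelihood: sum over sites of log(pi_x) + log(P_xy(t))."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e
    p_diff = 0.25 - 0.25 * e
    return (n_sites * math.log(0.25)
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))


def jc69_distance(n_sites, n_diff):
    """Closed-form maximum likelihood distance under JC69."""
    p = n_diff / n_sites
    if p >= 0.75:
        raise ValueError("too many differences for a finite JC69 distance")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


emu = "GTGACATTCATTACTCGATGATTT"
kiwi = "GTGACCTTTACTACTCGATGACTC"
n_diff = sum(a != b for a, b in zip(emu, kiwi))
t_hat = jc69_distance(len(emu), n_diff)
print(n_diff, t_hat)
for t in (t_hat - 0.05, t_hat, t_hat + 0.05):
    print(round(t, 4), jc69_log_likelihood(t, len(emu), n_diff))
```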

41 Notation for the Alignment
An alignment of $m$ taxa and $n$ sites will have $mn$ nucleotide bases. Let the observed base for the $i$th taxon and the $j$th site be $x_{ij}$.

42 Notation for the Tree
With a time-reversible model, the location of a root (where the CTMC begins at stationarity) does not affect the likelihood calculation. We can assume an unrooted tree without loss of generality. An unrooted tree with $m$ taxa will have $m - 2$ internal nodes. Number these nodes $i = 1, \ldots, 2m - 2$ with the first $m$ for leaf nodes and the last $m - 2$ for internal nodes. For calculation purposes, we will denote node $\rho$ (which could be any node) as the root. There are $2m - 3$ edges in the tree, numbered $e = 1, \ldots, 2m - 3$. Relative to root node $\rho$, edge $e$ connects parent node $p(e)$ and child node $c(e)$ where $p(e)$ is closer to $\rho$ than $c(e)$. Edge $e$ has length $t_e$.

43 Notation for Unobserved Data
The likelihood for a tree is computed by summing over all possible bases at the internal nodes for each of the $n$ sites. For each site, there are $4^{m-2}$ possible allocations of bases at internal nodes, which we will index by $k$. Internal node $i$ is set to nucleotide $b_{ik}$ at the $k$th allocation, $i = m + 1, \ldots, 2m - 2$. Let $z(i, j, k)$ be the nucleotide at node $i$, site $j$, and allocation $k$:
$$z(i, j, k) = \begin{cases} x_{ij} & \text{if } i \le m \text{ ($i$ is a leaf node)} \\ b_{ik} & \text{if } i > m \text{ ($i$ is an internal node)} \end{cases}$$

44 Likelihood of a Tree
Let $P(t)$ be the $4 \times 4$ probability transition matrix over an edge of length $t$. The likelihood of the tree is
$$\prod_j \sum_k \left( \pi_{z(\rho,j,k)} \prod_e P_{z(p(e),j,k)\, z(c(e),j,k)}(t_e) \right)$$
Notice that the sum is over the $4^{m-2}$ possible allocations. A naive calculation would not be tractable for large trees.

45 Felsenstein's Pruning Algorithm
Felsenstein's pruning algorithm is an example of dynamic programming. By saving partial calculations, the time complexity of the likelihood evaluation grows linearly with the number of taxa, not exponentially. For each site and node, the algorithm depends on calculating the probability of the data in the subtree rooted at that node for each possible base. The algorithm begins at the leaves of the tree and recurses to the root. The likelihood of the site is a weighted average of the conditional subtree probabilities at the root, weighted by the stationary distribution.

46 The Algorithm for One Site
Define $f_j(i, b)$ to be the probability of the data at site $j$ in the subtree rooted at node $i$, conditional on the nucleotide at this node being $b$. For a leaf node,
$$f_j(i, b) = \mathbf{1}\{x_{ij} = b\}.$$
For an internal node with child nodes indexed by $c$, attached by edges of length $t_c$,
$$f_j(i, b) = \prod_c \left( \sum_z P_{bz}(t_c)\, f_j(c, z) \right).$$
The likelihood at site $j$ is
$$L_j = \sum_b \pi_b f_j(\rho, b).$$
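The recursion can be sketched in a few lines. Everything specific below is an assumption made for illustration: the Jukes-Cantor transition probabilities (a standard closed form), the nested-list tree encoding (a leaf is a taxon name, an internal node is a list of `(child, edge_length)` pairs), and the four-taxon tree and site pattern.

```python
import math

BASES = "ACGT"
PI = {b: 0.25 for b in BASES}          # JC69 stationary distribution


def p_jc69(a, b, t):
    """JC69 transition probability from base a to base b over an edge of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional(node, site):
    """Return {b: f(node, b)}: probability of the data below `node` given base b there."""
    if isinstance(node, str):          # leaf: indicator of the observed base
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    f = {b: 1.0 for b in BASES}
    for child, t in node:              # product over children
        fc = conditional(child, site)
        for b in BASES:                # sum over the child's possible bases z
            f[b] *= sum(p_jc69(b, z, t) * fc[z] for z in BASES)
    return f


def site_likelihood(root, site):
    """L = sum_b pi_b f(root, b)."""
    f = conditional(root, site)
    return sum(PI[b] * f[b] for b in BASES)


# Hypothetical tree ((a:0.1, b:0.2):0.05, (c:0.3, d:0.1):0.05) and one site pattern.
tree = [([("a", 0.1), ("b", 0.2)], 0.05), ([("c", 0.3), ("d", 0.1)], 0.05)]
site = {"a": "A", "b": "A", "c": "G", "d": "A"}
print(site_likelihood(tree, site))
```

Each node is visited once per site and does a fixed amount of work per child, which is where the linear growth in the number of taxa comes from.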

47 Example
Do an example with five taxa for one site. See chalk board for example.

48 Example
[Output: matrix f of conditional subtree probabilities with rows for nodes 1 to 8 and columns A, C, G, T; numerical values not preserved in this transcription.]

49 Maximum Likelihood Estimation for One Tree
For a single tree topology, the ML estimation requires optimization of branch lengths and of any parameters in the substitution model. Numerical optimization methods are required even for simple models and small trees.

50 Tree Search
The search for the maximum likelihood tree conceptually requires obtaining the maximum likelihood for each possible tree topology and then picking the best of these. For more than a dozen or so taxa, exhaustive search is infeasible. Heuristic search algorithms typically define a neighborhood structure for possible topologies. The search goes through neighbors and jumps to the first neighbor with a higher likelihood. When all neighbors are inferior to the current tree, the search stops. Much improvement has been made in recent years (RAxML and GARLI are two modern ML programs).

51 Bayesian Inference
In Bayesian inference, the posterior distribution is proportional to the product of the likelihood and the prior distribution. For parameters $\theta$ and data $D$,
$$P\{\theta \mid D\} = \frac{P\{D \mid \theta\}\, P\{\theta\}}{P\{D\}}.$$
The denominator is the marginal likelihood of the data, which is the integral of the likelihood against the prior distribution.

52 Bayesian Phylogenetics
For a phylogenetic problem, the parameter $\theta$ typically includes the tree topology, the edge lengths, and parameters for the substitution model: $\theta = (\tau, \nu, \phi)$. Often we assume independence of these components: $P\{\theta\} = P\{\tau\}\, P\{\nu\}\, P\{\phi\}$. In a typical phylogenetic problem, the marginal likelihood cannot be computed, as
$$P\{D\} = \int_{\Theta} P\{D \mid \theta\}\, P\{\theta\}\, d\theta$$
is a sum of very many terms (one for each topology) where each term is a high-dimensional integral of a complicated function.

53 Phylogenetic Inference
We may be interested in the posterior distribution of the tree topology, $P\{\tau \mid D\}$. When this posterior distribution is diffuse, we can summarize it by computing posterior probabilities of clades. The posterior probability of a clade $C$ is the sum of the posterior probabilities of all tree topologies that contain it:
$$P\{C \mid D\} = \sum_{\tau : C \in \tau} P\{\tau \mid D\}$$
A consensus tree which includes as many clades with high posterior probability as possible is often used as a single-tree summary of a distribution of the tree topology.

54 Sample-based Inference
Any aspect of a posterior distribution can be estimated from a sample drawn from the distribution. For example, the sample proportion of trees with topology $\tau_0$ is an estimate of $P\{\tau_0 \mid D\}$. Also, the sample mean of a transition/transversion parameter $\kappa$ is an estimate of the posterior mean $E[\kappa \mid D]$. But how do we sample from a complicated posterior distribution?
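Sample proportions of this kind are simple to compute once each sampled tree is reduced to its set of clades. The representation below (a tree as a set of frozensets of taxon names) and the four-tree sample are hypothetical, chosen only to show the bookkeeping.

```python
from collections import Counter


def clade(*taxa):
    """A clade is the set of taxa below some internal node."""
    return frozenset(taxa)


def clade_frequencies(sampled_trees):
    """Estimate P{C | D} for each clade C as the proportion of sampled trees containing it."""
    counts = Counter()
    for clades in sampled_trees:
        counts.update(set(clades))
    n = len(sampled_trees)
    return {c: k / n for c, k in counts.items()}


def topology_frequencies(sampled_trees):
    """Estimate P{tau | D} as the proportion of sampled trees with each topology."""
    counts = Counter(frozenset(clades) for clades in sampled_trees)
    n = len(sampled_trees)
    return {topo: k / n for topo, k in counts.items()}


# Hypothetical sample of rooted trees on taxa A, B, C, D (nontrivial clades only).
sample = [
    {clade("A", "B"), clade("A", "B", "C")},   # (((A,B),C),D)
    {clade("A", "B"), clade("A", "B", "C")},   # (((A,B),C),D)
    {clade("B", "C"), clade("A", "B", "C")},   # (((B,C),A),D)
    {clade("A", "B"), clade("C", "D")},        # ((A,B),(C,D))
]

freqs = clade_frequencies(sample)
for c, p in sorted(freqs.items(), key=lambda item: -item[1]):
    print(sorted(c), p)
```

In this toy sample the clade {A, B} appears in three of the four trees even though no single topology reaches that frequency, which is the situation where clade summaries are most useful.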

55 Markov Chain Monte Carlo
Markov chain Monte Carlo (MCMC) is a mathematical method for obtaining dependent samples from a target distribution (such as a posterior distribution). The idea is to construct a Markov chain whose state space is the parameter space $\Theta$ and whose stationary distribution matches the target distribution, say $P\{\theta \mid D\}$. Simulating the Markov chain produces a sample $\theta_0, \theta_1, \ldots$ which, after discarding an initial burn-in portion, may be treated as a dependent sample from the target distribution.

56 Metropolis-Hastings
For notational convenience, let the target distribution be $\pi(\theta) = P\{\theta \mid D\}$. The most common form of MCMC uses the Metropolis-Hastings algorithm, in which a proposal distribution $q$, which can depend on the most recently sampled $\theta_i$, generates a proposal $\theta^*$ that is accepted with some probability. When accepted, $\theta_{i+1} = \theta^*$. When rejected, $\theta_{i+1} = \theta_i$. The proposal distribution $q$ is essentially arbitrary provided it can move around the entire space $\Theta$.

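57 Metropolis-Hastings Algorithm
The acceptance probability is
$$\min\left\{1,\ \frac{\pi(\theta^*)}{\pi(\theta)} \times \frac{q(\theta \mid \theta^*)}{q(\theta^* \mid \theta)} \times J\right\}$$
where $J$ is a Jacobian. Notice that the target density appears only as a ratio. This means that it need only be known up to a constant factor, and we can simply evaluate $h(\theta) = P\{D \mid \theta\}\, P\{\theta\}$ since
$$\frac{\pi(\theta^*)}{\pi(\theta)} = \frac{P\{D \mid \theta^*\}\, P\{\theta^*\} / P\{D\}}{P\{D \mid \theta\}\, P\{\theta\} / P\{D\}} = \frac{h(\theta^*)}{h(\theta)}$$
Note that the proposal ratio $q(\theta \mid \theta^*) / q(\theta^* \mid \theta)$ can be tricky to compute.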
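A minimal sketch of the algorithm on a one-dimensional toy problem, under assumptions that are not from the slides: the unnormalized target is $h(x) = e^{-x^2/2}$ (a standard normal without its constant), and the proposal is a symmetric Gaussian random walk, so the proposal ratio and the Jacobian are both 1. Phylogenetic proposals such as tree rearrangements are much more involved.

```python
import math
import random


def metropolis_hastings(log_h, start, n_steps, step_size, rng):
    """Random-walk Metropolis-Hastings; returns the chain as a list of states."""
    x = start
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)      # symmetric proposal
        log_ratio = log_h(proposal) - log_h(x)        # log of h(theta*) / h(theta)
        # Accept with probability min(1, ratio); otherwise keep the current state.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        chain.append(x)
    return chain


def log_h(x):
    """Log of the unnormalized target density exp(-x^2 / 2)."""
    return -0.5 * x * x


chain = metropolis_hastings(log_h, start=5.0, n_steps=20_000, step_size=1.0,
                            rng=random.Random(42))
kept = chain[1_000:]                                  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
print(mean, var)
```

Working with $\log h$ avoids numerical underflow, and the normalizing constant never appears, as the ratio argument above promises.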

58 MCMC Example
[Figure: target distribution.]

59 First Point
[Figure: initial point.]

60 Proposal Distribution
[Figure: proposal distribution.]

61 First Proposal
[Figure: first proposal. Accept with probability 1.]

62 Second Proposal
[Figure: second proposal. Acceptance probability not preserved in this transcription.]

63 Third Proposal
[Figure: third proposal. Acceptance probability not preserved in this transcription.]

64 Beginning of Sample
[Figure: sample so far.]

65 Larger Sample
[Figure: larger sample.]

66 Comparison to Target
[Figure: comparison of the sample to the target distribution.]

67 Subtree Pruning Regrafting
See example from board. More details will be posted in a separate document.

68 Rescaling a Tree
More details will be posted in a separate document.

69 Cautions
MCMC does not always converge. One should always run several chains with different random number seeds and compare answers. If the true tree has some very short internal edges, Bayesian inference can mislead. Different likelihood models can lead to different results.

70 Bayesian Inference
Development of Bayesian methods has led to continual improvement in our ability to model and learn about molecular evolution. Bayesian inference uses likelihood, but requires a prior distribution. Bayesian inference is computationally intensive, but can be less so than ML plus bootstrapping. Bayesian inference directly measures items of interest on an easily interpretable probability scale. Some folks dislike the requirement of specifying a prior distribution.


USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA natarajan.meghanathan@jsums.edu

More information

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference Stephane Guindon, F. Le Thiec, Patrice Duroux, Olivier Gascuel To cite this version: Stephane Guindon, F. Le Thiec, Patrice

More information

Hidden Markov Models

Hidden Markov Models 8.47 Introduction to omputational Molecular Biology Lecture 7: November 4, 2004 Scribe: Han-Pang hiu Lecturer: Ross Lippert Editor: Russ ox Hidden Markov Models The G island phenomenon The nucleotide frequencies

More information

Bayesian coalescent inference of population size history

Bayesian coalescent inference of population size history Bayesian coalescent inference of population size history Alexei Drummond University of Auckland Workshop on Population and Speciation Genomics, 2016 1st February 2016 1 / 39 BEAST tutorials Population

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

Visualization of Phylogenetic Trees and Metadata

Visualization of Phylogenetic Trees and Metadata Visualization of Phylogenetic Trees and Metadata November 27, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com support-clcbio@qiagen.com

More information

(http://genomes.urv.es/caical) TUTORIAL. (July 2006)

(http://genomes.urv.es/caical) TUTORIAL. (July 2006) (http://genomes.urv.es/caical) TUTORIAL (July 2006) CAIcal manual 2 Table of contents Introduction... 3 Required inputs... 5 SECTION A Calculation of parameters... 8 SECTION B CAI calculation for FASTA

More information

Phylogenetic systematics turns over a new leaf

Phylogenetic systematics turns over a new leaf 30 Review Phylogenetic systematics turns over a new leaf Paul O. Lewis Long restricted to the domain of molecular systematics and studies of molecular evolution, likelihood methods are now being used in

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014 Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

More information

Introduction to Markov Chain Monte Carlo

Introduction to Markov Chain Monte Carlo Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006 Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm

More information

Notes on Determinant

Notes on Determinant ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without

More information

Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

More information

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question. Name: Class: Date: Chapter 17 Practice Multiple Choice Identify the choice that best completes the statement or answers the question. 1. The correct order for the levels of Linnaeus's classification system,

More information

Imperfect Debugging in Software Reliability

Imperfect Debugging in Software Reliability Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health

More information

Introduction to Phylogenetic Analysis

Introduction to Phylogenetic Analysis Subjects of this lecture Introduction to Phylogenetic nalysis Irit Orr 1 Introducing some of the terminology of phylogenetics. 2 Introducing some of the most commonly used methods for phylogenetic analysis.

More information

Generating Valid 4 4 Correlation Matrices

Generating Valid 4 4 Correlation Matrices Applied Mathematics E-Notes, 7(2007), 53-59 c ISSN 1607-2510 Available free at mirror sites of http://www.math.nthu.edu.tw/ amen/ Generating Valid 4 4 Correlation Matrices Mark Budden, Paul Hadavas, Lorrie

More information

Topic models for Sentiment analysis: A Literature Survey

Topic models for Sentiment analysis: A Literature Survey Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

GENERATING THE FIBONACCI CHAIN IN O(log n) SPACE AND O(n) TIME J. Patera

GENERATING THE FIBONACCI CHAIN IN O(log n) SPACE AND O(n) TIME J. Patera ˆ ˆŠ Œ ˆ ˆ Œ ƒ Ÿ 2002.. 33.. 7 Š 539.12.01 GENERATING THE FIBONACCI CHAIN IN O(log n) SPACE AND O(n) TIME J. Patera Department of Mathematics, Faculty of Nuclear Science and Physical Engineering, Czech

More information

OPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION

OPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION OPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION Sérgio Pequito, Stephen Kruzick, Soummya Kar, José M. F. Moura, A. Pedro Aguiar Department of Electrical and Computer Engineering

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

An Internal Model for Operational Risk Computation

An Internal Model for Operational Risk Computation An Internal Model for Operational Risk Computation Seminarios de Matemática Financiera Instituto MEFF-RiskLab, Madrid http://www.risklab-madrid.uam.es/ Nicolas Baud, Antoine Frachot & Thierry Roncalli

More information

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Variables. Exploratory Data Analysis

Variables. Exploratory Data Analysis Exploratory Data Analysis Exploratory Data Analysis involves both graphical displays of data and numerical summaries of data. A common situation is for a data set to be represented as a matrix. There is

More information

Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

More information

Introduction to Principal Components and FactorAnalysis

Introduction to Principal Components and FactorAnalysis Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a

More information

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

More information

Analysis of Algorithms I: Optimal Binary Search Trees

Analysis of Algorithms I: Optimal Binary Search Trees Analysis of Algorithms I: Optimal Binary Search Trees Xi Chen Columbia University Given a set of n keys K = {k 1,..., k n } in sorted order: k 1 < k 2 < < k n we wish to build an optimal binary search

More information

Divergence Time Estimation using BEAST v1.7.5

Divergence Time Estimation using BEAST v1.7.5 Divergence Time Estimation using BEAST v1.7.5 Central among the questions explored in biology are those that seek to understand the timing and rates of evolutionary processes. Accurate estimates of species

More information

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091)

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Magnus Wiktorsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February

More information

Notes on Symmetric Matrices

Notes on Symmetric Matrices CPSC 536N: Randomized Algorithms 2011-12 Term 2 Notes on Symmetric Matrices Prof. Nick Harvey University of British Columbia 1 Symmetric Matrices We review some basic results concerning symmetric matrices.

More information

Introduction to Matrix Algebra

Introduction to Matrix Algebra Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

More information

A branch-and-bound algorithm for the inference of ancestral. amino-acid sequences when the replacement rate varies among

A branch-and-bound algorithm for the inference of ancestral. amino-acid sequences when the replacement rate varies among A branch-and-bound algorithm for the inference of ancestral amino-acid sequences when the replacement rate varies among sites Tal Pupko 1,*, Itsik Pe er 2, Masami Hasegawa 1, Dan Graur 3, and Nir Friedman

More information

Tutorial on Markov Chain Monte Carlo

Tutorial on Markov Chain Monte Carlo Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,

More information

SPECTRAL POLYNOMIAL ALGORITHMS FOR COMPUTING BI-DIAGONAL REPRESENTATIONS FOR PHASE TYPE DISTRIBUTIONS AND MATRIX-EXPONENTIAL DISTRIBUTIONS

SPECTRAL POLYNOMIAL ALGORITHMS FOR COMPUTING BI-DIAGONAL REPRESENTATIONS FOR PHASE TYPE DISTRIBUTIONS AND MATRIX-EXPONENTIAL DISTRIBUTIONS Stochastic Models, 22:289 317, 2006 Copyright Taylor & Francis Group, LLC ISSN: 1532-6349 print/1532-4214 online DOI: 10.1080/15326340600649045 SPECTRAL POLYNOMIAL ALGORITHMS FOR COMPUTING BI-DIAGONAL

More information

Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time

Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Class Outline Link-based page importance measures Why

More information

Numerical methods for American options

Numerical methods for American options Lecture 9 Numerical methods for American options Lecture Notes by Andrzej Palczewski Computational Finance p. 1 American options The holder of an American option has the right to exercise it at any moment

More information

The Art of the Tree of Life. Catherine Ibes & Priscilla Spears March 2012

The Art of the Tree of Life. Catherine Ibes & Priscilla Spears March 2012 The Art of the Tree of Life Catherine Ibes & Priscilla Spears March 2012 from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved. Charles Darwin, The

More information

Monte Carlo Simulation

Monte Carlo Simulation 1 Monte Carlo Simulation Stefan Weber Leibniz Universität Hannover email: sweber@stochastik.uni-hannover.de web: www.stochastik.uni-hannover.de/ sweber Monte Carlo Simulation 2 Quantifying and Hedging

More information

Monte Carlo-based statistical methods (MASM11/FMS091)

Monte Carlo-based statistical methods (MASM11/FMS091) Monte Carlo-based statistical methods (MASM11/FMS091) Jimmy Olsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February 5, 2013 J. Olsson Monte Carlo-based

More information

Credit Risk Models: An Overview

Credit Risk Models: An Overview Credit Risk Models: An Overview Paul Embrechts, Rüdiger Frey, Alexander McNeil ETH Zürich c 2003 (Embrechts, Frey, McNeil) A. Multivariate Models for Portfolio Credit Risk 1. Modelling Dependent Defaults:

More information

A hidden Markov model for criminal behaviour classification

A hidden Markov model for criminal behaviour classification RSS2004 p.1/19 A hidden Markov model for criminal behaviour classification Francesco Bartolucci, Institute of economic sciences, Urbino University, Italy. Fulvia Pennoni, Department of Statistics, University

More information

A comparison of methods for estimating the transition:transversion ratio from DNA sequences

A comparison of methods for estimating the transition:transversion ratio from DNA sequences Molecular Phylogenetics and Evolution 32 (2004) 495 503 MOLECULAR PHYLOGENETICS AND EVOLUTION www.elsevier.com/locate/ympev A comparison of methods for estimating the transition:transversion ratio from

More information

Gaussian Processes to Speed up Hamiltonian Monte Carlo

Gaussian Processes to Speed up Hamiltonian Monte Carlo Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo

More information

Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour. Patrick Lam Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

More information

Message-passing sequential detection of multiple change points in networks

Message-passing sequential detection of multiple change points in networks Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal

More information

Inference of Large Phylogenetic Trees on Parallel Architectures. Michael Ott

Inference of Large Phylogenetic Trees on Parallel Architectures. Michael Ott Inference of Large Phylogenetic Trees on Parallel Architectures Michael Ott TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl für Rechnertechnik und Rechnerorganisation / Parallelrechnerarchitektur Inference of

More information

Approximating the Coalescent with Recombination. Niall Cardin Corpus Christi College, University of Oxford April 2, 2007

Approximating the Coalescent with Recombination. Niall Cardin Corpus Christi College, University of Oxford April 2, 2007 Approximating the Coalescent with Recombination A Thesis submitted for the Degree of Doctor of Philosophy Niall Cardin Corpus Christi College, University of Oxford April 2, 2007 Approximating the Coalescent

More information

Bio-Informatics Lectures. A Short Introduction

Bio-Informatics Lectures. A Short Introduction Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively

More information

Data Partitions and Complex Models in Bayesian Analysis: The Phylogeny of Gymnophthalmid Lizards

Data Partitions and Complex Models in Bayesian Analysis: The Phylogeny of Gymnophthalmid Lizards Syst. Biol. 53(3):448 469, 2004 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150490445797 Data Partitions and Complex Models in Bayesian Analysis:

More information

The Characteristic Polynomial

The Characteristic Polynomial Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem

More information

Conductance, the Normalized Laplacian, and Cheeger s Inequality

Conductance, the Normalized Laplacian, and Cheeger s Inequality Spectral Graph Theory Lecture 6 Conductance, the Normalized Laplacian, and Cheeger s Inequality Daniel A. Spielman September 21, 2015 Disclaimer These notes are not necessarily an accurate representation

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

More information

High Throughput Network Analysis

High Throughput Network Analysis High Throughput Network Analysis Sumeet Agarwal 1,2, Gabriel Villar 1,2,3, and Nick S Jones 2,4,5 1 Systems Biology Doctoral Training Centre, University of Oxford, Oxford OX1 3QD, United Kingdom 2 Department

More information

A Combinatorial Approach for Determining Phylogenetic Invariants for the General Model

A Combinatorial Approach for Determining Phylogenetic Invariants for the General Model A Combinatorial Approach for Determining Phylogenetic Invariants for the General Model Thomas R. Hagedorn CRM-2671 March 2000 Department of Mathematics and Statistics, The College of New Jersey, P.O. Box

More information

Borges, J. L. 1998. On exactitude in science. P. 325, In, Jorge Luis Borges, Collected Fictions (Trans. Hurley, H.) Penguin Books.

Borges, J. L. 1998. On exactitude in science. P. 325, In, Jorge Luis Borges, Collected Fictions (Trans. Hurley, H.) Penguin Books. ... In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, those

More information

The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

More information

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its

More information

More details on the inputs, functionality, and output can be found below.

More details on the inputs, functionality, and output can be found below. Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing

More information

UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision

UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision D.B. Grimes A.P. Shon R.P.N. Rao Dept. of Computer Science and Engineering University of Washington Seattle, WA

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

Lecture 18: Applications of Dynamic Programming Steven Skiena. Department of Computer Science State University of New York Stony Brook, NY 11794 4400

Lecture 18: Applications of Dynamic Programming Steven Skiena. Department of Computer Science State University of New York Stony Brook, NY 11794 4400 Lecture 18: Applications of Dynamic Programming Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794 4400 http://www.cs.sunysb.edu/ skiena Problem of the Day

More information

Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1

Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1 Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1 Ziheng Yang Department of Animal Science, Beijing Agricultural University Felsenstein s maximum-likelihood

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni 1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

More information

Factor analysis. Angela Montanari

Factor analysis. Angela Montanari Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

More information

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued). MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors Jordan canonical form (continued) Jordan canonical form A Jordan block is a square matrix of the form λ 1 0 0 0 0 λ 1 0 0 0 0 λ 0 0 J = 0

More information

THE USE OF STATISTICAL DISTRIBUTIONS TO MODEL CLAIMS IN MOTOR INSURANCE

THE USE OF STATISTICAL DISTRIBUTIONS TO MODEL CLAIMS IN MOTOR INSURANCE THE USE OF STATISTICAL DISTRIBUTIONS TO MODEL CLAIMS IN MOTOR INSURANCE Batsirai Winmore Mazviona 1 Tafadzwa Chiduza 2 ABSTRACT In general insurance, companies need to use data on claims gathered from

More information