Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment


1 Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment

2 A teacher's dilemma. To understand multiple sequence alignment, you first need to know substitution matrices. To understand substitution matrices, you first need to know phylogenetic trees. To understand phylogenetic trees, you first need to know multiple sequence alignment. We'll start with substitution matrices.

3 Substitution matrices. Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of mutation (or log-odds ratio). Derived from multiple sequence alignments. Two commonly used matrices are PAM and BLOSUM: PAM = percent accepted mutations (Dayhoff); BLOSUM = BLOCKS substitution matrix (Henikoff).

4 PAM (M. Dayhoff, 1978). Evolutionary time is measured in Percent Accepted Mutations, or PAMs. One PAM of evolution means that 1% of the residues/bases have changed, averaged over all 20 amino acids. To get the relative frequency of each type of mutation, we count the times it was observed in a database of multiple sequence alignments. PAM is based on global alignments and assumes a Markov model for evolution. (Pictured: Margaret Oakley Dayhoff.)

5 BLOSUM (Henikoff & Henikoff, 1992). Based on a database of ungapped local alignments (BLOCKS). The alignments have lower similarity than the PAM alignments. The BLOSUM number indicates the percent identity level of the sequences in the alignment; for example, BLOSUM62 was counted from sequences with approximately 62% identity. Some BLOCKS represent functional units, providing validation of the alignment. (Pictured: Steven Henikoff.)

6 Multiple Sequence Alignment. A multiple sequence alignment is made using many pairwise sequence alignments.

7 Columns in an MSA have a common evolutionary history. [Figure: a phylogenetic tree for one position in the alignment.] By aligning the sequences, we are asserting that the aligned residues in each column had a common ancestor.

8 A tree shows the evolutionary history of a single position. Ancestral characters can be inferred by parsimony analysis. [Figure: a tree for one alignment position with leaves worm, clam, bird, goat, and fish.]

9 Counting mutations without knowing ancestral sequences. Naive way: assume any of the characters could be the ancestral one, and assume equal distance to the ancestor from each taxon. [Figure: a seven-sequence alignment, with rows beginning LKF and ending in variants of RLSKKP, in which one column holds G four times, W twice, and N once; the column is drawn next to a star tree.] If G was the ancestor, then it mutated to W twice, to N once, and stayed G three times.

10 We could have picked W as the ancestor... [Figure: the same seven-sequence alignment and star tree, now with W as the ancestor.] If W was the ancestor, then it mutated to G four times, to N once, and stayed W once.* *FYI: This is how you draw a phylogenetic tree when the branch order is not known.

11 Substitution matrices are symmetrical. Since we don't know which sequence came first, we don't know whether G->W or W->G is correct, so we count this as one mutation of each type. P(G->W) and P(W->G) are the same number. (That's why we only show the upper triangle.)

12 Summing the substitution counts. We assume the ancestor is one of the observed amino acids, but we don't know which, so we try them all. [Figure: one column of an MSA (G, W, and N residues) beside a symmetrical count matrix with rows and columns G, W, N.]

13 Next possible ancestor, G again. We already counted this, so ignore it.

14-16 [Figures: the G/W/N count matrix is updated as the next residues in the column are taken as the ancestor.]

17 Next... G again. Counting G as the ancestor as many times as it appears recognizes the increased likelihood that G (the most frequent aa at this position) is the true ancestor.

18 (No counts for the last sequence.)

19 Go to the next column. Continue summing. Continue doing this for every column in every multiple sequence alignment... [Figure: the alignment with further columns beside the G/W/N count matrix; the counts for the first column sum to TOTAL = 21.]
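The counting procedure on slides 9 to 19 can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: it assumes that trying every residue as the ancestor and skipping pairs already counted amounts to counting every unordered pair of residues in a column once. The function name `count_pairs` and the ordering of residues within the example column are hypothetical; only the composition (four G, two W, one N) comes from the slides.

```python
from collections import Counter
from itertools import combinations


def count_pairs(columns):
    """Sum unordered residue-pair counts over alignment columns.

    Each column is a string with one residue per sequence. Every pair of
    sequences contributes one count, stored under a sorted key so that
    (G, W) and (W, G) land in the same cell of the symmetrical matrix.
    """
    counts = Counter()
    for column in columns:
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical ordering of the example column: four G, two W, one N.
column = "GGGGWWN"
counts = count_pairs([column])
total = sum(counts.values())

for pair, n in sorted(counts.items()):
    print(pair, n)
print("TOTAL =", total)
```

For this column the total is 21, the number of sequence pairs among seven sequences, and the G-G cell holds 6, which are the figures used on slide 21.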

20 Probability ratios are expressed as log odds. Substitutions (and many other things in bioinformatics) are expressed as a "likelihood ratio", or "odds ratio", of the observed data over the expected value. Here, likelihood and odds are used as synonyms for probability. So log odds is the log (usually base 2) of the odds ratio: $\text{log odds ratio} = \log_2(\text{observed}/\text{expected})$.

21 How do you calculate log-odds? $P(G) = 4/7 = 0.57$. Observed probability of G->G: $q_{GG} = P(G \to G) = 6/21 = 0.29$. Expected probability of G->G: $e_{GG} = 0.57 \times 0.57 = 0.33$. Odds ratio: $q_{GG}/e_{GG} = 0.29/0.33$. Log odds ratio: $\log_2(q_{GG}/e_{GG}) \approx -0.19$. If the lod is < 0, then the mutation is less likely than expected by chance. If it is > 0, it is more likely.
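The arithmetic on this slide can be checked with a short Python sketch. It follows the slide's convention that the expected probability of a pair is the product of the two residue frequencies; the function name `log_odds` is hypothetical.

```python
import math


def log_odds(q_observed, p_a, p_b):
    """Return log2(q / e), with e = P(a) * P(b) as on the slide."""
    e_expected = p_a * p_b
    return math.log2(q_observed / e_expected)


# Values from the example column: four G out of seven residues,
# six G-G pairs out of 21 pairs in total.
p_g = 4 / 7
q_gg = 6 / 21
lod_gg = log_odds(q_gg, p_g, p_g)
print(round(lod_gg, 2))  # slightly negative: G-G is seen a little less than expected
```

With unrounded fractions the result is about -0.19, so in this single column G-G pairs are slightly rarer than chance would predict.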

22 Same amino acids, different distribution, different outcome. G's spread over many columns: $P(G) = 0.50$, $e_{GG} = 0.25$, $q_{GG} = 9/42 = 0.21$, $\text{lod} = \log_2(0.21/0.25) \approx -0.2$. G's concentrated: $P(G) = 0.50$, $e_{GG} = 0.25$, $q_{GG} = 21/42 = 0.5$, $\text{lod} = \log_2(0.50/0.25) = 1$. [Figures: two small example alignments, one for each case.]

23 Different observations, same expectation. G and W seen together more often than expected: $P(G) = 0.50$, $P(W) = 0.14$, $e_{GW} = 0.07$, $q_{GW} = 7/42 = 0.17$, $\text{lod} = \log_2(0.17/0.07) = 1.3$. G's and W's not seen together: $P(G) = 0.50$, $P(W) = 0.14$, $e_{GW} = 0.07$, $q_{GW} = 3/42 = 0.07$, $\text{lod} = \log_2(0.07/0.07) = 0$. [Figures: two small example alignments, one for each case.]

24 In-class exercise: get the substitution value for P->Q. Sequence alignment database: PQPP, QQQP, QQPP, QPPP, QQQP. Fill in the substitution counts (a symmetrical matrix over P and Q), then compare expected (e) versus observed (q) for P->Q: P(P) = ___, P(Q) = ___, $e_{PQ}$ = ___, $q_{PQ}$ = ___ / ___ = ___, $\text{lod} = \log_2(q_{PQ}/e_{PQ})$ = ___.

25 PAM assumes Markovian evolution. A Markov process is one where the likelihood of the next "state" depends only on the current state. The inference that evolution is Markovian assumes that base changes (or amino acid changes) occur at a constant rate and depend only on the identity of the current base (or amino acid). [Figure: transition likelihood per million years (MY) from the current amino acid, with transitions such as G->G, G->A, G->V, V->V, and V->G, along a time axis in millions of years.]

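26 Markovian evolution is an extrapolation. Start with one sequence. One position. Say Gly. Wait 1 million years. What amino acids are now found at that position? [Equation: PAM1 applied to the Gly state.] Wait another million years. [Equation: PAM1 applied to the result.] But... applying PAM1 twice is just PAM1 x PAM1 applied to the starting state.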

27 250 million years? $\text{PAM1} \cdot \text{PAM1} \cdots \text{PAM1} = \text{PAM1}^{250}$, so $\text{PAM250} = \text{PAM1}^{250}$. The number after PAM denotes the power to which PAM1 was taken.
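The extrapolation from PAM1 to PAM250 is repeated matrix multiplication of the mutation probability matrix. The sketch below uses a made-up 3-state matrix, not Dayhoff's 20 x 20 data, purely to show the mechanics; `toy_pam1`, `mat_mul`, and `mat_pow` are hypothetical names, and the toy values were chosen so that each state changes with probability of about 1% per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy transition matrix (assumed values): row i gives the probabilities
# of state i staying the same or changing in one PAM step.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

Each row of the result still sums to 1, but the diagonal has dropped well below 0.99, which is the sense in which a high-numbered PAM matrix describes distant relatives.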

28 NOTE OF CLARIFICATION: PAM does not stand for Plus A Million years (or anything like that). It stands for Percent Accepted Mutations. One PAM1 unit does not correspond to 1 million years of evolution. There is no timescale associated with PAM. PAM1 corresponds to 1% mutations (or 99% identity). The timescale depends on the species.

29 Differences between PAM and BLOSUM.
PAM: PAM matrices are based on global alignments of closely related proteins. PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices are extrapolated from PAM1 using an assumed Markov chain.
BLOSUM: BLOSUM matrices are based on local alignments. BLOSUM62 is a matrix calculated from comparisons of sequences with approximately 62% identity. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. BLOSUM62 is the default matrix in BLAST (the database search program). It is tailored for comparisons of moderately distant proteins. Alignment of distant relatives may be more accurate with a different matrix.

30 PAM250

31 BLOSUM62

32 In-class exercise: which substitution matrix, PAM250 or BLOSUM62, favors... conservation of polar residues? Conservation of non-polar residues? Conservation of C, Y, or W? Polar-to-nonpolar mutations? Polar-to-polar mutations?

33 Protein versus DNA alignments. Are protein alignments better? Protein alphabet = 20, DNA alphabet = 4, so a protein alignment is more informative: there is less chance of homoplasy with proteins, and homology is detectable at greater edit distance. Better Gold Standard alignments are available for proteins, which gives better statistics from G.S. alignments. On the other hand, DNA alignments are more sensitive to short evolutionary distances.

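34 DNA evolutionary models: p-distance. What is the relationship between time and the % identity? $p = D/L$, where D is the number of differing positions and L is the alignment length. [Plot: p (from 0 to 1) against evolutionary time.] p is a good measure of time only when p is small.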

35 DNA evolutionary models: Poisson correction. Corrects for multiple mutations at the same site (unobserved mutations). With $p = D/L$, the fraction unchanged decays according to the Poisson function: $1 - p = e^{-2rt}$. In the time t since the common ancestor, 2rt mutations have occurred, where r is the mutation rate (r = genetic drift * selection pressure), so the Poisson distance is $d_P = 2rt = -\ln(1 - p)$. [Plot: p (from 0 to 1) against evolutionary time.] Poisson correction assumes p goes to 1 at $t = \infty$. Where should it really go?
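As a small illustration of slides 34 and 35, the following Python sketch computes the p-distance of two aligned sequences and its Poisson correction. The two sequences are invented for the example, the function names are hypothetical, and the code assumes equal-length, gap-free input.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions that differ: p = D / L."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def poisson_distance(p):
    """Poisson-corrected distance d_P = -ln(1 - p), defined for p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


# Invented example: 2 differences over 10 aligned positions.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d_p = poisson_distance(p)
print(p, round(d_p, 4))
```

The corrected distance is always at least as large as p, and the gap widens as p grows, which is why p alone is a good measure only when it is small.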

36 DNA evolutionary models: Jukes-Cantor. What is the relationship between true evolutionary distance and p-distance?
Prob(a given mutation in one unit of time) = α, with α << 1; Prob(no mutation) = 1 - 3α. [Matrix over A, C, G, T: each diagonal entry is 1 - 3α and each off-diagonal entry is α.]
At time t, the fraction identical is q(t) and the fraction non-identical is p(t), with $p(t) + q(t) = 1$.
In time t+1, each of the q(t) identical positions stays the same with prob = 1 - 3α in each sequence. Prob that both sequences do not mutate: $(1 - 3\alpha)^2 = 1 - 6\alpha + 9\alpha^2 \approx 1 - 6\alpha$ (since α << 1, we can safely neglect $\alpha^2$).
Prob that a mismatch mutates back to an identity: $2\alpha\,p(t)$.
$q(t+1) = q(t)(1 - 6\alpha) + 2\alpha(1 - q(t))$
$dq(t)/dt \approx q(t+1) - q(t) = 2\alpha - 8\alpha\,q(t)$
Integrating, with q(0) = 1: $q(t) = \tfrac{1}{4}\left(1 + 3e^{-8\alpha t}\right)$
Solving for the distance: $d_{JC} = 6\alpha t = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$, where p is the p-distance.
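The derivation above can be checked numerically. The sketch below iterates the recurrence for q(t), compares it with the closed form, and then recovers the distance 6αt from the resulting p-distance. The values of `alpha` and `steps` are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """d_JC = -(3/4) ln(1 - (4/3) p), defined only for 0 <= p < 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def fraction_identical_iterated(alpha, steps):
    """Apply q(t+1) = q(t)(1 - 6a) + 2a(1 - q(t)) starting from q(0) = 1."""
    q = 1.0
    for _ in range(steps):
        q = q * (1.0 - 6.0 * alpha) + 2.0 * alpha * (1.0 - q)
    return q


def fraction_identical_closed_form(alpha, t):
    """q(t) = (1/4)(1 + 3 exp(-8 a t))."""
    return 0.25 * (1.0 + 3.0 * math.exp(-8.0 * alpha * t))


alpha, steps = 1e-4, 1000  # assumed values, small alpha as the model requires
q_iter = fraction_identical_iterated(alpha, steps)
q_closed = fraction_identical_closed_form(alpha, steps)
d_jc = jukes_cantor_distance(1.0 - q_closed)

print(round(q_iter, 4), round(q_closed, 4))
print(round(d_jc, 4), 6 * alpha * steps)
```

The iterated and closed-form values of q agree closely because α is small, and the corrected distance returns 6αt even though the observed p-distance is noticeably smaller.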

37 DNA evolutionary models. [Plot: p (from 0 to 1) against evolutionary time for the p-distance, Poisson correction, and Jukes-Cantor models.] In Jukes-Cantor, p approaches a limit of 0.75 at infinite evolutionary distance.

38 Transitions/transversions. In DNA replication, errors can be transitions (purine for purine, pyrimidine for pyrimidine) or transversions (purine for pyrimidine and vice versa). R = transitions/transversions. R would be 1/2 if all mutations were equally likely. In DNA alignments, R is observed to be about 4: transitions are greatly favored over transversions. [Figure: the four bases A, C, G, and T with transition and transversion arrows.]

39 Jukes-Cantor with correction for transitions/transversions (Kimura 2-parameter model, $d_{K2P}$). [Matrix over A, C, G, T: each diagonal entry is 1 - 2β - α, each transition (A<->G, C<->T) has probability α, and each transversion has probability β.] Split changes (D) into the two types, transition (P) and transversion (Q): p-distance = D/L = P + Q, where P = transitions/L and Q = transversions/L. Then the corrected evolutionary distance is $d_{K2P} = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$.
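A minimal Python sketch of the Kimura 2-parameter distance follows. It assumes uppercase, gap-free sequences over A, C, G, T; the example sequences are invented and the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_changes(seq_a, seq_b):
    """Return (transitions, transversions) between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        # Same chemical class (both purines or both pyrimidines): transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def k2p_distance(P, Q):
    """d_K2P = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


# Invented example: one transition (A/G) and one transversion (C/A) in 10 sites.
seq_a, seq_b = "ACGTACGTAC", "GCGTACGTAA"
ts, tv = count_changes(seq_a, seq_b)
length = len(seq_a)
d_k2p = k2p_distance(ts / length, tv / length)
print(ts, tv, round(d_k2p, 4))
```

Here the p-distance is 0.2 while the corrected distance is a little larger, as with the other corrections; the difference from Jukes-Cantor comes from treating P and Q separately.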

40 Further corrections are possible, but... Do we need a nucleotide substitution matrix? Additional corrections are possible for: sequence position (gamma); isochores (GC-rich, AT-rich regions); methylated versus not methylated.

41 Review questions. What are we asserting about ancestry by aligning the sequences? What kind of data is used to generate a substitution matrix? What makes the PAM matrix Markovian? When should you align amino acid sequences, and when DNA? When is the p-distance an accurate measure of time? Why is time versus p-distance Poisson? Why are transitions more common than transversions? What does Kimura do that Jukes-Cantor does not?

Amino Acids and Their Properties

Amino Acids and Their Properties Amino Acids and Their Properties Recap: ss-rrna and mutations Ribosomal RNA (rrna) evolves very slowly Much slower than proteins ss-rrna is typically used So by aligning ss-rrna of one organism with that

More information

Bio-Informatics Lectures. A Short Introduction

Bio-Informatics Lectures. A Short Introduction Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively

More information

Introduction to Phylogenetic Analysis

Introduction to Phylogenetic Analysis Subjects of this lecture Introduction to Phylogenetic nalysis Irit Orr 1 Introducing some of the terminology of phylogenetics. 2 Introducing some of the most commonly used methods for phylogenetic analysis.

More information

Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6

Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6 Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6 In the last lab, you learned how to perform basic multiple sequence alignments. While useful in themselves for determining conserved residues

More information

Network Protocol Analysis using Bioinformatics Algorithms

Network Protocol Analysis using Bioinformatics Algorithms Network Protocol Analysis using Bioinformatics Algorithms Marshall A. Beddoe Marshall_Beddoe@McAfee.com ABSTRACT Network protocol analysis is currently performed by hand using only intuition and a protocol

More information

Pairwise Sequence Alignment

Pairwise Sequence Alignment Pairwise Sequence Alignment carolin.kosiol@vetmeduni.ac.at SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What

More information

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST 257 BLAST: Basic Local Alignment Search Tool p BLAST (Altschul et al., 1990) and its variants are some

More information

Protein Sequence Analysis - Overview -

Protein Sequence Analysis - Overview - Protein Sequence Analysis - Overview - UDEL Workshop Raja Mazumder Research Associate Professor, Department of Biochemistry and Molecular Biology Georgetown University Medical Center Topics Why do protein

More information

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:

More information

Clone Manager. Getting Started

Clone Manager. Getting Started Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software

More information

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the

More information

Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1

Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1 Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: Sonia.Casillas@uab.cat

More information

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,

More information

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

BLAST. Anders Gorm Pedersen & Rasmus Wernersson BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise

More information

Phylogenetic Trees Made Easy

Phylogenetic Trees Made Easy Phylogenetic Trees Made Easy A How-To Manual Fourth Edition Barry G. Hall University of Rochester, Emeritus and Bellingham Research Institute Sinauer Associates, Inc. Publishers Sunderland, Massachusetts

More information

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering

More information

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer DNA Insertions and Deletions in the Human Genome Philipp W. Messer Genetic Variation CGACAATAGCGCTCTTACTACGTGTATCG : : CGACAATGGCGCT---ACTACGTGCATCG 1. Nucleotide mutations 2. Genomic rearrangements 3.

More information

Bayesian Phylogeny and Measures of Branch Support

Bayesian Phylogeny and Measures of Branch Support Bayesian Phylogeny and Measures of Branch Support Bayesian Statistics Imagine we have a bag containing 100 dice of which we know that 90 are fair and 10 are biased. The

More information

Lab 2/Phylogenetics/September 16, 2002 1 PHYLOGENETICS

Lab 2/Phylogenetics/September 16, 2002 1 PHYLOGENETICS Lab 2/Phylogenetics/September 16, 2002 1 Read: Tudge Chapter 2 PHYLOGENETICS Objective of the Lab: To understand how DNA and protein sequence information can be used to make comparisons and assess evolutionary

More information

The Central Dogma of Molecular Biology

The Central Dogma of Molecular Biology Vierstraete Andy (version 1.01) 1/02/2000 -Page 1 - The Central Dogma of Molecular Biology Figure 1 : The Central Dogma of molecular biology. DNA contains the complete genetic information that defines

More information

Introduction to Bioinformatics 3. DNA editing and contig assembly

Introduction to Bioinformatics 3. DNA editing and contig assembly Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 matthewb@ba.ars.usda.gov

More information

Bioinformatics Grid - Enabled Tools For Biologists.

Bioinformatics Grid - Enabled Tools For Biologists. Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis

More information

DNA Sequence Alignment Analysis

DNA Sequence Alignment Analysis Analysis of DNA sequence data p. 1 Analysis of DNA sequence data using MEGA and DNAsp. Analysis of two genes from the X and Y chromosomes of plant species from the genus Silene The first two computer classes

More information

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999 Dr Clare Sansom works part time at Birkbeck College, London, and part time as a freelance computer consultant and science writer At Birkbeck she coordinates an innovative graduate-level Advanced Certificate

More information

Bioinformatics Resources at a Glance

Bioinformatics Resources at a Glance Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences

More information

A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML

A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML 9 June 2011 A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML by Jun Inoue, Mario dos Reis, and Ziheng Yang In this tutorial we will analyze

More information

BIOINFORMATICS TUTORIAL

BIOINFORMATICS TUTORIAL Bio 242 BIOINFORMATICS TUTORIAL Bio 242 α Amylase Lab Sequence Sequence Searches: BLAST Sequence Alignment: Clustal Omega 3d Structure & 3d Alignments DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

More information

PHYLOGENETIC ANALYSIS

PHYLOGENETIC ANALYSIS Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D. Baxevanis, B.F. Francis Ouellette Copyright 2001 John Wiley & Sons, Inc. ISBNs: 0-471-38390-2 (Hardback);

More information

Worksheet - COMPARATIVE MAPPING 1

Worksheet - COMPARATIVE MAPPING 1 Worksheet - COMPARATIVE MAPPING 1 The arrangement of genes and other DNA markers is compared between species in Comparative genome mapping. As early as 1915, the geneticist J.B.S Haldane reported that

More information

Using MATLAB: Bioinformatics Toolbox for Life Sciences

Using MATLAB: Bioinformatics Toolbox for Life Sciences Using MATLAB: Bioinformatics Toolbox for Life Sciences MR. SARAWUT WONGPHAYAK BIOINFORMATICS PROGRAM, SCHOOL OF BIORESOURCES AND TECHNOLOGY, AND SCHOOL OF INFORMATION TECHNOLOGY, KING MONGKUT S UNIVERSITY

More information

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004 Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004 Sequence Analysis Anything connected to identifying higher biological meaning out of raw sequence data. 2 Genomic & Proteomic Data Sequence

More information

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office

More information

Data for phylogenetic analysis

Data for phylogenetic analysis Data for phylogenetic analysis The data that are used to estimate the phylogeny of a set of tips are the characteristics of those tips. Therefore the success of phylogenetic inference depends in large

More information

Computational searches of biological sequences

Computational searches of biological sequences UNAM, México, Enero 78 Computational searches of biological sequences Special thanks to all the scientis that made public available their presentations throughout the web from where many slides were taken

More information

A branch-and-bound algorithm for the inference of ancestral. amino-acid sequences when the replacement rate varies among

A branch-and-bound algorithm for the inference of ancestral. amino-acid sequences when the replacement rate varies among A branch-and-bound algorithm for the inference of ancestral amino-acid sequences when the replacement rate varies among sites Tal Pupko 1,*, Itsik Pe er 2, Masami Hasegawa 1, Dan Graur 3, and Nir Friedman

More information

Core Bioinformatics. Degree Type Year Semester

Core Bioinformatics. Degree Type Year Semester Core Bioinformatics 2015/2016 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: Sonia.Casillas@uab.cat Teachers Use of

More information

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

More information

Probability, statistics and football Franka Miriam Bru ckler Paris, 2015.

Probability, statistics and football Franka Miriam Bru ckler Paris, 2015. Probability, statistics and football Franka Miriam Bru ckler Paris, 2015 Please read this before starting! Although each activity can be performed by one person only, it is suggested that you work in groups

More information

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations SCENARIO You have responded, as a result of a call from the police to the Coroner s Office, to the scene of the death of

More information

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:

More information

Current Motif Discovery Tools and their Limitations

Current Motif Discovery Tools and their Limitations Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.

More information

Bayesian coalescent inference of population size history

Bayesian coalescent inference of population size history Bayesian coalescent inference of population size history Alexei Drummond University of Auckland Workshop on Population and Speciation Genomics, 2016 1st February 2016 1 / 39 BEAST tutorials Population

More information

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference Stephane Guindon, F. Le Thiec, Patrice Duroux, Olivier Gascuel To cite this version: Stephane Guindon, F. Le Thiec, Patrice

More information

Principles of Evolution - Origin of Species

Principles of Evolution - Origin of Species Theories of Organic Evolution X Multiple Centers of Creation (de Buffon) developed the concept of "centers of creation throughout the world organisms had arisen, which other species had evolved from X

More information

Multiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker

Multiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker Multiple Sequence Alignment Hot Topic 5/24/06 Kim Walker Outline Why are Multiple Sequence Alignments useful? What Tools are Available? Brief Introduction to ClustalX Tools to Edit and Add Features to

More information

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question. Name: Class: Date: Chapter 17 Practice Multiple Choice Identify the choice that best completes the statement or answers the question. 1. The correct order for the levels of Linnaeus's classification system,

More information

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations AlCoB 2014 First International Conference on Algorithms for Computational Biology Thiago da Silva Arruda Institute

More information

Core Bioinformatics. Titulació Tipus Curs Semestre. 4313473 Bioinformàtica/Bioinformatics OB 0 1

Core Bioinformatics. Titulació Tipus Curs Semestre. 4313473 Bioinformàtica/Bioinformatics OB 0 1 Core Bioinformatics 2014/2015 Codi: 42397 Crèdits: 12 Titulació Tipus Curs Semestre 4313473 Bioinformàtica/Bioinformatics OB 0 1 Professor de contacte Nom: Sònia Casillas Viladerrams Correu electrònic:

More information

morephyml User Guide [Version 1.14] August 2011 by Alexis Criscuolo

morephyml User Guide [Version 1.14] August 2011 by Alexis Criscuolo morephyml User Guide [Version 1.14] August 2011 by Alexis Criscuolo ftp://ftp.pasteur.fr/pub/gensoft/projects/morephyml/ http://mobyle.pasteur.fr/cgi-bin/portal.py Please cite this paper if you use this

More information

Genome Explorer For Comparative Genome Analysis

Genome Explorer For Comparative Genome Analysis Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence

More information

Hierarchical Bayesian Modeling of the HIV Response to Therapy

Hierarchical Bayesian Modeling of the HIV Response to Therapy Hierarchical Bayesian Modeling of the HIV Response to Therapy Shane T. Jensen Department of Statistics, The Wharton School, University of Pennsylvania March 23, 2010 Joint Work with Alex Braunstein and

More information

MAKING AN EVOLUTIONARY TREE

MAKING AN EVOLUTIONARY TREE Student manual MAKING AN EVOLUTIONARY TREE THEORY The relationship between different species can be derived from different information sources. The connection between species may turn out by similarities

More information

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006 Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm

More information

Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1
