Protein Sequence Analysis - Overview - UDEL Workshop Raja Mazumder Research Associate Professor, Department of Biochemistry and Molecular Biology Georgetown University Medical Center
Topics Why do protein sequence analysis? Searching sequence databases (similarity search) Post-processing search results Protein classification & function prediction. Detecting remote homologs Multiple sequence alignment and Phylogenetic analysis
Protein bioinformatics: protein sequence analysis Helps characterize protein sequences in silico and allows prediction of protein structure and function Statistically significant BLAST hits usually signify sequence homology Homologous sequences may or may not have the same function but almost always (with very few exceptions) have the same structural fold Protein sequence analysis allows protein classification
Comparative protein sequence analysis and evolution Patterns of conservation in sequences allow us to determine which residues are under selective constraint (and thus likely important for protein function) Comparative analysis of proteins is more sensitive than comparing DNA Homologous proteins have a common ancestor Different proteins evolve at different rates Protein classification systems based on evolution: PIRSF and COG
Comparing proteins Amino acid sequence of a protein generated from a proteomics experiment, e.g. protein fragment DTIKDLLPNVCAFPMEKGPCQTYMTRWFFNFETGECELFAYGGCGGNSNNFLRKEKCEKFCKFT The amino acids of two sequences can be aligned, and we can easily count the number of identical residues (or use an index of similarity) as a measure of relatedness. Protein structures can be compared by superimposition
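Counting identical residues in an aligned pair can be sketched in a few lines. This is a minimal illustration, not a method from the workshop: the function name `percent_identity`, the choice to skip gapped columns, and the second (mutated) sequence in the usage note are all assumptions made for the example.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over the columns where neither sequence has a gap.

    Assumption: both inputs are already aligned (same length, gaps marked with `gap`).
    Other definitions divide by the full alignment length or the shorter sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns are not counted in this definition
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared
```

For instance, comparing the first ten residues of the fragment above, `DTIKDLLPNV`, with a hypothetical relative `DTIRDL-PNV` compares nine ungapped columns, eight of which are identical.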
Protein sequence alignment
Pairwise alignment:
a b a c d
a b _ c d
Multiple sequence alignment provides more information:
a b a c d
a b _ c d
x b a c e
MSA is difficult to do for distantly related proteins
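The toy pairwise alignment on this slide can be reproduced with a global alignment by dynamic programming. The slide does not name an algorithm; the sketch below uses Needleman-Wunsch as a stand-in, with illustrative scores (match +1, mismatch -1, gap -1) that are assumptions rather than values from the workshop. Real protein alignments would normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two sequences; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("_")
            i -= 1
        else:
            out_a.append("_")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("abacd", "abcd")
print(aligned_a)  # abacd
print(aligned_b)  # ab_cd
```

With these scores the two slide sequences align with a single gap opposite the second `a`, as shown in the pairwise example.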
Protein sequence analysis overview Protein databases PIR (pir.georgetown.edu) and UniProt (www.uniprot.org) Searching databases Peptide search, BLAST search, Text search Information retrieval and analysis Protein records at UniProt and PIR Multiple sequence alignment Secondary structure prediction Homology modeling
Query Sequence Unknown sequence is Q9I7I7 BLAST Q9I7I7 against the UniProt Knowledgebase (http://www.uniprot.org/search/blast.shtml) Analyze results
BLAST results
SIR2_HUMAN protein record
Are Q9I7I7 and SIR2_HUMAN homologs? Check BLAST results Check pairwise alignment
Protein structure prediction Programs can predict secondary structure information with 70% accuracy Homology modeling - prediction of target structure from closely related template structure
Secondary structure prediction http://bioinf.cs.ucl.ac.uk/psipred/
Secondary structure prediction results
Sir2 structure
Homology modeling http://www.expasy.org/swissmod/swiss-model.html
Homology model of Q9I7I7 Model quality coloring: Blue - excellent; Green - so-so; Red - not good. Secondary structure coloring: Yellow - beta sheet; Red - alpha helix; Grey - loop
Sequence features: SIR2_HUMAN
Multiple sequence alignment
Multiple sequence alignment Q9I7I7, Q82QG9, SIR2_HUMAN
Identifying Remote Homologs
Function prediction
Molecular Phylogenetics and Evolution Overview History of phylogenetics Sequence analysis and classification Methods in phylogenetic analysis
Phylogenetics Field of biology that studies the evolutionary relationships between organisms, proteins or genes that share a common ancestor Phylogenetics includes the discovery (estimation) of these relationships, and the study of the causes behind this pattern Phylogenetics is related to taxonomy
Tree of Life Aristotle (384-322 BC) classified all living organisms as either a plant or an animal. Whittaker (1969) summarized the "Five Kingdoms" of life: animals, plants, fungi, protists ("protozoa"), and monera (bacteria). R. H. Whittaker, Science 163, 150 (1969). Zuckerkandl et al. (1965) put forward the concept that sequences could be used to relate organisms. E. Zuckerkandl et al., J. Theor. Biol. 8, 357 (1965). Woese (1990) proposed "urkingdoms" or "domains": Eucarya (eukaryotes), Bacteria (initially called eubacteria), and Archaea (initially called archaebacteria). Woese et al., Proc. Natl. Acad. Sci. U.S.A. 87, 4576 (1990). Norman R. Pace, Science 276, 734-740 (1997).
History of Phylogenetics Charles Darwin. 1859. Author of On the Origin of Species. Ernst Haeckel. 1892. Mapped a genealogical tree relating all animal life. Romanes's 1892 copy of Ernst Haeckel's allegedly fraudulent embryo drawings.
Monophyly, Paraphyly & Polyphyly Source: Wikipedia, Phylogenetics
Molecular Phylogenetics Morphological or organismal character evolution is not as consistent as molecular evolution Can be used to study any organism Rates of evolution can be studied in greater detail Abundant data available
Evolutionary Change in DNA Several models have been proposed to study the mechanisms of DNA evolution. Jukes and Cantor's One-Parameter Model assumes no bias in the direction of change, so substitutions occur randomly among the four types of nucleotides. Kimura's Two-Parameter Model accounts for transitions generally being more frequent than transversions: the rate of transitional substitution differs from the rate of transversional substitution. Rate of change depends on the rate of substitution and the pattern of substitution. [Figure: an ancestral sequence (A C T G A A C G T A A C G C) diverging into Sequence 1 and Sequence 2, illustrating single, multiple, coincidental, parallel, convergent, and back substitutions. From Li and Graur 1991]
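The two models named on this slide each lead to a standard correction of the observed difference between two aligned DNA sequences. The sketch below uses the usual textbook formulas: Jukes-Cantor, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites, and Kimura two-parameter, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. The function names and the mutated example sequence are assumptions made for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) over columns where both are A/C/G/T."""
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if x == y:
            continue
        same_type = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
        if same_type:
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return sites, transitions, transversions


def jukes_cantor_distance(seq1, seq2):
    """One-parameter correction: all substitutions equally likely."""
    sites, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p_distance(seq1, seq2):
    """Two-parameter correction: separate rates for transitions and transversions."""
    sites, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    inner = (1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q) if Q < 0.5 else 0.0
    if inner <= 0.0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -0.5 * math.log(inner)


ancestral = "ACTGAACGTAACGC"   # 14-nucleotide string that appears in the slide text
derived = "GCAGAACGTAACGC"     # hypothetical: one transition (A>G), one transversion (T>A)
print(count_differences(ancestral, derived))
print(jukes_cantor_distance(ancestral, derived), kimura_2p_distance(ancestral, derived))
```

Both corrected distances come out larger than the raw proportion of differing sites, because each model allows for multiple and back substitutions that a simple count cannot see.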
Evolutionary Change in Protein Synonymous and nonsynonymous substitutions: Substitutions that result in amino acid replacements are said to be nonsynonymous, while substitutions that do not cause an amino acid replacement are said to be synonymous. Replacements can also occur within the same amino acid class (for example, hydrophobic or charged).
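The distinction can be made concrete by translating a codon before and after a substitution with the standard genetic code. In the sketch below the function name is hypothetical, and the grouping of amino acids into four classes is only one of several common groupings, chosen here as an assumption for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Assumed grouping for illustration; other classifications exist.
AA_CLASS = {}
for label, members in {"hydrophobic": "AVLIMFWP", "polar": "STNQCYG",
                       "positive": "KRH", "negative": "DE"}.items():
    for aa in members:
        AA_CLASS[aa] = label


def classify_substitution(codon_before, codon_after):
    """Return (kind, same_class) for a codon change.

    kind is 'identical', 'synonymous', or 'nonsynonymous'; same_class says whether
    the encoded amino acids fall in the same (assumed) physicochemical class.
    """
    before = CODON_TABLE[codon_before.upper().replace("U", "T")]
    after = CODON_TABLE[codon_after.upper().replace("U", "T")]
    if codon_before.upper() == codon_after.upper():
        kind = "identical"
    elif before == after:
        kind = "synonymous"
    else:
        kind = "nonsynonymous"
    cls = AA_CLASS.get(before)  # stop codons ('*') have no class
    same_class = cls is not None and cls == AA_CLASS.get(after)
    return kind, same_class


print(classify_substitution("GGT", "GGC"))  # Gly -> Gly
print(classify_substitution("GAT", "GAA"))  # Asp -> Glu
print(classify_substitution("GGT", "GAT"))  # Gly -> Asp
```

The second example is nonsynonymous yet stays within the negatively charged class, which is the kind of conservative replacement the last point on the slide refers to.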
Tutorial Retrieve 1FSI (PDB id) sequence and related sequences from UniProtKB using BLAST Align all the sequences in Clustal (desktop version) Generate tree (using Clustal) View tree (http://www.phylowidget.org/; http://www.proweb.org/treeviewer/)
Representation Of Phylogeny The evolutionary relationship between two proteins can be represented in the form of a tree. A phylogeny is a bifurcating tree with nodes, branches, and a root (which represents the common ancestor). [Figure: tree of homologous proteins 1a, 1b, 1c, and 1d, with the root, a node, a branch, and a clade labeled]
Terminology Clade: a monophyletic taxon. Taxon: any named group of organisms; not necessarily a clade. Branches: connect nodes. Nodes: any bifurcating branch point.
Common Phylogenetic Tree Layout Rectangular cladogram; slanted cladogram; phylogram (branch lengths proportional to distance); radial
Rooted vs. Unrooted Phylogenies Unrooted: shows only relationships, not the evolutionary path. Rooted: the root (R) is the common ancestor.
How to Construct A Phylogenetic Tree Construct a multiple sequence alignment Determine the substitution model Build tree Evaluate tree
Bootstrapping Bootstrapping is a resampling method for evaluating trees. The bootstrap value is a number associated with a particular branch in the tree that gives the proportion of bootstrap replicates supporting the monophyly of the clade. It is a two-step process: many new data sets are generated from the original set, and then a number is computed that tells how often a particular branch appears in the resulting trees
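The two steps can be sketched as follows. Step one resamples alignment columns with replacement to make each new data set. Step two would normally rebuild a tree from every replicate and check whether the branch of interest is present; to keep the example short, the code below substitutes a much cruder check (whether a given pair of sequences is the closest pair by uncorrected distance). That substitution, the function names, and the toy alignment are all assumptions for illustration.

```python
import random


def p_distance(a, b):
    """Proportion of alignment columns at which two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def bootstrap_replicate(alignment, rng):
    """Step 1: sample columns with replacement to build a data set of equal length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Simplified stand-in for tree building: the pair with the smallest p-distance."""
    names = sorted(alignment)
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = p_distance(alignment[names[i]], alignment[names[j]])
            if best is None or d < best[0]:
                best = (d, (names[i], names[j]))
    return best[1]


def bootstrap_support(alignment, pair, replicates=100, seed=0):
    """Step 2: fraction of replicates in which `pair` is recovered as the closest pair."""
    rng = random.Random(seed)
    target = tuple(sorted(pair))
    hits = sum(1 for _ in range(replicates)
               if closest_pair(bootstrap_replicate(alignment, rng)) == target)
    return hits / replicates


toy = {  # hypothetical aligned sequences
    "A": "ACDEFGHIKL",
    "B": "ACDEFGHIKM",
    "C": "ACDQFGWIKL",
    "D": "TCDQFGWLKL",
}
print(bootstrap_support(toy, ("A", "B"), replicates=200))
```

In a real analysis the `closest_pair` step would be replaced by building a full tree per replicate (for example with neighbor joining) and recording which clades appear.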
Distance - Neighbor-joining Method The NJ algorithm is commonly applied in distance-based tree building. The fully resolved tree is obtained by decomposing a fully unresolved star tree: a branch is inserted between a pair of closest neighbors and the remaining terminals in the tree, and the process is repeated. Rapid method.
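A compact sketch of the neighbor-joining procedure is given below, following the standard formulation: at each step the pair minimizing Q(i, j) = (n - 2) d(i, j) - r(i) - r(j) is joined, where r is the row sum of distances, and the joined pair is replaced by a new internal node. The function name, the Newick-style string output, and the five-taxon distance matrix are illustrative assumptions and are not taken from the workshop material.

```python
from itertools import combinations


def neighbor_joining(names, matrix):
    """Build an unrooted tree from a symmetric distance matrix; returns a Newick string."""
    if len(names) < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    d = {}
    for i, j in combinations(range(len(names)), 2):
        d[frozenset((names[i], names[j]))] = float(matrix[i][j])

    def dist(x, y):
        return d[frozenset((x, y))]

    nodes = list(names)  # start from the star tree: every taxon attached to one centre
    while len(nodes) > 2:
        n = len(nodes)
        r = {x: sum(dist(x, y) for y in nodes if y != x) for x in nodes}
        # Pick the pair of neighbors that minimizes the Q criterion.
        f, g = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * dist(p[0], p[1]) - r[p[0]] - r[p[1]])
        # Branch lengths from f and g to the new internal node.
        len_f = 0.5 * dist(f, g) + (r[f] - r[g]) / (2 * (n - 2))
        len_g = dist(f, g) - len_f
        new = f"({f}:{len_f:g},{g}:{len_g:g})"
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (f, g):
                d[frozenset((new, k))] = 0.5 * (dist(f, k) + dist(g, k) - dist(f, g))
        nodes = [k for k in nodes if k not in (f, g)] + [new]
    other, last = nodes  # `last` is always the most recently created internal node
    return f"{last[:-1]},{other}:{dist(other, last):g});"


taxa = ["a", "b", "c", "d", "e"]
distances = [  # hypothetical additive distances
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(taxa, distances))
```

Ties in Q are broken by taking the first minimum encountered, which can change the order of joins but, for additive distances such as these, not the resulting unrooted tree.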
Function Prediction From Evolutionary Classification Example PFK: Phosphofructokinase classification revealed that major functional specialization can result not only from major sequence changes but also from mutation of a single amino acid residue. Key residues (E. coli P06998 numbering): ATP-PFK: Gly105 + Gly125; PPi-PFK: Gly/Asp105 + Lys125. [Figure: classification tree of PFK families: ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274]
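The residue rule on this slide can be written as a small lookup, assuming the query has already been aligned to the E. coli reference so that positions 105 and 125 can be read off. Both function names and the toy alignment in the usage lines are hypothetical; the rule itself is only the one stated on the slide and is not a complete classifier.

```python
def residue_at_reference_position(aligned_ref, aligned_query, position, gap="-"):
    """Return the query residue aligned to the given 1-based reference position."""
    ref_index = 0
    for ref_res, query_res in zip(aligned_ref, aligned_query):
        if ref_res == gap:
            continue  # insertion in the query; does not advance reference numbering
        ref_index += 1
        if ref_index == position:
            return query_res
    raise ValueError(f"reference has fewer than {position} residues")


def classify_pfk(res105, res125):
    """Apply the slide's rule: ATP-PFK = Gly105 + Gly125; PPi-PFK = Gly/Asp105 + Lys125."""
    if res105 == "G" and res125 == "G":
        return "ATP-PFK"
    if res105 in ("G", "D") and res125 == "K":
        return "PPi-PFK"
    return "unclassified"


print(classify_pfk("G", "G"))  # ATP-PFK
print(classify_pfk("D", "K"))  # PPi-PFK

# Toy alignment (not real PFK sequences) showing how reference numbering skips gaps.
print(residue_at_reference_position("AC-GT", "ACWGA", 3))  # G
```

In practice the two residues would be extracted with `residue_at_reference_position` from an alignment that includes P06998, then passed to `classify_pfk`.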
Contact Myself- rm285@georgetown.edu UniProt- help@uniprot.org PIR- pirmail@georgetown.edu