Secondary structure assignment and prediction

Eran Eyal, May 2011

Talk overview
Secondary structure assignment
Why predict secondary structures in proteins
Methods to predict secondary structures in proteins
Machine learning approaches
Detailed description of several specific programs (PHD)
Performance and evaluation

Automatic assignment of secondary structures to a set of protein coordinates
Assigning secondary structures to known structures is a relatively simple bioinformatics task. Given exact definitions of the secondary structures, all we need to do is see which parts of the structure fall within each definition.
(Figure: α-helix)

Why assign secondary structures automatically and routinely?
Standardization
Easy visualization
Detection of structural motifs and improved sequence-structure searches
Structural alignment
Structural classification
(Figure: β-strand)

What basic structural information is used?
Hydrogen bond patterns (for example, a hydrogen bond angle θ > 120° and a distance r_HO < 2.5 Å)
Backbone dihedral angles
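The backbone dihedral angles mentioned above can be computed directly from atomic coordinates. The following is a minimal sketch, not taken from the slides: the function name `dihedral` and the test coordinates are illustrative assumptions. For phi the four atoms would be C(i-1), N(i), CA(i), C(i); for psi they would be N(i), CA(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points.

    0 degrees is the cis arrangement and 180 degrees is trans.
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)          # normal of the plane p0, p1, p2
    n2 = _cross(b1, b2)          # normal of the plane p1, p2, p3
    b1_len = math.sqrt(_dot(b1, b1))
    y = _dot(b1, _cross(n1, n2)) / b1_len
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates chosen so the answer is easy to check by hand.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

An assignment program would evaluate such angles for every residue and compare them with the helix and sheet regions of the Ramachandran plot.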

DSSP algorithm
The Dictionary of Secondary Structure of Proteins (DSSP) by Kabsch and Sander makes its sheet and helix assignments solely on the basis of backbone-backbone hydrogen bonds. DSSP defines a hydrogen bond when the bond energy, computed from a Coulomb approximation, is below -0.5 kcal/mol. The structural assignments are defined so that visually appealing, unbroken structures result. In case of overlaps, the α-helix is given first priority. The helix definition does not include the terminal residues that carry the initial and final hydrogen bonds of the helix. A minimal helix has two consecutive hydrogen bonds; single helix hydrogen bonds are left out and assigned as turns (state 'T'). β-sheet residues (state 'E') are defined as either having two hydrogen bonds in the sheet or being surrounded by two hydrogen bonds in the sheet. The minimal sheet consists of two residues in each partner segment.

STRIDE
The secondary STRuctural IDEntification method by Frishman and Argos uses an empirically derived hydrogen bond energy and phi-psi torsion angle criteria to assign secondary structure. Torsion angles are given α-helix and β-sheet propensities according to how close they are to the corresponding regions of the Ramachandran plot. The parameters are optimized to mirror the visual assignments made by crystallographers for a set of proteins. By construction, the STRIDE assignments agree better with the expert assignments than DSSP does, at least for the data set used to optimize the free parameters.
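The DSSP hydrogen bond criterion described above can be sketched in a few lines. The slides give only the -0.5 kcal/mol cutoff; the partial charges (0.42e and 0.20e) and the factor 332 come from the published Kabsch and Sander energy function. The function names and the collinear test geometry are illustrative assumptions.

```python
import math

Q1, Q2 = 0.42, 0.20      # partial charges on C=O and N-H (units of e)
F = 332.0                # converts e^2 / Angstrom to kcal/mol
CUTOFF = -0.5            # kcal/mol, cutoff quoted in the text

def hbond_energy(c, o, n, h):
    """Electrostatic energy between a C=O group and an N-H group.

    c, o, n, h are (x, y, z) coordinates in Angstrom.
    """
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)

def is_hbond(c, o, n, h):
    return hbond_energy(c, o, n, h) < CUTOFF

# Hypothetical collinear geometry N-H ... O=C with an N...O distance of 2.9 A.
n_atom, h_atom = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
o_atom, c_atom = (2.9, 0.0, 0.0), (4.13, 0.0, 0.0)
energy = hbond_energy(c_atom, o_atom, n_atom, h_atom)
```

Because the energy needs the amide hydrogen position, structures without explicit hydrogens require that position to be built from the backbone geometry first.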

Like DSSP, STRIDE assigns the shortest α-helix ('H') if it contains at least two consecutive i → i+4 hydrogen bonds. In contrast to DSSP, helices are elongated to include one or both edge residues if these have acceptable phi-psi angles. Similarly, a short helix can be vetoed: hydrogen bond patterns may be ignored if the phi-psi angles are unfavorable. The sheet category does not distinguish between parallel and anti-parallel sheets. The minimal sheet ('E') is composed of two residues. The dihedral angles are incorporated into the final sheet assignment criterion, as was done for the α-helix.

DEFINE
An algorithm by Richards and Kundrot that assigns secondary structures by matching Cα coordinates with a linear distance mask of the ideal secondary structures. First, strict matches are found; these are subsequently elongated and/or joined, allowing moderate irregularities or curvature. The algorithm locates the starts and ends of α- and 3₁₀-helices, β-sheets, turns and loops. With these classifications the authors are able to assign 90-95% of all residues to at least one of the given secondary structure classes.
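The rule that a minimal helix needs two consecutive i → i+4 hydrogen bonds, with isolated bonds reported as turns, can be illustrated as follows. This is a simplified sketch of the DSSP-style pattern logic only; the function name and the input format (a set of residue index pairs) are assumptions, and the phi-psi checks that STRIDE adds are not included.

```python
def assign_helix(hbond_pairs, n_residues, n=4):
    """Assign 'H' and 'T' states from i -> i+n hydrogen bonds.

    hbond_pairs holds (i, i + n) tuples, meaning the C=O of residue i
    accepts a hydrogen bond from the N-H of residue i + n.
    """
    turn = [(i, i + n) in hbond_pairs for i in range(n_residues)]
    state = ["-"] * n_residues

    # Two consecutive turns starting at i-1 and i make residues i..i+n-1
    # helical; the residues carrying the first and last bond are excluded.
    for i in range(1, n_residues - n):
        if turn[i - 1] and turn[i]:
            for k in range(i, i + n):
                state[k] = "H"

    # A turn that is not part of a helix marks the enclosed residues 'T'.
    for i in range(n_residues - n):
        if turn[i]:
            for k in range(i + 1, i + n):
                if state[k] == "-":
                    state[k] = "T"
    return "".join(state)

minimal_helix = assign_helix({(1, 5), (2, 6)}, 10)
single_turn = assign_helix({(1, 5)}, 10)
```

With the two bonds (1, 5) and (2, 6), residues 2 to 5 become helical, while residues 1 and 6, which carry the initial and final hydrogen bonds, stay outside the helix.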

Secondary structure prediction
Prediction of tertiary structure from the amino acid sequence is still a very difficult task.
Prediction of more local structural properties is easier.
Prediction of secondary structures and solvent accessibility (SAS) is important and more feasible.
Prediction of secondary structures is a bridge between the linear information and the 3D structure.
Programs in this field often employ different types of machine learning approaches.

Sequence:   A-C-H-Y-T-T-E-K-R-G-G-S-G-T-K-K-R-E-A
Prediction: H-H-H-H-H-H-H-H-O-O-O-O-O-S-S-S-S-S-S

The importance of and need for predicting secondary structures in proteins
The information might give clues about the function of the protein and the existence of specific structural motifs.
It is an intermediate step toward constructing a complete 3D model from the sequence.
Many degrees of freedom: long search, prone to errors.
Few degrees of freedom: fast search.
Secondary structure content also allows us to classify a protein into the basic structural classes based on its sequence alone.

Generations in algorithm development
First generation (for example, the Chou-Fasman method): uses statistics on the preferences of individual amino acids. Each amino acid has preferences for appearing in particular secondary structures. These can be determined by counting amino acids in the different secondary structures of known solved structures.
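The counting idea behind first-generation methods can be sketched as a propensity calculation: how often an amino acid occurs in a given state relative to how often it occurs overall. This is only the statistics step, not the full Chou-Fasman procedure, and the two short training pairs below are invented toy data rather than real structures.

```python
from collections import Counter

def propensities(sequences, structures):
    """Return {(amino_acid, state): propensity} from aligned strings.

    propensity = P(amino acid | state) / P(amino acid)
    Values above 1 mean the residue is enriched in that state.
    """
    aa_total, ss_total, joint = Counter(), Counter(), Counter()
    for seq, ss in zip(sequences, structures):
        for aa, state in zip(seq, ss):
            aa_total[aa] += 1
            ss_total[state] += 1
            joint[(aa, state)] += 1
    n = sum(aa_total.values())
    return {
        (aa, state): (joint[(aa, state)] / ss_total[state]) / (aa_total[aa] / n)
        for aa in aa_total
        for state in ss_total
    }

# Toy data (hypothetical): H = helix, E = strand.
toy_sequences = ["AAGG", "AEVV"]
toy_structures = ["HHEE", "HHEE"]
table = propensities(toy_sequences, toy_structures)
```

In this toy set alanine occurs only in helical positions, so its helix propensity is 2.0 and its strand propensity is 0.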

Second generation (for example, the GOR method): the improvements over the first generation were better statistics and statistical methods, and the use of a set of adjacent amino acids in the sequence (usually windows of 11-21 amino acids) rather than individual amino acids. The new statistics estimate the probability that an amino acid is in a particular secondary structure given that it is in the middle of a local sequence segment. Other segments similar to the given segment might also assist in the prediction. Different methods tried to match the segments to other segments in the 3D database by sequence alignment and other methods.
(Figure: the GOR method, strand table and helix table)
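A window-based predictor of the kind described above can be sketched as follows. The scoring table is a hypothetical placeholder: real methods such as GOR derive one score per state, window offset and amino acid from solved structures. The function names, the default window width of 17 and the tiny table are assumptions for illustration.

```python
def windows(seq, width=17, pad="-"):
    """Return one window per residue, centered on it and padded at the ends."""
    half = width // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + width] for i in range(len(seq))]

def predict(seq, table, states="CHE", width=17):
    """Pick, for every residue, the state with the highest summed score.

    table[state][(offset, amino_acid)] is the contribution of finding
    amino_acid at position center + offset; missing entries count as 0.
    Ties go to the first state listed in `states` (coil here).
    """
    half = width // 2
    result = []
    for window in windows(seq, width):
        scores = {
            s: sum(table.get(s, {}).get((k - half, aa), 0.0)
                   for k, aa in enumerate(window))
            for s in states
        }
        result.append(max(states, key=lambda s: scores[s]))
    return "".join(result)

# Hypothetical two-entry table: A at the center favors helix, V favors strand.
toy_table = {"H": {(0, "A"): 1.0}, "E": {(0, "V"): 1.0}}
prediction = predict("AVG", toy_table, width=3)
```

The point of the window is that the score for the central residue depends on its neighbors as well, which a single-residue propensity table cannot express.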

General problems of methods in generations I and II
The overall prediction rate was rather low:
Overall prediction: 60%
β-strand prediction: ~35%
Predictions included small secondary structure elements, without the ability to integrate them into longer elements such as those found in protein structures.

Third generation: the improvement of the programs in the third generation was mainly due to the incorporation of evolutionary information. This was done by looking at a multiple alignment of sequences similar to the sequence we wish to predict. Such information, presented as an MSA or in another way, includes plenty of information that cannot be obtained from a single sequence:
Which regions are more conserved
Which substitutions are allowed at each position
Information regarding interacting sites

Comparison of many sequences of a protein family helps to detect conserved regions.
Comparison of many sequences of a protein family helps to detect interactions in space.

SAARDFFRT--HAAGRFFTFT
SAARDFFRS--GTRAKFFTFT
TAARDFFRF-GKAA-KFFTFT
SAARRFFRTGDHAALDFFTFT
SAARRFFRWHGLAAIDFFTFT
AAARDFFRTGGHAAGRFFTFT
AAARDFFRSGGHAAGKFFTFT
AAARDFFRTGGHAAGKFFTFT
AAARRFFRTGAHAAGDFYTFS
AAARRFFRTGGHAAGDFFTFT
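How conserved regions show up in an alignment can be illustrated with a simple per-column score. The sketch below uses the ten aligned sequences from the slide and reports, for every column, the fraction of rows that carry the most common residue. The scoring scheme and the function name are assumptions; published methods typically use entropy or substitution-matrix-based scores.

```python
from collections import Counter

MSA = [
    "SAARDFFRT--HAAGRFFTFT",
    "SAARDFFRS--GTRAKFFTFT",
    "TAARDFFRF-GKAA-KFFTFT",
    "SAARRFFRTGDHAALDFFTFT",
    "SAARRFFRWHGLAAIDFFTFT",
    "AAARDFFRTGGHAAGRFFTFT",
    "AAARDFFRSGGHAAGKFFTFT",
    "AAARDFFRTGGHAAGKFFTFT",
    "AAARRFFRTGAHAAGDFYTFS",
    "AAARRFFRTGGHAAGDFFTFT",
]

def column_conservation(msa, gap="-"):
    """Fraction of rows holding the most common non-gap residue, per column."""
    n_rows = len(msa)
    scores = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c != gap)
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n_rows)
    return scores

conservation = column_conservation(MSA)
```

Columns 2 to 4 (AAR) and the FF pair score 1.0, while the first column, which mixes S, T and A, scores 0.5.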

Information obtained from an MSA might help in the prediction.
Because the fold of all members of the family is identical, every sequence can contribute to the structure prediction of any other sequence in the family.
The best MSA for this purpose includes many sequences of the family that are not too similar to one another.

Introduction to neural networks
Nerve cells (neurons) are the basic components of the nervous system.
Every neuron receives information from several other neurons through its dendrites.
The information is processed, and the neuron makes a binary decision whether to transfer a signal to other neurons.
The information is transferred along the axon.

Computational tasks that the nervous system performs:
Representation of data
Holding data
Learning procedures
Decision making
Pattern recognition

Neural networks - properties
Systems composed of many simple processors that are connected and work in parallel.
The information may be obtained by a learning process and is stored in the connections between the processors.

The perceptron
The perceptron models the action of a single neuron. It can classify only linearly separable cases.

Example: binary neuron
Inputs: S_i = 0, 1
Output: Θ(W1·S1 + W2·S2 − T)

AND gate: inputs s1, s2; W1 = ?, W2 = ?; T = 1.5
OR gate: inputs s1, s2; W1 = ?, W2 = ?; T = 0.5
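The binary neuron above can be written out in a few lines. The thresholds 1.5 (AND) and 0.5 (OR) are taken from the slide; the weights W1 = W2 = 1 are one possible answer to the question it poses, chosen here as an assumption. The function name is illustrative.

```python
def perceptron(inputs, weights, threshold):
    """Binary neuron: step function applied to the weighted sum minus T."""
    total = sum(w * s for w, s in zip(weights, inputs))
    return 1 if total - threshold > 0 else 0

WEIGHTS = (1.0, 1.0)          # assumed solution for both gates
CASES = [(0, 0), (0, 1), (1, 0), (1, 1)]

and_gate = [perceptron(c, WEIGHTS, 1.5) for c in CASES]
or_gate = [perceptron(c, WEIGHTS, 0.5) for c in CASES]
```

Both gates are linearly separable, which is why a single neuron suffices. XOR is the standard example of a function that no choice of W1, W2 and T can reproduce.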

In practice, some differentiable function is usually used instead of the step function.
(Figure: networks of layers, with input, internal representation and output; networks with feedback)

Training
A large training set is prepared.
The neural network receives the input and random initial values for the parameters (weights).
The network tries to maximize the number of correctly predicted cases by changing the values of the parameters (weights).
To test the network we evaluate its performance on a collection of solved examples (the test set).
The test set should be independent of the training set. The network should see this set for the first time during evaluation.
The test set should be large and representative. It is better to use a test set that has already been used to evaluate other programs designed to solve a similar task.

PHD: a third-generation program that uses neural networks
PHD is the most popular secondary structure prediction program. Although other programs reach the same accuracy, it is still very popular today.
The versions of this program implement and demonstrate the elements that are considered most important for prediction accuracy.
It demonstrates the use of machine learning approaches in this field.
Input: a sequence of amino acids. Similar sequences are found by a sequence database search, and an MSA is built.
The composition of this alignment is the input to the neural network, which is the core of the program.
Every position in the input sequence is expressed by 21 parameters: the percentage of each amino acid at that position, plus one more value that indicates the start/end of the sequence.
In addition, the input for each position includes global information about the protein composition and the sequence distance between the predicted region and the start/end positions.
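The 21-value encoding described above can be sketched as follows: 20 amino acid frequencies per alignment column plus one flag that is set when a window position falls beyond the start or end of the sequence. This is an illustration of the idea, not the PHD implementation; the function names, the window width of 13 and the two-row toy alignment are assumptions, and the global composition inputs are left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile(msa):
    """One 21-value row per alignment column: 20 frequencies + end flag 0."""
    rows = []
    for column in zip(*msa):
        residues = [c for c in column if c in AMINO_ACIDS]
        n = len(residues) or 1            # avoid dividing by zero on gap columns
        rows.append([residues.count(a) / n for a in AMINO_ACIDS] + [0.0])
    return rows

def window_input(prof, center, width=13):
    """Flatten the profile rows of a window into one network input vector."""
    half = width // 2
    spacer = [0.0] * 20 + [1.0]           # marks positions outside the sequence
    values = []
    for pos in range(center - half, center + half + 1):
        values.extend(prof[pos] if 0 <= pos < len(prof) else spacer)
    return values

toy_profile = profile(["AC", "AD"])       # hypothetical two-sequence alignment
first_input = window_input(toy_profile, 0, width=3)
```

For a window of width w the network therefore receives 21 × w values per predicted position.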

Importance of variability in the input sequences
(Figure: good alignment)

The neural network includes several layers:
Input layer: sequence -> structure
Intermediate layer: structure -> structure
Output system: summation of several networks
Output: the secondary structure with the highest score is the final prediction for that position

http://www.embl-heidelberg.de/predictprotein/

Comparison of secondary structure prediction tools
(Figure: assignment, reliability index and prediction for the periplasmic binding protein 4mbp)

(Figure: reliability index, PHD)

Combination of different prediction methods
Every method has errors, which can be classified into two general types:
1. Systematic errors
2. Non-systematic errors
Several methods can therefore be combined to increase the prediction accuracy.
The basic condition for a successful combination is that the errors of each individual method are not only systematic.
Several new methods exploit this fact: they train several neural networks independently and predict based on the average prediction of all the networks.
Another method (Jpred) takes as input the results of several existing methods and predicts based on them.
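Averaging the outputs of several independently trained networks, as described above, can be sketched like this. The per-state scores below are invented numbers standing in for network outputs, and the function name `jury` is an assumption.

```python
def jury(predictions, states="HEC"):
    """Average per-state scores over several predictors and take the best.

    predictions is a list with one entry per predictor; each entry is a
    list over sequence positions of {state: score} dictionaries.
    """
    n = len(predictions)
    result = []
    for per_position in zip(*predictions):
        average = {s: sum(p[s] for p in per_position) / n for s in states}
        result.append(max(states, key=average.get))
    return "".join(result)

# Hypothetical outputs of two networks for a two-residue sequence.
net1 = [{"H": 0.6, "E": 0.3, "C": 0.1}, {"H": 0.2, "E": 0.5, "C": 0.3}]
net2 = [{"H": 0.4, "E": 0.1, "C": 0.5}, {"H": 0.1, "E": 0.2, "C": 0.7}]
consensus = jury([net1, net2])
```

Taken alone, the first network would predict HE and the second CC; the averaged scores give HC. Averaging helps only to the extent that the individual errors are not shared by all predictors.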

Many web servers are available.
http://www.compbio.dundee.ac.uk/www-jpred/

To understand some of the sequence signals that might be used, we can consider the basic biochemistry of secondary structures.
The α-helix, for example, has a periodicity of 3.6 amino acids. Helices on the protein surface are expected to show a signal at this periodicity in the positions occupied by hydrophilic and hydrophobic side chains.
Finding hydrophobic amino acids at positions i, i+3, i+7, i+10, for example, is a strong indication of a helix.
http://bmerc-www.bu.edu/psa/
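One common way to quantify the 3.6-residue periodicity described above is a hydrophobic moment: each residue contributes a vector whose direction advances by 100° per position (360° / 3.6) and whose length is its hydrophobicity. The sketch below is an illustration, not a method taken from the slides; the function name and the binary +1/-1 hydrophobicity values are assumptions.

```python
import math

def hydrophobic_moment(hydrophobicity, angle_deg=100.0):
    """Length of the vector sum of hydrophobicities around the helix axis.

    hydrophobicity is a list with one number per residue; angle_deg is
    the rotation per residue (100 degrees for an ideal alpha-helix).
    """
    step = math.radians(angle_deg)
    sx = sum(h * math.cos(k * step) for k, h in enumerate(hydrophobicity))
    sy = sum(h * math.sin(k * step) for k, h in enumerate(hydrophobicity))
    return math.hypot(sx, sy)

# Hypothetical 18-residue segment: +1 (hydrophobic) on one face of the
# helix, -1 (hydrophilic) on the opposite face.
amphipathic = [1 if math.cos(math.radians(100 * k)) > 0 else -1
               for k in range(18)]
helix_signal = hydrophobic_moment(amphipathic, 100.0)
other_signal = hydrophobic_moment(amphipathic, 180.0)
```

A segment whose moment is clearly larger at 100° than at other angles carries the helical periodicity signal.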

(Figure: α-helix in myoglobin)

Similarly, in surface β-strands there is a preference for a zigzag pattern: for example, hydrophilic side chains at positions i, i+2, i+4, ... and hydrophobic side chains at positions i+1, i+3, i+5, ...
(Figure: β-strand of CD8)

Related topics
Prediction of secondary structures of membrane proteins
Prediction of solvent accessibility

Average prediction accuracies (based on the 480-protein set) for 2-state solvent accessibility prediction

Rel. Acc. (%)   PSIBLAST (%)   HMMER2 (%)   Combined (%)
25              75.0           74.2         76.2
5               79.0           78.8         79.8
0               86.6           86.3         86.5