The Ramachandran Map of More Than. 6,500 Perfect Polypeptide Chains



Similar documents
DBDB : a Disulfide Bridge DataBase for the predictive analysis of cysteine residues involved in disulfide bridges

Data Mining Analysis of HIV-1 Protease Crystal Structures

CSC 2427: Algorithms for Molecular Biology Spring Lecture 16 March 10

Amino Acids. Amino acids are the building blocks of proteins. All AA s have the same basic structure: Side Chain. Alpha Carbon. Carboxyl. Group.

The peptide bond Peptides and proteins are linear polymers of amino acids. The amino acids are

Assessing Checking the the reliability of protein-ligand structures

This class deals with the fundamental structural features of proteins, which one can understand from the structure of amino acids, and how they are

Advanced Medicinal & Pharmaceutical Chemistry CHEM 5412 Dept. of Chemistry, TAMUK

Lecture 19: Proteins, Primary Struture

Structure of proteins

Refinement of a pdb-structure and Convert

Hydrogen Bonds The electrostatic nature of hydrogen bonds

(c) How would your answers to problem (a) change if the molecular weight of the protein was 100,000 Dalton?

Structure Tools and Visualization

Peptide bonds: resonance structure. Properties of proteins: Peptide bonds and side chains. Dihedral angles. Peptide bond. Protein physics, Lecture 5

Structural Bioinformatics Main resource on the web: PDB History Kind of data How to search

PROTEINS THE PEPTIDE BOND. The peptide bond, shown above enclosed in the blue curves, generates the basic structural unit for proteins.

Paper: 6 Chemistry University I Chemistry: Models Page: 2 of Which of the following weak acids would make the best buffer at ph = 5.0?

Tree-Based Consistency Approach for Cloud Databases

Computational Systems Biology. Lecture 2: Enzymes

the recursion-tree method

Scoring Functions and Docking. Keith Davies Treweren Consultants Ltd 26 October 2005

Discrete representations of the protein C. chain Xavier F de la Cruz 1, Michael W Mahoney 2 and Byungkook Lee

Role of Hydrogen Bonding on Protein Secondary Structure Introduction

Combinatorial Biochemistry and Phage Display

The peptide bond is rigid and planar

Part A: Amino Acids and Peptides (Is the peptide IAG the same as the peptide GAI?)

Peptide Bond Amino acids are linked together by peptide bonds to form polypepetide chain.

The FoldX web server: an online force field

Disulfide Bonds at the Hair Salon

Sidechain Torsional Potentials and Motion of Amino Acids in Proteins:

SAnDReS Tutorial 01 Prof. Dr. Walter F. de Azevedo Jr.

Molecular Docking. - Computational prediction of the structure of receptor-ligand complexes. Receptor: Protein Ligand: Protein or Small Molecule

IV. -Amino Acids: carboxyl and amino groups bonded to -Carbon. V. Polypeptides and Proteins

Carbohydrates, proteins and lipids

Structure Check. Authors: Eduard Schreiner Leonardo G. Trabuco. February 7, 2012

Structure Determination

Nafith Abu Tarboush DDS, MSc, PhD

A software tool for the prediction of Xaa-Pro peptide bond conformations in proteins based on 13 C chemical shift statistics

INTRODUCTION TO PROTEIN STRUCTURE

Helices From Readily in Biological Structures

computer programs mmlib Python toolkit for manipulating annotated structural models of biological macromolecules

Contributing Efforts of Various String Matching Methodologies in Real World Applications

1 Introduction. Dr. T. Srinivas Department of Mathematics Kakatiya University Warangal , AP, INDIA

Built from 20 kinds of amino acids

18.2 Protein Structure and Function: An Overview

Guide for Bioinformatics Project Module 3

Proteins the primary biological macromolecules of living organisms

Identification of Domains and Domain Interface Residues in Multidomain Proteins From Graph Spectral Method

Consensus alignment server for reliable comparative modeling with distant templates

Molecular Visualization. Introduction

Character Image Patterns as Big Data

Section I Using Jmol as a Computer Visualization Tool


FTIR Analysis of Protein Structure

NO CALCULATORS OR CELL PHONES ALLOWED

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data

Tutorial for proteome data analysis using the Perseus software platform

Pipe Cleaner Proteins. Essential question: How does the structure of proteins relate to their function in the cell?

Systematic assessment of cancer missense mutation clustering in protein structures

1. A covalent bond between two atoms represents what kind of energy? a. Kinetic energy b. Potential energy c. Mechanical energy d.

Gold (Genetic Optimization for Ligand Docking) G. Jones et al. 1996

Recap. Lecture 2. Protein conformation. Proteins. 8 types of protein function 10/21/10. Proteins.. > 50% dry weight of a cell

Lecture Overview. Hydrogen Bonds. Special Properties of Water Molecules. Universal Solvent. ph Scale Illustrated. special properties of water

Peptide Bonds: Structure

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions

Shu-Ping Lin, Ph.D.

1. Molecular computation uses molecules to represent information and molecular processes to implement information processing.

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

Local Search The perfect guide

Amino Acids, Proteins, and Enzymes. Primary and Secondary Structure Tertiary and Quaternary Structure Protein Hydrolysis and Denaturation

Tutorial for Proteomics Data Submission. Katalin F. Medzihradszky Robert J. Chalkley UCSF

Transcription and Translation of DNA

A. A peptide with 12 amino acids has the following amino acid composition: 2 Met, 1 Tyr, 1 Trp, 2 Glu, 1 Lys, 1 Arg, 1 Thr, 1 Asn, 1 Ile, 1 Cys

Fast Sequential Summation Algorithms Using Augmented Data Structures

Introduction to Protein Folding

Proteins. Proteins. Amino Acids. Most diverse and most important molecule in. Functions: Functions (cont d)

An Overview of Knowledge Discovery Database and Data mining Techniques

Phase determination methods in macromolecular X- ray Crystallography

Recognizing Organic Molecules: Carbohydrates, Lipids and Proteins

Master of Science in Computer Science College of Computer and Information Science. Course Syllabus

STRUCTURAL QUALITY ASSURANCE

EMBL-EBI. 3D databases and data warehouse technology

Provincial Exam Questions. 9. Give one role of each of the following nucleic acids in the production of an enzyme.

Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA

A Review of Data Mining Techniques

arxiv:cond-mat/ v1 [cond-mat.stat-mech] 6 Sep 1997

Molecular Genetics. RNA, Transcription, & Protein Synthesis

THERMODYNAMIC PROPERTIES OF POLYPEPTIDE CHAINS. PARALLEL TEMPERING MONTE CARLO SIMULATIONS

Spatio-Temporal Mapping -A Technique for Overview Visualization of Time-Series Datasets-

Conditional Random Fields: An Introduction

Prediction Center s Data Guide

How To Use Neural Networks In Data Mining

file:///c /Documents%20and%20Settings/terry/Desktop/DOCK%20website/terry/Old%20Versions/dock4.0_faq.txt

Visual Structure Analysis of Flow Charts in Patent Images

13.2 Ribosomes & Protein Synthesis

Lecture 4: Peptides and Protein Primary Structure [PDF] Key Concepts. Objectives See also posted Peptide/pH/Ionization practice problems.

The Steps. 1. Transcription. 2. Transferal. 3. Translation

On Efficiently Capturing Scien3fic Proper3es in Distributed Big Data without Moving the Data:

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

Transcription:

The Ramachandran Map of More Than 1 6,500 Perfect Polypeptide Chains Zoltán Szabadka, Rafael Ördög, Vince Grolmusz manuscript received March 19, 2007 Z. Szabadka, R. Ördög and V. Grolmusz are with Eötvös University, H-1117 Budapest, Hungary

2 Abstract The Protein Data Bank (PDB) [3] is the most important depository of protein structural information, containing more than 42,000 deposited entries today. Because of its inhomogeneous structure, its fully automated processing is almost impossible. In a previous work, we cleaned and re-structured the entries in the Protein Data Bank, and from the result we have built the RS-PDB database [11]. Using the RS-PDB database, here we draw a Ramachandran-plot from 6,593 perfect polypeptide chains found in the PDB, containing 1,192,689 residues, this is more than tenfold increase in the size of data analyzed before this work. The density of the data points makes possible to draw a logarithmic heat map enhanced Ramachandran map, showing the fine inner structure of the right-handed α-helix region. Keywords: Protein Data Bank, biochemical databases, database cleaning, Ramachandran map, torsion angles, multi-protein Ramachandran plot

3 I. INTRODUCTION The Protein Data Bank [3] developed as the publicly available depository of the three-dimensional structural data of proteins and nucleic acids. As long as most researchers were interested in one protein, it was perfectly adequate to use human assisted pre-processing of that pdb-file for review or structural studies; frequently even the accompanying journal publication had to be reviewed for proper interpretation of the data deposited. Today, however, the PDB contains more than 42,000 entries, and lots of studies would use this treasury, involving even hundreds or thousands of proteins. Unfortunately, the inhomogeneous structure of the PDB makes the completely automated processing difficult. The PDBsum pictorial database [7], [1] contains lots of improvements over the original PDB, but it is mainly a graphical interface to the original PDB. In [12] we presented an algorithm for reliably finding protein-ligand complexes in the Protein Data Bank, and also for repairing certain inconsistencies in the database. Seeking methods for arranging protein structural data by strict mathematical rules, we further developed the ideas from [12]: we have built the Rich-Structure PDB (abbreviated RS-PDB) database from the PDB data [11]. That database contains the result of a mathematical procedure in which every peptide- or nucleotide-chain and ligand molecule were re-built, verified and classified by completely automatic processing using basically graph-theoretic algorithms [11]. In [6], using the RS-PDB database, we identified protein-ligand binding sites and analyzed the residue-composition on the binding sites from the whole Protein Data Bank. In the present work we intend to analyze the torsion angles of the peptide backbone

4 in those peptide-chains in the PDB which satisfy some strict quality requirements; however, we do not restrict our study to some small subset of data: from the strict structure of the RS-PDB database one is able to choose and process thousands of peptide-chains for such a study. Our goal was to draw a Ramachandran map [9] of as large high-quality data set as possible from the PDB. The Ramachandran map is an unparalleled tool in protein structure validation and prediction, and the better understanding of its properties and the shapes of its regions is an active research field today (e.g., [10], [5], [8]). As reported in the literature, polypeptide chains are usually chosen by the resolution of the underlying PDB entry as the quality criterion; the numerous redundancies in the PDB [12] are usually addressed by some constraints on the residue sequence. For the drawing and the analysis of higher-order Ramachandran plots in [10], PDB entries with better than 1 Å resolution and less than 25% sequence similarity were chosen: the data set contained 51 structures and 10,976 φ ψ pairs. In the deep analysis of [5], a dataset of [8] was used, consisting of 500 nonhomogeneous structures of resolution better than 1.8 Å; the set contained 97,368 residues. In our present work, we use more intricate quality criteria and found 6,593 pairwise different high quality polypeptide chains, containing 1,192,689 residues. The Ramachandran map of our data-set is given on Figure 1.

5 II. METHODOLOGY We used the June 1, 2006 version of the PDB for our work. We intended to process as many polypeptide chains as it was possible without sacrificing data validity. Several authors filtered the PDB data by First, we have chosen those polypeptide chains, where all the atomic coordinates of the heavy (i.e., non-hydrogen) atoms were given in the PDB mmcif file. We verified this by comparing the entries of the residue-sequence information with the atomic coordinate-data. This step was done since missing atoms either in the polypeptide chains or in the side-chains would imply one or more incorrect φ ψ angle readings. Next, the residue-composition of the polypeptide chains were reviewed. We have chosen only those chains where no modified residues were present: modified residues, due to their different geometry, would imply different φ ψ angle distributions. For the list of modified residues found in the PDB, see Table 2 in [12]. Next, we verified the correctness of the lengths of the covalent bonds in the polypeptide chains. In the process of building the RS-PDB database [11], [6], we generated a table, containing bond errors, as follows: By building a kd-tree [2], all the atom-pairs of distance less than 6 Åwere identified. If two atoms from different residues or from different polypeptide-chains were placed within covalent distance, we enter this fact to the bond-error table. If two atoms from the same residue were placed within covalent distance, but between them no covalent bond may exist, then we also enter this fact to the bond-error table. We use only those polypeptide chains for constructing the Ramachandran-plot, what do not contain bond errors, listed in the database table.

6 As the last step, we filtered out the homogeneous sequences by using an easily applicable hash [4] function (an MD5 hashing), and the remaining set consisted of pairwise non-homogeneous polypeptide chains. After this filtering procedure, we were left with 6,593 polypeptide chains containing 1,192,689 residues.

7 Fig. 1. The logarithmic heat-map enhanced Ramachandran plot of 1,192,689 residues. The colors correspond to the natural logarithm of the density of the data-points in the map. Note the density distribution on the right-handed α-helix region. III. RESULTS AND DISCUSSION Because of the large size of our data-set, we were able to draw a logarithmic heat-map enhanced Ramachandran plot [9] from 6,593 polypeptide chains containing 1,192,689 residues on Figure 1. The colors correspond to the natural logarithm of the density of the data-points in the map. Note the density distribution in the righthanded α-helix region, close to the point (-60,-50): using the color-coding, one get clearly quantitative distribution of the densities. By the best of our knowledge, no such heat-map enhanced Ramchandran-plot appeared before in the literature, since the data sets were clearly too small.

8 REFERENCES [1] Laskowski R. A., Chistyakov V. V., and Thornton J. M. Pdbsum more: new summaries and analyses of the known 3d structures of proteins and nucleic acids. Nucleic Acids Res., 2005. [2] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509 517, 1975. [3] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28:235 242, 2000. [4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, Cambridge, MA, second edition, 2001. [5] Bosco K. Ho, Annick Thomas, and Robert Brasseur. Revisiting the Ramachandran plot: Hard-sphere repulsion, electrostatics, and H-bonding in the α-helix. Protein Science, 12:2508 2522, 2003. [6] Gabor Ivan, Zoltan Szabadka, and Vince Grolmusz. Cysteine and tryptophane anomalies found when scanning all the binding sites in the Protein Data Bank. In submitted, 2007. [7] R. A. Laskowski. PDBsum: summaries and analyses of PDB structures. Nucleic Acids Research, 29:221 222, 2001. [8] S.C. Lovell, I.W. Davis, W.B. Arendall III, P.I.W. de Bakker, J.M. Word, M.G. Prisant, J.S. Richardson, and D.C. Richardson. Structure validation by C α geometry: ϕ, ψ and C β deviation. Proteins, 50:437 450, 2003. [9] G. N. Ramachandran and V. Sasisekharan. J. Mol. Biol., 7:95 99, 1963. [10] Gregory E. Sims, In-Geol Choi, and Sung-Hou Kim. Protein conformational space in higher order ϕ-ψ maps. Proc. Nat. Acad. Sci., 102:618 621, 2005. doi:10.1073/pnas.0408746102. [11] Zoltan Szabadka and Vince Grolmusz. Building a structured PDB: The RS-PDB database. Proceedings of the 28th IEEE EMBS Annual International Conference, New York, NY, Aug. 30-Sept 3, 2006, pages 5755 5758, 2006. [12] Zoltan Szabadka and Vince Grolmusz. High throughput processing of the structural information in the Protein Data Bank. J. Mol. Graph. Model., 25(6):831 836, 2007.