ANALYSIS OF PROTEOME COMPEXITY BASED ON COUNTING DOMAIN-TO-PROTEIN LINKS

Similar documents
Can Auto Liability Insurance Purchases Signal Risk Attitude?

What is Candidate Sampling

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).

CHAPTER 5 RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by

Lecture 2 Sequence Alignment. Burr Settles IBS Summer Research Program 2008 bsettles@cs.wisc.edu

Forecasting the Direction and Strength of Stock Market Movement

CHAPTER 14 MORE ABOUT REGRESSION

Ring structure of splines on triangulations

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

Portfolio Loss Distribution

STATISTICAL DATA ANALYSIS IN EXCEL

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

An Alternative Way to Measure Private Equity Performance

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

How To Calculate The Accountng Perod Of Nequalty

DEFINING %COMPLETE IN MICROSOFT PROJECT

An Interest-Oriented Network Evolution Mechanism for Online Communities

FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

1. Measuring association using correlation and regression

Brigid Mullany, Ph.D University of North Carolina, Charlotte

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES

Conversion between the vector and raster data structures using Fuzzy Geographical Entities

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM

Two Faces of Intra-Industry Information Transfers: Evidence from Management Earnings and Revenue Forecasts

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits

Statistical Methods to Develop Rating Models

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis

Do Changes in Customer Satisfaction Lead to Changes in Sales Performance in Food Retailing?

Damage detection in composite laminates using coin-tap method

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance

Data Visualization by Pairwise Distortion Minimization

Calibration and Linear Regression Analysis: A Self-Guided Tutorial

Extending Probabilistic Dynamic Epistemic Logic

A DATA MINING APPLICATION IN A STUDENT DATABASE

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and

Macro Factors and Volatility of Treasury Bond Returns

Factors Affecting Outsourcing for Information Technology Services in Rural Hospitals: Theory and Evidence

Traffic State Estimation in the Traffic Management Center of Berlin

Properties of Indoor Received Signal Strength for WLAN Location Fingerprinting

Properties of real networks: degree distribution

Sketching Sampled Data Streams

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College

Returns to Experience in Mozambique: A Nonparametric Regression Approach

Analysis of Premium Liabilities for Australian Lines of Business

Imperial College London

Automated information technology for ionosphere monitoring of low-orbit navigation satellite signals

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

The Choice of Direct Dealing or Electronic Brokerage in Foreign Exchange Trading

INVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS

A statistical approach to determine Microbiologically Influenced Corrosion (MIC) Rates of underground gas pipelines.

A Fast Incremental Spectral Clustering for Large Data Sets

where the coordinates are related to those in the old frame as follows.

Risk Model of Long-Term Production Scheduling in Open Pit Gold Mining

Self-Consistent Proteomic Field Theory of Stochastic Gene Switches

Implementation of Deutsch's Algorithm Using Mathcad

HÜCKEL MOLECULAR ORBITAL THEORY

Performance Analysis and Coding Strategy of ECOC SVMs

1 Example 1: Axis-aligned rectangles

Copulas. Modeling dependencies in Financial Risk Management. BMI Master Thesis

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION

Abstract. 260 Business Intelligence Journal July IDENTIFICATION OF DEMAND THROUGH STATISTICAL DISTRIBUTION MODELING FOR IMPROVED DEMAND FORECASTING

SIMPLE LINEAR CORRELATION

Chapter 8 Group-based Lending and Adverse Selection: A Study on Risk Behavior and Group Formation 1

Is There A Tradeoff between Employer-Provided Health Insurance and Wages?

Efficient Project Portfolio as a tool for Enterprise Risk Management

Covariate-based pricing of automobile insurance

BERNSTEIN POLYNOMIALS

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION

Management Quality and Equity Issue Characteristics: A Comparison of SEOs and IPOs

A Probabilistic Theory of Coherence

L10: Linear discriminants analysis

Microarray data normalization and transformation

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

A Secure Password-Authenticated Key Agreement Using Smart Cards

Multiple-Period Attribution: Residuals and Compounding

Product-Form Stationary Distributions for Deficiency Zero Chemical Reaction Networks

Management Quality, Financial and Investment Policies, and. Asymmetric Information

Marginal Benefit Incidence Analysis Using a Single Cross-section of Data. Mohamed Ihsan Ajwad and Quentin Wodon 1. World Bank.

IT09 - Identity Management Policy

"Research Note" APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES *

Enterprise Master Patient Index

How To Analyze The Flow Patterns Of A Fracture Network

Staff Paper. Farm Savings Accounts: Examining Income Variability, Eligibility, and Benefits. Brent Gloy, Eddy LaDue, and Charles Cuykendall

The impact of hard discount control mechanism on the discount volatility of UK closed-end funds

Description of the Force Method Procedure. Indeterminate Analysis Force Method 1. Force Method con t. Force Method con t

When do data mining results violate privacy? Individual Privacy: Protect the record

The OC Curve of Attribute Acceptance Plans

Evaluating the generalizability of an RCT using electronic health records data

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features

Time Domain simulation of PD Propagation in XLPE Cables Considering Frequency Dependent Parameters

ADVERSE SELECTION IN INSURANCE MARKETS: POLICYHOLDER EVIDENCE FROM THE U.K. ANNUITY MARKET *

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Transcription:

COMUTATIONAL STRUCTURAL AND FUNCTIONAL ROTEOMICS ANALYSIS OF ROTEOME COMEXITY BASED ON COUNTING DOMAIN-TO-ROTEIN LINKS Kuznetsov V.A.*, ckalov V.V. 2, Knott G.D. 3, Kanapn A.A. 4 CIT/NIH & SRA Internatonal, Inc. Bethesda, MD, 20892, USA, e-mal: vk28u@nh.gov; 2 Insttute of Theoretcal and Appled Mechancs SB RAS, Novosbrsk, Russa, e-mal: pckalov@tam.nsc.ru; 3 Cvlzed Software, Inc., Slver Sprng, MD, 20906, USA; 4 EMBL-EBI Wellcome Trust Genome Campus, Hnxton, CB0 SD, UK, * Correspondng author: e-mal: alex@eb.ac.uk Keywords: proteome complexty, evoluton, doman-to-proten lnks, skew dstrbutons Summary We defne the lst of domans to protens, together wth the numbers of ther occurrences (lnks to protens) found n the proteome of an organsm to be the doman-to-proten lnkage profle (DL) of the proteome. We estmated the DL for the 56 fully-sequenced genome organsms represented n the Interro database. Ths work presents several quanttatve measures of the complexty of a proteome based on the DL. For each of the 56 studed organsms, we found two large subsets of domans: the domans whch occur two or more tmes n at least one proten, and the domans whch are not duplcated wthn any proten of the proteome. The latter set of domans well reflects the ncreasng trend of bologcal complexty due to evoluton. The statstcal dstrbutons of the number of doman-to-proten lnks n the proteome and the estmates of the dfferences between the DLs for pars of organsms are used as measures of relatve bologcal complexty of the organsms. In partcular, we show quanttatvely the greater complexty of the human proteome, relatve to that of a mouse or a rat. These dfferences are only partally reflected n the number of proten-codng genes estmated for these speces by the sequencng genome projects. Introducton There are no consensus quanttatve measures of the relatve or absolute complexty of an organsm. The bologcal complexty of an organsm should be characterzed by the lst of all protens n a gven proteome and ther dynamc nteractons among themselves (proten network), the other molecules (proten-dna network, etc.). However, the total number of putatve proten sequences based on complete genome sequence data of a gven organsm can be predcted only approxmately, and our knowledge on dynamc nteractons of protens n an organsm s also essentally ncomplete. Thus, we need a more tractable and practcal approaches to descrbng the bologcal complexty. rotens contan short structural and/or functonal 'blocks' (sequences of amno acds) called structural motfs (of length 5-35 nt) and structural domans (of length 30-250 nt) that are seen repeatedly n many protens of all speces whose genome have been examned. The entre number of domans n nature s probably a very lmted number. The estmated number of such classes of homologous sequences of motfs and domans n nature ranges from 6,000 to 0,000 or so [-3]. In ths work, all such proten blocks wll be called, for brevty, domans. The domans are essental to the bologcal functon (s) of the proten n whch they occur and serve as evolutonarly conserved buldng blocks for the formng of protens. The domans are correspondng to specfc sequences of DNA wthn genes whch have been evolutonarly conserved. DNA correspondng to a doman may occur multple tmes n a gven proten-codng gene and/or n many dfferent proten-codng genes wthn a gven genome, and n the genomes of many speces. We attempt to determne the measures of the bologcal complexty based on a lmted set of domans. 293

If a doman D occurs n the proten, we say ths consttutes a doman-to-proten lnk. The lst j of domans together wth the numbers of occurrences of each doman (lnks) found n the proteome of an organsm s called the doman-to-proten lnkage profle (DL) of the proteome. The DL should allow us to characterze the doman-to-proten lnk network for each organsm. Currently, we do not know all domans and all protens n a gven organsm. We can study the DL of organsm by observng sample DLs n representatve proteomes (.e. proten sets whch were specfed for organsms). However, several proten doman/motf databases provde large enough samples of DLs for a varety of organsms. We can analyze the samples of doman-to-proten lnkage nformaton n fully-sequenced genome organsms to categorze ther proteome complextes. Ths work presents several quanttatve measures of the complexty of a proteome based on observed DLs that appear to be approprate and consstent. roten Doman Database Analyzer Our nformaton about DLs of the 56 fully-sequenced genome archaeal, bacteral and eukaryotc organsms was obtaned from the Interro database (www.eb.ac.uk/nterpro, Aprl, 2004) [5], the proten sequences for the organsms were obtaned from the roteome Analyss Database (www.eb.ac.uk/proteome, Aprl 2004). We developed a roten Doman Database Analyzer (DDA) program whch we used to access the Interro database and download data nto a local MySQL relatonal database []. The MySQL database conssts of a N tot L table, where N tot = 9609 domans; and L = 56 organsms. Each row corresponds to an Interro doman and each column corresponds to an organsm. The (, S) -th entry of the table s the number of occurrences of the doman n the proteome of organsm S. Informaton of ths table was analyzed by DDA data mnng tools, whch nclude the statstcal and graphcal functons, logcal functons, descrptve statstcs, and correlaton analyss. Defntons Let us focus on a specfc organsm S. Let A = { a, a2,... an } be the set of observed domans contaned n the protens of the organsm S. Let B = { b, b2,..., b} be the set of protens n the representatve proteome (sample of observed protens) for the organsm S. We defne the N adjacency matrx C ': = [ c' j ] ( =,2,..., N ; j =,2,..., ) for the organsm S as follows: the value c' j = k, f proten b j contans doman a exactly k tmes. Note that k {0,,2,...}. The value c' s the number of doman-to-proten lnks for the (doman, proten) par a, b }. j { j Let : = C' * = c' j j= denote the number of occurrences of the doman a n all protens of the representatve proteome of S; =,2,..., N. We call the value m the number of redundant doman-to-proten lnks of the doman a of S. Let M denote the total number of doman-toproten lnks n the representatve proteome of the organsm S. M ' = 294 N = j= c' j. The value M s

called the redundant connectvty number of the (doman, proten) network. Note that all occurrences of a doman n the protens of S are counted. We also defne the adjacency matrx C, where c j =, f proten b j contans doman a at least once, and we defne c j = 0 otherwse. Ths matrx reflects the non-redundant structure of the doman-toproten lnks n the (doman, proten) network. Ths matrx counts each (doman, proten) lnk only once. Let m = c j j= lnks of the doman a of S. Let M =. We call the value m the number of non-redundant doman-to-proten N c j = j=. Note M M '. We call the matrx C the nonredundant adjacency matrx of S, and we call M the non-redundant connectvty number of the (doman, proten) network of S. The values M and M have been used as the non redundant and redundant measures of the proteome complexty of an organsm [, 4]. To quantfy the complexty assocated wth multple occurrences of the doman a n an organsm, we defne the redundancy value δ m = m, ( ) () To quantfy the relatve complexty of the doman a whch redundantly occurs m ' s tmes n the organsm S and whch redundantly occurs k m ' ( ) tmes n the organsm K, we can defne the dfference of redundant complexty value for the doman a ( s) ( δ =. By omttng the prm symbol n above notatons, we can also ntroduce the dfference of non redundant complexty value of the doman ( s) ( = m m δ m. a To characterze the relatve redundant complexty of the organsm S versus the organsm K, we use the emprcal bvarate dstrbutons of the random values ( µ ', δ ), where =,2,..., N ; µ ' = ( + )/ 2 s a mean value of the numbers of the lnks and N s the total number of domans occurred n doman-to-proten profles of the organsms S and K. By omttng the prm symbol n above notatons, we obtan the emprcal bvarate dstrbuton of the random values ( µ, δ m ), whch we can use as the relatve non redundant complexty measure of the organsms S and K. Results and Dscusson There are at least two dstngush processes of doman s spread n nature: ntegraton of two or more dfferent domans n a new proten (formng non-redundant doman-to-proten lnks) and multplcaton of a doman wthn the same proten (formng redundant doman-to-proten lnks). anels A-C n Fgures show the relatonshps between the numbers of non-redundant and the numbers of redundant doman-to-proten lnks counted for the T. volcanum, E. col and human 295

representatve proteomes. These panels demonstrate typcal patterns of the bvarate frequency dstrbutons of the random values ( m, ) (={ ( m, ),...,( m N ( s ), N ( s ) )}) correspondng to the numbers of redundant and non-redundant occurrences of domans n the organsm S. All panels n Fgure show a preferental dstrbuton of ponts along the dagonal; the spread of ponts n the orthogonal drecton to the dagonal s smaller. The extreme cases (T. volcanum and human proteomes) clearly dsplay the ncreased complexty of the human proteome due to the ncreased number of protens whch preferentally combnng dfferent domans together. anel D n Fg. shows the observed values ( m, ) for the 56 organsms s a factor of 0 larger then any specfc organsm. We found that these dstrbuton patterns are typcal for the studed archael, bacteral and eukaryotc organsms. Fgures A-D mply that ncreasng bologcal complexty n nature s mostly assocated wth formaton of new mult-doman protens combnng dfferent domans rater that wth the ncrease of doman-to-proten lnks occurrng due to ncreasng of multplcaton of domans wthn protens n whch domans have been already occurred. Fg.. The emprcal dstrbutons of the numbers of redundant lnks versus the numbers of non redundant lnks counted for the (A) T. volcanum, (B) E. col and (C) human representatve proteomes, and for (D) pooled data of 56 organsms. Dscontnue lne: lnear regresson; sold lne: dagonal. Fg. 2. The relatve complexty of proteome of a human versus a mouse, a rat (C) and A. thalna, respectvely. A: the dstrbuton of ( µ ', δ ); symbol s ndcates the human organsm, k = ndcates the mouse organsm. B, C, D: the dstrbutons of ( µ, δ m ); k=2, 3, 4 ndcate the mouse, the rat and A. thalna, (, respectvely. =,2,..., N s. 296

For all of the 56 studed organsms, we also found two dsjonted subsets of domans: the domans whch occurred two or more tmes n at least one proten of a proteome, and the domans whch was not multpled wthn any proten of the proteome (see examples on Fg. ). 454 (47 %) of the 9609 domans occurred two or more tmes n a same proten of the 56 organsms. We also found that f a doman s more common n the entre proteome world, then that doman has a bgger chance to appear multple tmes n a proten of any organsm (see Fg. ). Fgure 2 demonstrates how the proteome complexty of pars of organsms can be compared. Ths fgure shows the relatve proteome complexty measure for a human versus a mouse, a rat and A. thalana, respectvely. For example, the panel A shows the bvarate dstrbutons of dfferences of the number of redundant lnks for the human and mouse organsms, multpled by factor 0.5 wth respect to average number of the redundant lnks of the human and mouse organsms. Ths dstrbuton s hghly asymmetrc about axes x: the number of postve dfferences (abundant for a human) s bgger than the number of negatve dfferences (abundant for a mouse), and the postve hghlyabundant values occur more often n the human sample than n the mouse sample. Smlar asymmetrc trend we observed for our human-mouse comparson n the case of redundant lnks (Fg. 2B). Ths ndcates that the human organsm reuses domans more frequently n same proten and n dfferent protens, and nvents more dverse mult-doman protens than the mouse organsm. Ths analyss suggests that even though that the numbers of non redundant proten-codng genes n a human (~33,600 genes [, 4]) and n a mouse (~32,000 genes, by our current estmate based on the method n [, 4]) are approxmately smlar, and that even the total number of domans are also smlar numbers n these organsms (~5,600 for a human and ~5,00 for a mouse [4]), the proteome complexty and dversty of a sgnfcant fracton of mult-doman protens n a human s hgher than n a mouse. Large asymmetry n the bvarate dstrbuton of the values ( µ, δ ) around axes x we observed for a human versus a rat (Fg. 2C). However, the human and A. thalana proteomes show smlar proteome complexty by our crtera (Fg. 2D). Ths s not a surprse, because even though the number of proten-codng genes n human s larger than the number of proten-codng genes n A. thalana (~26,000-27,000 [, 4]), relatvely recent massve (and mperfect) genome duplcaton events n A. thalana mght dramatcally ncreased of proten repertore due to recombnaton events, whch perhaps ncreased of the doman shufflng n protens, but dd not ncrease of the number of domans (~5,300 domans [4]) presented n the A. thalana proteome. References. Kuznetsov V.A. Statstcs of the numbers of transcrpts and proten sequences encoded n the genome // Computatonal and Statstcal Methods to Genomcs / Eds. Zhang W., Shmulevch I. Kluwer, Boston, 2002.. 25 7. 2. Kuznetsov V.A., ckalov V.V., Senko O.V., Knott G.D. Analyss of evolvng proteomes: redctons of the number of proten domans n nature and the number of genes n eukaryotc organsms // J. Bol. Systems. 2002. V. 0.. 38 407. 3. Kuznetsov V.A. Famly of skewed dstrbutons assocated wth the gene expresson and proteome evoluton // Sgnal rocessng. 2003a. V. 83.. 889 90. 4. Kuznetsov V.A. A stochastc model of evoluton of conserved proten codng sequence n the archaeal, bacteral and eukaryotc proteomes // Fluctuaton and Nose Letters. 2003b. V. 3. L295 L324. 5. Mulder N.J., Apweler R., Attwood T.K. et al. Interro Consortum. The Interro Database, 2003 brngs ncreased coverage and new features // Nucl. Acds Res. 2003. V. 3.. 35 38. m 297