ANALYSIS OF PROTEOME COMPLEXITY BASED ON COUNTING DOMAIN-TO-PROTEIN LINKS

COMPUTATIONAL STRUCTURAL AND FUNCTIONAL PROTEOMICS

Kuznetsov V.A.*1, Pickalov V.V.2, Knott G.D.3, Kanapin A.A.4

1 CIT/NIH & SRA International, Inc., Bethesda, MD, 20892, USA, e-mail: vk28u@nih.gov; 2 Institute of Theoretical and Applied Mechanics SB RAS, Novosibirsk, Russia, e-mail: pickalov@itam.nsc.ru; 3 Civilized Software, Inc., Silver Spring, MD, 20906, USA; 4 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK
* Corresponding author: e-mail: alex@ebi.ac.uk

Keywords: proteome complexity, evolution, domain-to-protein links, skew distributions

Summary
We define the domain-to-protein linkage profile (DPLP) of a proteome as the list of domains, together with the numbers of their occurrences (links to proteins), found in the proteome of an organism. We estimated the DPLP for the 56 fully-sequenced genome organisms represented in the InterPro database. This work presents several quantitative measures of the complexity of a proteome based on the DPLP. For each of the 56 studied organisms, we found two large subsets of domains: the domains which occur two or more times in at least one protein, and the domains which are not duplicated within any protein of the proteome. The latter set of domains reflects well the increasing trend of biological complexity due to evolution. The statistical distributions of the number of domain-to-protein links in the proteome and the estimates of the differences between the DPLPs for pairs of organisms are used as measures of the relative biological complexity of the organisms. In particular, we show quantitatively the greater complexity of the human proteome relative to that of a mouse or a rat. These differences are only partially reflected in the number of protein-coding genes estimated for these species by the genome sequencing projects.

Introduction
There are no consensus quantitative measures of the relative or absolute complexity of an organism.
The biological complexity of an organism should be characterized by the list of all proteins in a given proteome and their dynamic interactions among themselves (protein network) and with other molecules (protein-DNA network, etc.). However, the total number of putative protein sequences based on complete genome sequence data of a given organism can be predicted only approximately, and our knowledge of the dynamic interactions of proteins in an organism is also essentially incomplete. Thus, we need more tractable and practical approaches to describing biological complexity.

Proteins contain short structural and/or functional 'blocks' (sequences of amino acids) called structural motifs (of length 5-35 nt) and structural domains (of length 30-250 nt) that are seen repeatedly in many proteins of all species whose genomes have been examined. The total number of domains in nature is probably very limited. The estimated number of such classes of homologous sequences of motifs and domains in nature ranges from 6,000 to 10,000 or so [1-3]. In this work, all such protein blocks will be called, for brevity, domains. Domains are essential to the biological function(s) of the protein in which they occur and serve as evolutionarily conserved building blocks for forming proteins. Domains correspond to specific sequences of DNA within genes which have been evolutionarily conserved. DNA corresponding to a domain may occur multiple times in a given protein-coding gene and/or in many different protein-coding genes within a given genome, and in the genomes of many species. We attempt to determine measures of biological complexity based on a limited set of domains.

If a domain $D_i$ occurs in a protein $P_j$, we say this constitutes a domain-to-protein link. The list of domains together with the numbers of occurrences of each domain (links) found in the proteome of an organism is called the domain-to-protein linkage profile (DPLP) of the proteome. The DPLP should allow us to characterize the domain-to-protein link network for each organism. Currently, we do not know all domains and all proteins in a given organism. However, several protein domain/motif databases provide large enough samples of DPLPs for a variety of organisms, so we can study the DPLP of an organism by observing sample DPLPs in representative proteomes (i.e., protein sets which were specified for organisms). We can analyze the samples of domain-to-protein linkage information in fully-sequenced genome organisms to categorize their proteome complexities. This work presents several quantitative measures of the complexity of a proteome based on observed DPLPs that appear to be appropriate and consistent.

Protein Domain Database Analyzer
Our information about the DPLPs of the 56 fully-sequenced genome archaeal, bacterial and eukaryotic organisms was obtained from the InterPro database (www.ebi.ac.uk/interpro, April 2004) [5]; the protein sequences for the organisms were obtained from the Proteome Analysis Database (www.ebi.ac.uk/proteome, April 2004). We developed a Protein Domain Database Analyzer (PDDA) program, which we used to access the InterPro database and download data into a local MySQL relational database [1]. The MySQL database consists of an $N_{tot} \times L$ table, where $N_{tot} = 9609$ domains and $L = 56$ organisms. Each row corresponds to an InterPro domain and each column corresponds to an organism. The $(i, S)$-th entry of the table is the number of occurrences of the domain $i$ in the proteome of organism $S$. The information in this table was analyzed with the PDDA data mining tools, which include statistical and graphical functions, logical functions, descriptive statistics, and correlation analysis.

Definitions
Let us focus on a specific organism $S$. Let $A = \{a_1, a_2, \ldots, a_N\}$ be the set of observed domains contained in the proteins of the organism $S$. Let $B = \{b_1, b_2, \ldots, b_P\}$ be the set of proteins in the representative proteome (sample of observed proteins) for the organism $S$. We define the $N \times P$ adjacency matrix $C' := [c'_{ij}]$ $(i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, P)$ for the organism $S$ as follows: $c'_{ij} = k$ if protein $b_j$ contains domain $a_i$ exactly $k$ times. Note that $k \in \{0, 1, 2, \ldots\}$. The value $c'_{ij}$ is the number of domain-to-protein links for the (domain, protein) pair $\{a_i, b_j\}$. Let

$$m'_i := C'_{i*} = \sum_{j=1}^{P} c'_{ij}$$

denote the number of occurrences of the domain $a_i$ in all proteins of the representative proteome of $S$; $i = 1, 2, \ldots, N$. We call the value $m'_i$ the number of redundant domain-to-protein links of the domain $a_i$ of $S$. Let $M'$ denote the total number of domain-to-protein links in the representative proteome of the organism $S$:

$$M' = \sum_{i=1}^{N} \sum_{j=1}^{P} c'_{ij}.$$

The value $M'$ is called the redundant connectivity number of the (domain, protein) network. Note that all occurrences of a domain in the proteins of $S$ are counted.

We also define the adjacency matrix $C$, where $c_{ij} = 1$ if protein $b_j$ contains domain $a_i$ at least once, and $c_{ij} = 0$ otherwise. This matrix reflects the non-redundant structure of the domain-to-protein links in the (domain, protein) network: it counts each (domain, protein) link only once. Let

$$m_i = \sum_{j=1}^{P} c_{ij}.$$

We call the value $m_i$ the number of non-redundant domain-to-protein links of the domain $a_i$ of $S$. Let

$$M = \sum_{i=1}^{N} \sum_{j=1}^{P} c_{ij}.$$

Note that $M \le M'$. We call the matrix $C$ the non-redundant adjacency matrix of $S$, and we call $M$ the non-redundant connectivity number of the (domain, protein) network of $S$. The values $M$ and $M'$ have been used as the non-redundant and redundant measures of the proteome complexity of an organism [1, 4].

To quantify the complexity associated with multiple occurrences of the domain $a_i$ in an organism, we define the redundancy value

$$\delta m_i = m'_i - m_i, \quad (i = 1, 2, \ldots, N). \qquad (1)$$

To quantify the relative complexity of the domain $a_i$, which redundantly occurs $m'^{(s)}_i$ times in the organism $S$ and $m'^{(k)}_i$ times in the organism $K$, we can define the difference of redundant complexity value for the domain $a_i$:

$$\delta m'_i = m'^{(s)}_i - m'^{(k)}_i.$$

By omitting the prime symbol in the above notation, we can also introduce the difference of non-redundant complexity value of the domain $a_i$:

$$\delta m_i = m^{(s)}_i - m^{(k)}_i.$$

To characterize the relative redundant complexity of the organism $S$ versus the organism $K$, we use the empirical bivariate distribution of the random values $(\mu'_i, \delta m'_i)$, where $i = 1, 2, \ldots, N$; $\mu'_i = (m'^{(s)}_i + m'^{(k)}_i)/2$ is the mean value of the numbers of links, and $N$ is the total number of domains occurring in the domain-to-protein profiles of the organisms $S$ and $K$. By omitting the prime symbol in the above notation, we obtain the empirical bivariate distribution of the random values $(\mu_i, \delta m_i)$, which we can use as the relative non-redundant complexity measure of the organisms $S$ and $K$.
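The single-organism definitions above can be illustrated with a minimal Python sketch. The toy proteome, the domain and protein names, and the helper `linkage_counts` are hypothetical and are not part of the PDDA program; the redundancy value is assumed here to be the plain difference between the redundant and non-redundant link counts.

```python
from collections import Counter

# Hypothetical toy proteome: protein name -> list of domain occurrences.
toy_proteome = {
    "b1": ["a1", "a1", "a2"],
    "b2": ["a2"],
    "b3": ["a1", "a3", "a3", "a3"],
}

def linkage_counts(proteome):
    """Return per-domain redundant (m') and non-redundant (m) link counts."""
    m_red = Counter()     # m'_i: every occurrence of the domain is counted
    m_nonred = Counter()  # m_i: each (domain, protein) pair is counted once
    for domains in proteome.values():
        per_protein = Counter(domains)  # the entries c'_ij for this protein
        for dom, k in per_protein.items():
            m_red[dom] += k
            m_nonred[dom] += 1
    return m_red, m_nonred

m_red, m_nonred = linkage_counts(toy_proteome)
M_red = sum(m_red.values())        # redundant connectivity number M'
M_nonred = sum(m_nonred.values())  # non-redundant connectivity number M
# Assumed form of the redundancy value: difference m'_i - m_i per domain.
redundancy = {dom: m_red[dom] - m_nonred[dom] for dom in m_red}
```

In this toy case the domain `a3` occurs three times in a single protein, so it contributes three redundant links but only one non-redundant link.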
Results and Discussion
There are at least two distinct processes by which domains spread in nature: integration of two or more different domains in a new protein (forming non-redundant domain-to-protein links) and multiplication of a domain within the same protein (forming redundant domain-to-protein links). Panels A-C in Figure 1 show the relationships between the numbers of non-redundant and the numbers of redundant domain-to-protein links counted for the T. volcanium, E. coli and human representative proteomes. These panels demonstrate typical patterns of the bivariate frequency distributions of the random values $(m, m')$ $(= \{(m_1, m'_1), \ldots, (m_{N(s)}, m'_{N(s)})\})$ corresponding to the numbers of non-redundant and redundant occurrences of domains in the organism $S$. All panels in Figure 1 show a preferential distribution of points along the diagonal; the spread of points in the direction orthogonal to the diagonal is smaller. The extreme cases (T. volcanium and human proteomes) clearly display the increased complexity of the human proteome due to the increased number of proteins which preferentially combine different domains. Panel D in Fig. 1 shows that the range of the observed values $(m, m')$ pooled for the 56 organisms is a factor of 10 larger than for any specific organism. We found that these distribution patterns are typical for the studied archaeal, bacterial and eukaryotic organisms. Figures 1A-D imply that increasing biological complexity in nature is mostly associated with the formation of new multi-domain proteins combining different domains, rather than with an increase in domain-to-protein links due to further multiplication of domains within proteins in which those domains already occur.

Fig. 1. The empirical distributions of the numbers of redundant links versus the numbers of non-redundant links counted for the (A) T. volcanium, (B) E. coli and (C) human representative proteomes, and for (D) pooled data of 56 organisms. Dashed line: linear regression; solid line: diagonal.

Fig. 2. The relative complexity of the proteome of a human versus a mouse (A, B), a rat (C) and A. thaliana (D), respectively. A: the distribution of $(\mu'_i, \delta m'_i)$; symbol $s$ indicates the human organism, $k = 1$ indicates the mouse organism. B, C, D: the distributions of $(\mu_i, \delta m_i)$; $k = 2, 3, 4$ indicate the mouse, the rat and A. thaliana, respectively. $i = 1, 2, \ldots, N$.
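The comparison of a regression line with the diagonal in Fig. 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the (m, m') pairs are invented toy values, the fit is an ordinary least-squares line on untransformed counts, and the function name `least_squares_line` is hypothetical; the paper does not state which scale or fitting procedure was used for the figure.

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical (m_i, m'_i) pairs: non-redundant vs redundant link counts.
pairs = [(1, 1), (2, 2), (2, 3), (4, 4), (5, 8), (10, 12)]
m_values = [m for m, _ in pairs]
m_prime_values = [mp for _, mp in pairs]

slope, intercept = least_squares_line(m_values, m_prime_values)

# Domains on the diagonal (m' == m) are never repeated within a protein;
# domains above it (m' > m) are repeated in at least one protein.
on_diagonal = [p for p in pairs if p[1] == p[0]]
above_diagonal = [p for p in pairs if p[1] > p[0]]
```

Because every redundant count is at least as large as the corresponding non-redundant count, no point can fall below the diagonal; a fitted slope close to 1 corresponds to points concentrated near the diagonal.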

For all of the 56 studied organisms, we also found two disjoint subsets of domains: the domains which occurred two or more times in at least one protein of a proteome, and the domains which were not multiplied within any protein of the proteome (see examples in Fig. 1). 454 (47 %) of the 9609 domains occurred two or more times in the same protein in the 56 organisms. We also found that if a domain is more common in the entire proteome world, then that domain has a greater chance of appearing multiple times in a protein of any organism (see Fig. 1).

Figure 2 demonstrates how the proteome complexity of pairs of organisms can be compared. This figure shows the relative proteome complexity measure for a human versus a mouse, a rat and A. thaliana, respectively. For example, panel A shows the bivariate distribution of the differences in the number of redundant links between the human and mouse organisms, multiplied by a factor of 0.5, with respect to the average number of redundant links of the human and mouse organisms. This distribution is highly asymmetric about the x axis: the number of positive differences (abundant for a human) is bigger than the number of negative differences (abundant for a mouse), and the positive highly abundant values occur more often in the human sample than in the mouse sample. We observed a similar asymmetric trend for our human-mouse comparison in the case of non-redundant links (Fig. 2B). This indicates that the human organism reuses domains more frequently, both within the same protein and in different proteins, and invents more diverse multi-domain proteins than the mouse organism. This analysis suggests that even though the numbers of non-redundant protein-coding genes in a human (~33,600 genes [1, 4]) and in a mouse (~32,000 genes, by our current estimate based on the method in [1, 4]) are approximately similar, and even though the total numbers of domains in these organisms are also similar (~5,600 for a human and ~5,100 for a mouse [4]), the proteome complexity and the diversity of a significant fraction of multi-domain proteins are higher in a human than in a mouse.
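The pairwise comparison described above can be sketched in Python as follows. This is only an illustration: the link counts are invented, the function names `compare_profiles` and `sign_counts` are hypothetical, the difference is assumed to be the plain (unscaled) difference of link counts, and a domain absent from one organism is assumed to contribute zero links there.

```python
def compare_profiles(links_s, links_k):
    """Return {domain: (mean links, difference S - K)} over the union of domains."""
    comparison = {}
    for dom in set(links_s) | set(links_k):
        s = links_s.get(dom, 0)  # links in organism S (0 if the domain is absent)
        k = links_k.get(dom, 0)  # links in organism K (0 if the domain is absent)
        comparison[dom] = ((s + k) / 2, s - k)
    return comparison

def sign_counts(comparison):
    """Count domains with positive and with negative differences."""
    positive = sum(1 for _, delta in comparison.values() if delta > 0)
    negative = sum(1 for _, delta in comparison.values() if delta < 0)
    return positive, negative

# Hypothetical per-domain link counts for two organisms S and K.
links_s = {"a1": 12, "a2": 5, "a3": 7, "a4": 1}
links_k = {"a1": 9, "a2": 5, "a3": 3, "a5": 2}

comparison = compare_profiles(links_s, links_k)
positive, negative = sign_counts(comparison)
```

An excess of positive over negative differences, as in this toy case, is the kind of asymmetry about the x axis that the text reports for the human versus mouse comparison. The same functions apply to redundant or non-redundant counts, depending on which profile is passed in.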
We also observed a large asymmetry about the x axis in the bivariate distribution of the values $(\mu_i, \delta m_i)$ for a human versus a rat (Fig. 2C). However, the human and A. thaliana proteomes show similar proteome complexity by our criteria (Fig. 2D). This is not a surprise: even though the number of protein-coding genes in human is larger than the number of protein-coding genes in A. thaliana (~26,000-27,000 [1, 4]), relatively recent massive (and imperfect) genome duplication events in A. thaliana might have dramatically increased its protein repertoire through recombination events, which perhaps increased domain shuffling in proteins but did not increase the number of domains (~5,300 domains [4]) present in the A. thaliana proteome.

References
1. Kuznetsov V.A. Statistics of the numbers of transcripts and protein sequences encoded in the genome // Computational and Statistical Methods to Genomics / Eds. Zhang W., Shmulevich I. Kluwer, Boston, 2002. P. 125-171.
2. Kuznetsov V.A., Pickalov V.V., Senko O.V., Knott G.D. Analysis of evolving proteomes: Predictions of the number of protein domains in nature and the number of genes in eukaryotic organisms // J. Biol. Systems. 2002. V. 10. P. 381-407.
3. Kuznetsov V.A. Family of skewed distributions associated with the gene expression and proteome evolution // Signal Processing. 2003a. V. 83. P. 889-910.
4. Kuznetsov V.A. A stochastic model of evolution of conserved protein coding sequence in the archaeal, bacterial and eukaryotic proteomes // Fluctuation and Noise Letters. 2003b. V. 3. P. L295-L324.
5. Mulder N.J., Apweiler R., Attwood T.K. et al. InterPro Consortium. The InterPro Database, 2003 brings increased coverage and new features // Nucl. Acids Res. 2003. V. 31. P. 315-318.