Additional File 1 - A model-based circular binary segmentation algorithm for the analysis of array CGH data

Fang-Han Hsu 1, Hung-I H Chen, Mong-Hsun Tsai, Liang-Chuan Lai 5, Chi-Cheng Huang 1,6, Shih-Hsin Tu 6, Eric Y Chuang* 1, and Yidong Chen*

1 Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan
2 Greehey Children's Cancer Research Institute, The University of Texas Health Science Center at San Antonio, San Antonio, TX 789, USA
3 Department of Epidemiology and Biostatistics, The University of Texas Health Science Center at San Antonio, San Antonio, TX 789, USA
4 Institute of Biotechnology, Center for Systems Biology and Bioinformatics, National Taiwan University, Taipei 106, Taiwan
5 Graduate Institute of Physiology, National Taiwan University, Taipei 100, Taiwan
6 Cathay General Hospital, Taipei 106, Taiwan

Contents
Comparison Platform
Typical Estimates of Skewness and Kurtosis from Real aCGH Data
Comparison of Performance between the Hybrid CBS and eCBS
The Validity of aCGH Data Simulation Using the Pearson System
Alternative Estimators of Skewness & Kurtosis

Supplementary Materials

Comparison Platform

Time consumption studies were carried out to compare the speed of the hybrid CBS and eCBS algorithms. The hardware used for the comparison was an IBM xServer 5 with two Xeon.GHz CPUs and 1G RAM. As for software, DNAcopy version 1.16, distributed through Bioconductor/R [http://www.bioconductor.org/], was installed. All parameters of CBS were set to their defaults, and the imbedded smoothing function was used for removing outliers. Unless otherwise specified, the significance threshold of the maximal-t test was set to p-value < 0.01.
Typical Estimates of Skewness and Kurtosis from Real aCGH Data

Supplementary Figure 1 shows typical estimates of skewness and kurtosis from real aCGH data. These values are obtained from 10 breast cancer aCGH samples using the Agilent Human Genome CGH 105A and 11 human glioblastoma (GBM) aCGH samples (GSE9177) using the Agilent A human CGH arrays. As shown in the figure, aCGH data are typically skewed with -0. < skewness < 0. and heavy-tailed with .0 < kurtosis < .5.

Supplementary Figure 1. Estimates of skewness and kurtosis on real data. These values are obtained from (a) 10 breast cancer aCGH samples using the Agilent Human Genome CGH 105A and (b) 11 human glioblastoma (GBM) aCGH samples (GSE9177) using the Agilent A human CGH arrays. Pre-segmentation was applied before evaluating the estimates; this avoids estimation bias due to extremely large values in the data.

Comparison of Performance between the Hybrid CBS and eCBS

The simulated data, generated using the second model mentioned in the article, contain 1,500 probes (N = 1,500) and either one change-point near the edges or two change-points in the center of the chromosomes. The locations and amplitudes of the change-points were controlled by m = cvI, where I is an indicator function that equals 1 for segments between l < x < (l + k) and 0 otherwise. Parameter k refers to the width of the variation, and l refers to the location of the variation. Supplementary Table 1 shows the results.
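The simulation model described above can be illustrated with a minimal sketch. It assumes that the factor written as "v" in m = cvI is the noise standard deviation (the table caption defines c as the number of standard deviations between the two means); the function names and the values l = 750, k = 5, c = 3 are hypothetical choices for illustration, not settings taken from the study.

```python
import random

def segment_mean(x, l, k, c, sigma):
    """Mean shift m = c * sigma * I, where I = 1 when l < x < l + k and 0 otherwise."""
    indicator = 1 if l < x < l + k else 0
    return c * sigma * indicator

def simulate_probes(n=1500, l=750, k=5, c=3.0, sigma=1.0, seed=0):
    """Draw n normally distributed probes and add the mean shift to each position.

    Assumption: sigma plays the role of the factor written as "v" in the text.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) + segment_mean(x, l, k, c, sigma)
            for x in range(1, n + 1)]
```

Note that with strict inequalities on both sides, as written in the text, the indicator covers k - 1 integer probe positions; whether one boundary should be inclusive is not recoverable from the source.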
Algorithm 1: hybrid CBS (DNAcopy 1.16)
Algorithm 2: eCBS

Columns: k, c, methods; Change-points (edge): Exact 0 1 >=5; Change-points (center): Exact 0 1 >=5

5 hybrid CBS 1 97 17 9 0 0 0 1 971 0 9 0 0 0
eCBS 18 968 10 0 0 0 15 966 0 0 1 0
hybrid CBS 179 789 199 11 1 0 0 15 800 0 195 0 5 0
eCBS 18 77 1 11 1 0 0 180 767 0 8 0 5 0
hybrid CBS 6 16 668 7 8 1 0 55 99 0 59 0 8 0
eCBS 695 69 719 9 1 0 60 0 656 0 11 0
hybrid CBS 65 898 91 11 0 0 0 909 0 90 0 1 0
eCBS 71 888 99 1 0 0 0 5 897 0 101 0 0
hybrid CBS 9 9 589 10 7 1 0 9 06 0 58 0 10 0
eCBS 77 77 608 11 1 0 57 89 0 60 0 9 0
hybrid CBS 96 5 10 0 0 86 5 0 958 1 0
eCBS 0 0 97 0 0 866 7 0 956 0 17 0
hybrid CBS 117 801 187 11 1 0 0 11 76 0 0 0
eCBS 16 786 01 1 1 0 0 15 78 0 8 0 0
hybrid CBS 715 18 85 9 10 0 0 599 15 0 88 0 17 0
eCBS 76 11 865 9 0 1 60 19 0 857 0 1 0
hybrid CBS 99 1 98 5 11 0 0 888 0 980 0 18 0
eCBS 9 1 988 6 0 1 890 0 98 0 16 0
hybrid CBS 7 6 6 11 1 0 161 67 0 65 0 8 0
eCBS 59 600 8 1 1 1 165 61 0 78 0 10 0
hybrid CBS 80 950 7 10 0 0 667 0 0 95 1 0
eCBS 807 7 958 8 6 0 1 668 0 951 0 16 0
hybrid CBS 90 0 986 10 0 0 879 0 0 98 0 16 0
eCBS 90 0 986 10 0 1 87 0 0 978 0 0

Supplementary Table 1. The number of change-points detected by the hybrid CBS and eCBS. We applied these methods to 1,000 datasets, each containing 1,500 probes simulated from the normal distribution. The Exact columns count the number of cases in which the segmentation results exactly match the desired number (1 for edge and 2 for center) and locations of change-points. Here k is the width of the changed segment and c is the number of standard deviations between the two means. Each dataset had one elevated region ranging from to 5 points, and the elevated region varied from to SDs above the mean. The cutoff of the p-value for the simulation was 0.01.
The Validity of aCGH Data Simulation Using the Pearson System

In the study, the Pearson system was assumed sufficient to simulate a wide range of aCGH data under the null condition (no change-points). To assess the validity of this assumption, we simulated several datasets using the Pearson system and compared the distribution of the simulated data to the distribution of real aCGH data using a two-sample Kolmogorov-Smirnov test (KS-test).

One of the 10 breast cancer and 11 glioblastoma (GBM) aCGH data sets indicated in Section Methods - Real aCGH Data was selected and pre-segmented; after the pre-segmentation process, the skewness and kurtosis of the array were estimated. Using the estimates of mean, standard deviation, skewness and kurtosis from the selected array as input parameters, we randomly generated 1,000 probes using the Matlab function pearsrnd() as the simulated data for hypothesis testing. Additionally, we randomly picked 1,000 probes from the selected array after pre-segmentation. This set of probes is the real data for hypothesis testing.

This gives one simulated sample from the Pearson system and one real sample from the selected array. A two-sample KS-test was applied with the null hypothesis that the two datasets under consideration are from the same continuous distribution. If the p-value is smaller than alpha = 0.01, we reject the null hypothesis.

Real Data, Size 1000, Size 100 | Real Data, Size 1000, Size 100
Array #10 0 0 | GSM188 0
Array #19 0 1 | GSM189 1 1
Array # 0 1 | GSM1850
Array #8 1 1 | GSM1851 0
Array # 0 1 | GSM185 0
Array #5 1 | GSM185 1 1
Array #8 0 0 | GSM185 0 1
Array #65 0 0 | GSM1855 0 1
Array #7 1 | GSM1856 7 1
Array #78 0 1 | GSM1857 0
| GSM1858 1 1

Supplementary Table 2. The number of times among 100 that the p-values of the two-sample KS-test are smaller than 0.01. Real data are drawn from the aCGH data labeled in the column Real Data, while the simulated data are drawn from the Pearson system with the parameters (mean, standard deviation, skewness, and kurtosis) set to the estimates derived from the corresponding array. The column Size 1000 refers to the cases with 1,000 probes, and the column Size 100 refers to the cases with 100 probes.
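The testing procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the study used Matlab's pearsrnd() to draw from the Pearson system, whereas here the simulator is passed in as a hypothetical callable `simulate(rng, size)`, and the p-value uses the asymptotic Kolmogorov distribution, which is only an approximation for small samples. All function names are illustrative.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x so ties are handled together.
        while i < na and a[i] <= x:
            i += 1
        while j < nb and b[j] <= x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

def ks_pvalue(d, na, nb):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    lam = math.sqrt(na * nb / (na + nb)) * d
    if lam < 0.2:
        return 1.0  # the series converges poorly here; the p-value is essentially 1
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))

def count_rejections(real_array, simulate, size, repeats=100, alpha=0.01, seed=0):
    """Count how often the KS-test rejects 'same distribution' over repeated draws.

    `simulate(rng, size)` is a hypothetical stand-in for a Pearson-system sampler.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(repeats):
        real = rng.sample(real_array, size)   # random probes from the real array
        sim = simulate(rng, size)             # freshly simulated probes
        if ks_pvalue(ks_statistic(real, sim), size, size) < alpha:
            rejections += 1
    return rejections
```

A small rejection count for a given array and sample size corresponds to the small entries reported in the table.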
We repeated the process 100 times per array and listed the number of times that the p-values were smaller than 0.01. Supplementary Table 2 shows the results. As shown in the table, whether the data size is large (size = 1000) or small (size = 100), variables drawn from the real data and variables drawn from the Pearson system did not lead to a statistically significant difference in distribution. This indicates that our assumption, that the Pearson system can simulate aCGH data, is sound and most likely correct.

Alternative Estimators of Skewness & Kurtosis

To avoid estimation bias due to copy number alterations in the data, we tried alternative estimators for the 2nd, 3rd, and 4th central moments as follows. Let $x_1, x_2, \ldots, x_n$ denote independent and identically distributed (i.i.d.) random variables with $E[x_i] = \mu$. We are here interested in deriving the 2nd, 3rd, and 4th central moments. Assume new random variables $y_{1,i}$, $y_{2,i}$, $y_{3,i}$, and $y_{4,i}$ constructed from the $x_i$, where $y_{1,i}$ is the difference between neighboring probes and $1 \le i \le n-1$ for $y_{1,i}$, with a corresponding index range for each of $y_{2,i}$, $y_{3,i}$, and $y_{4,i}$.

An unbiased estimator for the 2nd central moment, $\hat{m}_2$, has been proposed and is given by

$$\hat{m}_2 = E[(x_i - E[x_i])^2] = \tfrac{1}{2} E[y_{1,i}^2]. \qquad \text{(s1)}$$

Similarly, an unbiased estimator for the 3rd central moment, $\hat{m}_3$, is proposed and given by

$$\hat{m}_3 = E[(x_i - E[x_i])^3] = \ldots, \qquad \text{(s2)}$$
and an unbiased estimator for the 4th central moment, $\hat{m}_4$, is proposed and given by

$$\hat{m}_4 = E[(x_i - E[x_i])^4] = \ldots \qquad \text{(s3)}$$

Estimates of the skewness and kurtosis of aCGH data can theoretically be derived using $\hat{m}_2$, $\hat{m}_3$, and $\hat{m}_4$, which are given by

$$\text{skewness} = \frac{\hat{m}_3}{\hat{m}_2^{3/2}}, \qquad \text{kurtosis} = \frac{\hat{m}_4}{\hat{m}_2^{2}}.$$

The motivation for using the difference between neighboring probes, $y_{1,i}$, instead of the original data, $x_i$, comes from the observation (shown in Supplementary Figure 2) that bias due to copy number changes can be virtually removed. As shown in subplot (a), the original data contain a region of obvious copy number gain, while after conversion (from $x_i$ to $y_{1,i}$), as shown in subplot (b), regional noise is converted to point noise.

Practically, the standard deviation of $y_{1,i}$, $\sigma_{y_{1,i}}$, can be easily obtained using the median of absolute deviation (MAD) to avoid the influence of point noise, or,

$$\sigma_{y_{1,i}} = 1.785\,\mathrm{MAD}(y_{1,i}).$$

Using the data conversion (from $x_i$ to $y_{1,i}$) and the MAD method, we can obtain robust estimates of the 2nd central moment, $\hat{m}_2$. However, for the 3rd and 4th central moments, we cannot simply apply the mean or the MAD operators to get robust estimates, for several reasons:

1) The input aCGH data, $x_i$, may not satisfy the assumption of independence completely. While we may get good estimates of the standard deviation, the derivation of the 3rd and 4th moments requires a much more stringent independence condition;
2) The MAD for estimating the standard deviation requires a normal distribution of $y_{1,i}$. This is not the case for the 3rd and 4th moments, where we cannot assume a symmetric distribution (skewness);
3) The mean operator is prone to point noise, whether the noise is due to the conversion from $x_i$ to $y_{1,i}$ or due to the intrinsic noise from array measurement.

Our experience with real aCGH data indicates that the estimates provided by Eqs. (s1, s2, s3) were not robust enough for the above reasons. Thus, we applied the pre-segmentation process to get accurate estimates of skewness and kurtosis in the study.

Supplementary Figure 2. Suppose a region (probe #51 to #100) of copy number gain lifts the sequential data by 1: (a) estimating central moments from the original aCGH data may lead to biased results due to regional noise from CNAs; (b) estimating central moments from differences between neighboring probes can result in minimized bias (point noise).
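The effect illustrated in Supplementary Figure 2 can be reproduced with a short sketch. It rests on several assumptions that are not taken from the source: the difference is taken as x_i - x_(i+1); the MAD is scaled by 1.4826, the usual normal-consistency constant, rather than the constant printed in the text; the result is divided by sqrt(2) to convert the standard deviation of a difference of two independent probes back to that of a single probe; and kurtosis is the non-excess form, which equals 3 for a normal distribution. All function names are illustrative.

```python
import math
import statistics

def neighbor_differences(x):
    """Differences between neighboring probes: y_i = x_i - x_(i+1)."""
    return [x[i] - x[i + 1] for i in range(len(x) - 1)]

def mad(values):
    """Median of absolute deviations from the median."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def robust_sd(x):
    """Noise SD of x estimated from neighbor differences.

    Assumptions: 1.4826 is the normal-consistency constant for the MAD, and
    sqrt(2) accounts for the variance of a difference of two independent probes.
    """
    return 1.4826 * mad(neighbor_differences(x)) / math.sqrt(2.0)

def central_moment(x, order):
    """Plain sample central moment of the given order."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** order for v in x) / len(x)

def skewness(x):
    """Third central moment divided by the 3/2 power of the second."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def kurtosis(x):
    """Fourth central moment divided by the square of the second (non-excess)."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2
```

For a noiseless step that lifts probes #51 to #100 by 1, `neighbor_differences` is nonzero at only two positions (the two boundaries), which is the "point noise" the figure refers to; a median-based scale estimate is largely unaffected by those two points, whereas the plain standard deviation of the original data is inflated by the whole region.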