IEMS Vol. 4, No. 1, pp. 102-108, June 2005.

A Comparative Study of Medical Data Classification Methods Based on Decision Tree and System Reconstruction Analysis

Tzung-I Tang
Department of Information & Electronic Commerce
Kainan University, Taiwan
Tel: 886-3-34-500 ext., E-mail: michael@mail.knu.edu.tw

Gang Zheng
Department of Computer Science
Tianjin University of Technology, Tianjin, China
E-mail: zheng-gang@eyou.com

Yalou Huang
Computer Science Institute
Nankai University, Tianjin, China
E-mail: yellow@nankai.edu.cn

Guangfu Shu
Institute of Systems Science
Chinese Academy of Science, Beijing, China
E-mail: guangfu-shu@yahoo.com

Pengtao Wang
Department of Computer Science
Tianjin University of Technology, Tianjin, China
E-mail: wangpengtao@eyou.com

†: Corresponding Author

Abstract. This paper studies medical data classification methods, comparing decision tree and system reconstruction analysis as applied to heart disease medical data mining. The data we study were collected from patients with coronary heart disease and comprise 1,723 records of 71 attributes each. We use the system-reconstruction method to weight the data. We then apply decision tree algorithms: induction of decision trees (ID3), C4.5, classification and regression tree (CART), Chi-square automatic interaction detector (CHAID), and Exhaustive CHAID. We use the results to compare the correction rate (the share of correctly classified cases), leaf number, and tree depth of the different decision-tree algorithms. The experiments show that weighted data can improve the correction rate on coronary heart disease data but have little effect on tree depth and leaf number.

Keywords: data mining, decision tree and system analysis, data classification.

1. INTRODUCTION

Data mining techniques have been applied to medical services in several areas, including prediction of the effectiveness of surgical procedures, medical tests, and medication, and the discovery of relationships among clinical and diagnosis data (Prather et al., 1997; Aslandogan and Mahajan, 2004). Our study is concerned with the analysis of coronary heart disease data. Coronary heart disease has become more prevalent in recent years, prompting scholars to devote more attention to its risk factors. Early diagnosis and treatment is one of the best approaches to reducing the disease's death rate.

The paper uses data from 1,723 coronary heart disease patient cases. Each case contains about 71 attributes. The data come from a hospital clinic's observations and allow us to obtain a good classification of patients' status and behavior, from which we can determine the relationships among the factors. We also want to find a data mining method suited to analyzing the medical data. We use a system-reconstruction method for data preprocessing and decision-tree algorithms for classification, and we compare the classification correction rate of data weighted by the system-reconstruction method with that of unweighted data.

In this paper, we first introduce the system-reconstruction method and show how the coronary heart disease data are processed. We then discuss the theory and algorithms of decision trees, including ID3, C4.5, CART, CHAID, and Exhaustive CHAID. Finally, we apply these methods to medical data mining by designing experiments that compare the correction rate, tree depth, and leaf number of decision trees built from weighted and unweighted data.

2. Data-Preparation Methods

This study uses the mixed-variable system reconstruction and prediction method, one of the most recent results in the field of system theory, to weight the data of coronary heart disease patients. We obtain the weight of every factor in the data and use the results in the decision tree analysis.

The system reconstruction analysis used in data mining is a kind of system analysis method based on the constraint analysis theory of Ashby (1965). Through the efforts of many scientists from the United States, the Netherlands, Germany, and Japan, led by Klir (1976), the theory and methodology architecture (Cavallo and Klir, 1979) was largely established. At that time, the main concerns were how to better partition the whole system into sub-systems and how to better ensure the characteristics of the whole system from local characteristics.
Zwick (2000) and others have addressed the first concern. Approaches to the second question have been developed by Jones (1989), who designed a computation method that uses characteristic reconstruction and a condition function reflecting system features to keep the conditions of the main system and its sub-systems consistent (the main system reflects the local ones). To meet the needs of real-world problems, Shu (1997, 1998) developed a mixed-variable (continuous, discrete, and classification variables) factor-analysis method through variable reconstruction analysis. This method increases the precision of quantitative results and lays a foundation for quantitative synthesis. He also presented reconstruction-prediction, evaluation, optimization, and decision-support methods (Shu 1997). These methods are applicable to many fields, including medicine, environmental studies, economy and finance, marketing strategy, industrial management, and talent prediction.

2.1 Theory Description

Our model for analyzing patients' data and obtaining the weight of every factor is as follows:

(1) Compute the maximum and minimum of the importance degree of the factor aggregate state for each quota level $i$:

$$\Phi_i^{\max} = \sum_{k=1}^{N} \max_{l \in \Omega_k} edc_i^{*}(k, l) \tag{1}$$

$$\Phi_i^{\min} = \sum_{k=1}^{N} \min_{l \in \Omega_k} edc_i^{*}(k, l) \tag{2}$$

where $edc$ is the information entropy distance of the property function between the hypothesis system and the original system, $edc^{*} = 1/edc$, and $\Omega_k$ is the set of all level numbers of factor $k$.

(2) Compute the value generated from the importance degree of the factor aggregate state at each quota level for the sample $T$, where $l_k(T)$ is the level that factor $k$ takes in $T$:

$$\Phi_i(T) = \sum_{k=1}^{N} edc_i^{*}[k, l_k(T)] \tag{3}$$

(3) Compute the realizing degree of every quota level for sample $T$:

$$\bar{\Phi}_i(T) = \frac{\Phi_i(T) - \Phi_i^{\min}}{\Phi_i^{\max} - \Phi_i^{\min}} \tag{4}$$

(4) Use $\bar{\Phi}_i(T)$ to forecast the trend, selecting the quota level with the maximum value as the forecast level.

(5) Compute the forecasting value. When the trend forecast is at the low level,

$$W(T) = \frac{\bar{\Phi}_0(T) E_0 + \bar{\Phi}_1(T) E_1}{\bar{\Phi}_0(T) + \bar{\Phi}_1(T)} \tag{5}$$

where $W$ is the predicted value of sample $T$, $E_0$ is the low-level edge, and $E_1$ is the middle value. When the trend forecast is at the middle level,

$$W(T) = \frac{\bar{\Phi}_0(T) E_0 + \bar{\Phi}_1(T) E_1 + \bar{\Phi}_2(T) E_2}{\bar{\Phi}_0(T) + \bar{\Phi}_1(T) + \bar{\Phi}_2(T)} \tag{6}$$

where $E_2$ is the high-level edge. When the trend forecast is at the high level,

$$W(T) = \frac{\bar{\Phi}_1(T) E_1 + \bar{\Phi}_2(T) E_2}{\bar{\Phi}_1(T) + \bar{\Phi}_2(T)} \tag{7}$$

2.2 Patient case digitalization

First, we explain how we obtain the patients' data. The 1,723 patient records include the following nine groups: ache, alleviation method, sign, personal history, blood fat, electrocardiogram, ultrasonic cardiogram (UCG), and Holter. Parameters include gender, temperament, age, ache character, position, time, cause, pulse, blood pressure, family history, smoking history, smoking amount, alcohol history, high blood pressure history, diabetes history, blood sugar, uric acid, and arrhythmia. Table 1 shows our patient-case digitalization method.

Table 1. Patient case digitalization (group: Sign)

Item                        Variable   Value
Pulse                       V9         number
Blood pressure, systole     V10        number
Blood pressure, diastole    V11        number
Speed                       V12        times/s
Rhythm                      V13        1: yes, -1: no

2.3 Result of data reconstruction analysis

Table 2. Patients' factor weight sorting table

Order   Factor                                            Factor weight (0-1)
1       Blood pressure systole high                       1
2       Cause of drinking                                 0.89833
3       Blood pressure diastole high                      0.78503
4       Female                                            0.7903
5       Atrial premature beats, atrial tachycardia        0.6503
6       Myocardial enzyme GOT serous                      0.5834
7       Cause of sleeping                                 0.5743
8       Myocardial enzyme CKMB serous                     0.503684
9       No taking glonoine                                0.494534
10      More locations of ache                            0.476548
:       :                                                 :
71

The factor weights produced by reconstruction analysis are based on the real diagnosis results. Because the pathogenesis of coronary heart disease and the phenomena observed in diagnosis are diverse, some factors play a large role in a patient's heart disease while others do not. The weight table obtained from reconstruction analysis therefore reflects the real diagnosis results and can be used in the next phase of analysis. Table 2 is the weight list for the factors of the sample problem.

2.4 Data for next analysis

After we use the system reconstruction method to analyze the patients' data, we obtain the factor weights, which are important for the next step of our analysis. When we digitalize the data, we do not consider the relationships among the factors. In fact, the factors are related, and their degrees of importance to coronary heart disease differ.
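To make the idea of weighting concrete, the following minimal sketch scales each digitized attribute by its factor weight. The paper does not spell out the exact operation used, so the element-wise scaling, the function name `apply_factor_weights`, the attribute names, and the weight values are all assumptions made for illustration only.

```python
from typing import Dict, List


def apply_factor_weights(
    records: List[Dict[str, float]],
    weights: Dict[str, float],
) -> List[Dict[str, float]]:
    """Scale every attribute value by its factor weight (assumed scheme).

    Attributes that have no entry in `weights` keep their value (weight 1.0).
    """
    return [
        {name: value * weights.get(name, 1.0) for name, value in record.items()}
        for record in records
    ]


# Hypothetical digitized cases, coded 1 = yes and -1 = no.
raw = [
    {"systole_high": 1, "female": -1},
    {"systole_high": -1, "female": 1},
]
# Illustrative weights only; these are not the values of Table 2.
weights = {"systole_high": 1.0, "female": 0.5}

weighted = apply_factor_weights(raw, weights)
print(weighted)
```

A positive rescaling of a single column keeps the ordering of its values, so threshold-based tree splits on that column alone would be unchanged. The authors' actual procedure may therefore differ from this simple scheme.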
Since we have the weights, we can process the original digitalized data with the factor weights and use the result in the next analysis. We thus obtain two data sets, weighted and unweighted, which we submit to decision-tree analysis to learn which yields better results.

3. Decision-tree method

3.1 Algorithm Description

The decision tree induction algorithm has been used broadly for several years. It approximates discrete-valued functions and can yield many useful expressions. It is one of the most important methods for classification. The algorithm's terms follow the tree metaphor. A tree has a root, which is the first split point on a data attribute, and leaves, so that every path from root to leaf forms a rule that is easily understood. Since the decision tree is built from given data, the values and characteristics of the data matter. For example, the amount of data affects the result of the tree-building procedure, and the type of attribute values also affects the tree model.

Decision trees need two kinds of data: training and testing. Training data, usually the larger part of the data, are used for constructing trees. In general, the more training data collected, the higher the accuracy of the results. The other group of data, testing, is used to obtain the accuracy rate and misclassification rate of the decision tree.

Many decision-tree algorithms have been developed. One of the most famous is ID3 (Quinlan 1983, 1986), whose choice of split attribute is based on information entropy. C4.5 is an extension of ID3 (Prather et al. 1997). It improves computing efficiency, deals with continuous values, handles attributes with missing values, avoids overfitting, and performs other functions. CART (classification and regression tree) is a data-exploration and prediction algorithm that, like C4.5, is a tree-construction algorithm (Martínez and Suárez, 2004). Breiman et al. (1984) summarized the classification and regression tree. Instead of information entropy, it introduces measures of node impurity. It is used on a variety of problems, such as the detection of chlorine from the data contained in a mass spectrum (Berson and Smith, 1997). CHAID (Chi-square automatic interaction detector) is similar to CART but differs in how it chooses a split node. It relies on a Chi-square test on contingency tables to determine which categorical predictor is farthest from independence with the prediction values (Bittencourt and Clarke, 2003). It also has an extended version, Exhaustive CHAID.

Although decision trees may not be the best method in terms of classification accuracy, even people who are not familiar with them find them easy to use and understand. Figure 1 shows a binary decision tree. It uses a circle for a decision node and a square for a terminal node. Each decision node has a condition represented by a function F, whose parameter is the split point of the split attribute. Each terminal node has a class label C, the value of which represents a class. A decision tree is easy to translate into rules for analysis, and it gives an easily interpreted representation of a nonlinear input-output mapping (Jang 1994).

Figure 1. A typical binary decision tree

Many works address the method of choosing the splitting node and the optimization of tree size, but less attention has been given to the weight of the data attributes. In this study, we use a system-reconstruction analysis method to obtain the weight of each attribute, which we use to transform the raw data. After that, we use the decision-tree algorithms mentioned above to build decision trees, from which we obtain the decision-accuracy and misclassification rates.

3.2 Induction of decision trees algorithm, ID3

ID3 is a typical decision-tree algorithm. It introduces information entropy as the measure for choosing the splitting attribute. It builds a tree from root to leaf in a top-down sequence, and each path from root to leaf forms a decision rule. We discuss the theory of ID3 below.

(1) Define function and expression

Definition 1. $D$ is a training data set whose attributes are divided into two parts: non-target and target. The non-target attributes are named $Q = (Q_1, \ldots, Q_m)$, where each attribute $Q_i$ ($1 \le i \le m$) takes $k$ values $\{a_1, \ldots, a_k\}$. The target attribute (usually just one attribute) is named $C$. Suppose it has $l$ values; thus we get $l$ classes, $C = \{C_1, \ldots, C_l\}$. Let $D_j$ be the subset of $D$ whose class is $C_j$ and $|D_j|$ be the number of elements in $D_j$. The information entropy of data set $D$ is defined as

$$E(D) = -\sum_{j=1}^{l} P_j \log_2 P_j \tag{8}$$

where $P_j$ is the proportion of $D$ belonging to class $C_j$:

$$P_j = \frac{|D_j|}{|D|} \tag{9}$$

Definition 2. The reduction in impurity of a collection of training examples achieved by attribute $Q_i$ is measured by the information gain $Gain(D, Q_i)$:

$$Gain(D, Q_i) = E(D) - E(D, Q_i) \tag{10}$$

$$E(D, Q_i) = \sum_{v=1}^{k} P^{v} E(D^{v}) \tag{11}$$

where $D^{v}$ is the $v$-th subset obtained by dividing $D$ on attribute $Q_i$, and

$$P^{v} = \frac{|D^{v}|}{|D|} \tag{12}$$

(2) Processing

The target of the ID3 algorithm is to find the attribute with maximum information gain and to use it as the splitting attribute. The definition of information entropy is therefore central: the lower the entropy of the resulting subsets, the cleaner the classification.

3.3 C4.5 algorithm

C4.5 is an extended version of ID3. It improves the attribute selection measure, avoids overfitting the data, supports reduced-error pruning, handles attributes with different weights, improves computing efficiency, handles missing values and continuous attributes, and performs other functions. It builds on the idea of ID4 and, instead of the information gain used in ID3, introduces an information gain ratio.
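The difference between the information gain used by ID3 and the gain ratio introduced by C4.5 can be seen on a toy example. The sketch below is a minimal illustration, not the authors' implementation. The function names and the tiny data set (`smoker`, `patient_id`, `chd`) are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Information entropy (base 2) of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Entropy of the labels minus their entropy after splitting on `values`."""
    total = len(labels)
    subsets: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: four cases, one binary attribute, one unique identifier.
smoker = ["yes", "yes", "no", "no"]
patient_id = ["a", "b", "c", "d"]
chd = ["pos", "pos", "neg", "neg"]

# Both attributes separate the classes perfectly, so their gain is equal;
# only the gain ratio penalizes the many-valued identifier.
print(information_gain(smoker, chd), information_gain(patient_id, chd))
print(gain_ratio(smoker, chd), gain_ratio(patient_id, chd))
```

In this example an identifier-like attribute with one value per case obtains the same information gain as the genuinely informative attribute, while its gain ratio is halved. This is the bias that dividing by the attribute's own entropy is meant to correct.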
We use the same data set as for ID3 to explain the theory of C4.5.

Definition 3. Attribute $V$ has $n$ distinct values $\{V_1, \ldots, V_n\}$, so that $D$ is separated into subsets $D_1, D_2, \ldots, D_n$. $|D|$ is the number of examples in data set $D$, also written $|T|$. $|T_i|$ is the number of examples with $V = V_i$; $|C_j| = freq(C_j, T)$ is the number of examples of class $C_j$; and $|C_{jv}|$ is the number of examples of class $C_j$ with $V = V_i$.

Probability of $C_j$:

$$P(C_j) = \frac{|C_j|}{|D|} = \frac{freq(C_j, T)}{|T|} \tag{13}$$

Probability of $V = v_i$:

$$P(v_i) = \frac{|T_i|}{|D|}$$

Probability of $C_j$ when $V = v_i$:

$$P(C_j \mid v_i) = \frac{|C_{jv}|}{|T_i|}$$

Definition 4. Information gain ratio

(1) Class information entropy:

$$E(C) = -\sum_{j} P(C_j) \log_2 P(C_j) = -\sum_{j} \frac{freq(C_j, T)}{|T|} \log_2 \frac{freq(C_j, T)}{|T|} = info(T) \tag{14}$$

(2) Class conditional entropy:

$$E(C \mid V) = -\sum_{i} P(v_i) \sum_{j} P(C_j \mid v_i) \log_2 P(C_j \mid v_i) = \sum_{i} \frac{|T_i|}{|T|}\, info(T_i) = info_v(T) \tag{15}$$

(3) Information gain:

$$Gain(C, V) = E(C) - E(C \mid V) = info(T) - info_v(T)$$

(4) Information entropy of attribute $V$:

$$E(V) = -\sum_{i} P(v_i) \log_2 P(v_i) = -\sum_{i} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|} = split\_info(V) \tag{16}$$

(5) Information gain ratio:

$$Gain\text{-}ratio(V) = \frac{Gain(C, V)}{E(V)} = \frac{gain(V)}{split\_info(V)} \tag{17}$$

C4.5 selects the splitting attribute by the information gain ratio, which normalizes the information gain used in ID3 by the entropy of the attribute itself.

3.4 CART

CART (classification and regression tree) is another decision tree algorithm, developed by Breiman (Breiman et al. 1984). The tree is constructed from the training set and then pruned by the minimum cost-complexity principle (Jang 1994). Unlike C4.5, which uses information entropy as the measure for choosing a splitting attribute, it uses impurity. Some key formulas are shown below (Prather et al. 1997):

Impurity:

$$i(t) = -\sum_{j=1}^{k} p(w_j \mid t) \log p(w_j \mid t) \tag{18}$$

Best division:

$$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R) \tag{19}$$

If there is no significant decrease in the impurity measure and no further division can be made, node $t$ becomes a terminal node. The class $w_j$ assigned to terminal node $t$ is the one that maximizes the conditional probability $p(w_j \mid t)$ (Prather et al. 1997).

3.5 Chi-squared automatic interaction detector, CHAID

CHAID is one of the oldest tree-classification methods. It was originally proposed by Kass (1980). According to Ripley (1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and Messenger (1973).
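CHAID's decisions rest on Pearson's Chi-square statistic computed on a contingency table of predictor categories against classes. The sketch below is a minimal illustration under stated assumptions: it computes only the statistic, not the p-value or its Bonferroni adjustment (which need the Chi-square distribution function), and it picks the pair of categories with the smallest statistic as the least different pair. That shortcut matches the largest p-value only when all pairwise tables have the same degrees of freedom. The function names and the counts are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def chi_square(table: List[List[int]]) -> float:
    """Pearson Chi-square statistic of a contingency table (rows x columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:  # skip cells of empty rows or columns
                statistic += (observed - expected) ** 2 / expected
    return statistic


def least_different_pair(counts: Dict[str, List[int]]) -> Tuple[str, str]:
    """Return the two categories whose class distributions differ least."""
    return min(
        combinations(counts, 2),
        key=lambda pair: chi_square([counts[pair[0]], counts[pair[1]]]),
    )


# Hypothetical class counts [CHD, no CHD] for three categories of one predictor.
counts = {"low": [10, 40], "medium": [12, 38], "high": [35, 15]}

pair = least_different_pair(counts)
print(pair, chi_square([counts[pair[0]], counts[pair[1]]]))
```

Here "low" and "medium" have nearly identical class distributions, so they form the candidate pair for merging, while "high" stays separate.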
CHAID grows non-binary trees through a relatively simple algorithm that is particularly well suited to the analysis of larger data sets, and it has been particularly popular in marketing research. The algorithm proceeds as follows:

(1) Preparing predictors: create categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with an approximately equal number of observations.

(2) Merging categories: cycle through the predictors to determine, for each predictor, the pair of (predictor) categories that is least significantly different with respect to the dependent variable. For classification problems (where the dependent variable is categorical as well), this is done with a Chi-square test (Pearson Chi-square). If the pair is not significantly different, the two categories are merged and the step is repeated. If the difference for the respective pair of predictor categories is statistically significant, a Bonferroni-adjusted p-value is computed for the set of categories of the respective predictor.

(3) Selecting the split variable: choose as the split variable the predictor with the smallest adjusted p-value, i.e., the predictor variable that will yield the most significant split. If the smallest (Bonferroni) adjusted p-value for any predictor is greater than some alpha-to-split value, no further splits are performed and the respective node is a terminal node.

Continue this process until no further splits can be performed.

Exhaustive CHAID, a modification of the basic CHAID algorithm, performs a more thorough merging and testing of predictor variables, and hence requires more computing time. For large data sets, and those with many continuous predictor variables, this modification of the simpler CHAID algorithm may require significant computing time.

4. EXPERIMENT DESIGN

After we digitalized the CHD (coronary heart disease) data (1,723 records of 71 attributes), about 1,400 records were used as the training set; the remaining 323 were used as the testing set. From these, we obtain two kinds of data: raw and weighted. In the raw data all attributes count equally, while the weighted data are processed by the system reconstruction method so that each attribute carries a weight value. We use these two kinds of data as the experiment data. The analysis methods in the experiment are C4.5, CART, CHAID, and Exhaustive CHAID. We run every algorithm on the raw and the weighted data to compare the decision-tree parameters and the correction rate. The results are shown in Table 3.

Table 3. Comparison of experimental results by various algorithms (correction rate comparison of decision tree algorithms)

Decision-tree algorithm   Data type       Internal node number, max. tree depth, leaf number   Correction rate
C4.5                      raw data        33 5 35                                              83.3%
                          weighted data   6 5 5                                                86.3%
CART                      raw data        0 6                                                  73.4%
                          weighted data   8 6 9                                                78.4%
CHAID                     raw data        9 3 6                                                79.7%
                          weighted data   3                                                    8.%
Exhaustive CHAID          raw data        5 3 9                                                78.%
                          weighted data   7 3 0                                                8.8%

Figure 2 clearly reveals that the weighted data have a higher correction rate than the raw data. A good decision tree is usually judged by the following aspects: minimum leaf number, minimum tree depth, and correction rate. From the experiments we learned that weighted data yield a better correction rate than raw data.
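The experimental protocol can be sketched as follows, assuming a split of 1,400 training and 323 testing records out of 1,723. The paper does not say how the records were assigned to the two sets, so the seeded random shuffle here is an assumption, and the function names are hypothetical. The correction rate is taken to be the share of test cases whose predicted class equals the diagnosed class.

```python
import random
from typing import List, Sequence, Tuple


def split_records(
    records: Sequence[object], n_train: int, seed: int = 0
) -> Tuple[List[object], List[object]]:
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    return shuffled[:n_train], shuffled[n_train:]


def correction_rate(predicted: Sequence[object], actual: Sequence[object]) -> float:
    """Fraction of cases whose predicted class matches the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Placeholder record identifiers standing in for the 1,723 patient cases.
records = list(range(1723))
train, test = split_records(records, n_train=1400)
print(len(train), len(test))

# Hypothetical predictions on four test cases.
rate = correction_rate(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"])
print(f"{rate:.1%}")
```

Using the same split for the raw and the weighted data keeps the comparison between the two fair, since any difference in correction rate then comes from the weighting rather than from the choice of test cases.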
From the same data set we can see that CHAID yields the minimum leaf number and minimum tree depth, with C4.5 in the middle and CART in the last position (see Figure 3). The figure also shows that weighting the data does not noticeably affect these two parameters.

Figure 2. Correction rate comparison of decision tree algorithms (series: raw data, weighted data; x-axis: C4.5, CART, CHAID, E-CHAID; y-axis: correction rate, 65.00% to 90.00%)

Figure 3. Decision-tree algorithm parameters comparison (x-axis: C4.5, CART, CHAID, E-CHAID). Different signs represent different parameters and different data sets: diamonds are the maximum tree depth for unweighted data; empty circles are the maximum tree depth for weighted data; triangles are the leaf number for unweighted data; and empty squares are the leaf number for weighted data.

5. CONCLUSION

The decision-tree algorithm is one of the most effective classification methods. The data used in this paper were collected directly from clinical diagnoses, and their reliability was confirmed by coronary artery radiography. Such data allow us to judge the efficiency and correction rate of the algorithms. From the data we conclude that weighting by the system-reconstruction method yields a higher correction rate but has little effect on the leaf number and tree depth of the decision tree.
Further work will proceed in several parts. First, we will compare methods of weighting the data in order to find the best weighting method for decision-tree classification. Second, we will study other classification methods, such as neural networks and genetic algorithms, on the weighted data studied here. Third, we will study principal component analysis, rough sets, and feature selection to reduce the number of attributes of the coronary heart disease data.

References

Ashby, W. R. (1965), Constraint Analysis of Many-dimensional Relations, General Systems Yearbook, 9, 99-105.
Aslandogan, Y. A. and Mahajan, G. A. (2004), Evidence Combination in Medical Data Mining, Proceedings of International Conference on Information Technology: Coding and Computing, IEEE.
Berson, A. and Smith, S. J. (1997), Data Warehousing, Data Mining, & OLAP, 365, McGraw-Hill.
Bittencourt, H. R. and Clarke, R. T. (2003), Use of Classification and Regression Trees (CART) to Classify Remotely Sensed Digital Images, IEEE, 3751-3753.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth.
Cavallo, R. E. and Klir, G. J. (1979), Reconstructability Analysis of Multi-dimensional Relations: A Theoretical Basis for Computer-aided Determination of Acceptable Systems Models, Int. J. of General Systems, 5, 143-171.
Cavallo, R. E. and Klir, G. J. (1981), Reconstructability Analysis: Evaluation of Reconstruction Hypotheses, Int. J. of General Systems, 7, 7-32.
Jang, J.-S. R. (1994), Structure Determination in Fuzzy Modeling: A Fuzzy CART Approach, IEEE, 480-485.
Jones, B. (1989), A Program for Reconstructability Analysis, Int. J. of General Systems, 15, 199-205.
Klir, G. J. (1976), Identification of Generative Structures in Empirical Data, Int. J. of General Systems, 3, No. 2, 89-104.
Martínez-Muñoz, G. and Suárez, A. (2004), Using all Data to Generate Decision Tree Ensembles, IEEE Trans. on Systems, Man and Cybernetics Part C: Applications and Reviews, 34, No. 4, 393-397.
Prather, J. C., Lobach, D. F., Goodwin, L. K., Hales, J. W., Hage, M. L., and Hammond, W. E. (1997), Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse, Proc AMIA Annual Fall Symposium, 101-105.
Quinlan, J. R. (1983), Learning Efficient Classification Procedures and Their Application to Chess End Games, Machine Learning: An Artificial Intelligence Approach, 1, 463-482.
Quinlan, J. R. (1986), Induction of Decision Trees, Machine Learning, 1, 81-106.
Shu, G. (1997), Reconstruction Analysis Methods for Forecasting, Risks, Design, Dynamical Problems and Applications, Second Workshop of IIGSS, 7-79.
Shu, G. (1998), Reconstructability Analysis with Multiversity Information and Knowledge, Systems Science and its Applications, Third Workshop of IIGSS, 69-74.
Shu, G. (2000), Reconstructability Analysis in China Special Issue, Int. J. of General Systems, 29, No. 3.
Zwick, M. (2000), OCCAM: A Reconstructability Analysis Software Package, World Congress of the Systems Sciences/44th Annual Meeting of ISSS, Toronto, Canada, July 16-22.