DATA MINING CLASSIFICATION ALGORITHMS FOR KIDNEY DISEASE PREDICTION

Dr. S. Vijayarani 1, Mr. S. Dhayanand 2
Assistant Professor 1, M.Phil Research Scholar 2
Department of Computer Science, School of Computer Science and Engineering, Bharathiar University, Coimbatore, Tamilnadu, India 1, 2

ABSTRACT
Data mining is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. In other terms, it is the extraction of information from a huge database. Data mining plays a vital role in several applications such as business organizations, educational institutions, government sectors, the health care industry, science and engineering. In the health care industry, data mining is predominantly used for disease prediction. Numerous data mining techniques exist for predicting diseases, namely classification, clustering, association rules, summarization and regression. The main objective of this research work is to predict kidney diseases using the classification algorithms Naïve Bayes and Support Vector Machine (SVM). This research work mainly focuses on finding the best classification algorithm based on two performance factors, classification accuracy and execution time. From the experimental results it is observed that SVM classifies more accurately than the Naïve Bayes classifier (76.32% versus 70.96% correctly classified instances), whereas Naïve Bayes executes faster.

KEYWORDS
Data mining, Disease prediction, SVM, Naïve Bayes, Glomerular Filtration Rate (GFR)

1. INTRODUCTION
Data mining is an approach that combines a mixture of techniques to identify blocks of data or decision-making knowledge in a database and to extract them in such a way that they can be put to use in decision support, forecasting and estimation [11]. The data is often voluminous, but it contains information that is useful. The two major types of model that can be created in data mining are predictive and descriptive. Under these two models there are various tasks that are used in the data mining process. A predictive model estimates values of data on the basis of historical data, using known results found from other data.
On the other hand, a descriptive model identifies patterns or relationships in data. Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties [5]. Each task under both data mining models has many algorithms, which are used for various purposes according to the user's requirements. The tasks of the predictive and descriptive models are classification, clustering, summarization, prediction, time series analysis, association rules and regression [3].

DOI: 10.5121/ijci.2015.4402
To anticipate solutions to various problems, data mining employs distinct tasks such as classification and clustering. It provides assurance about the predicted solutions in terms of the stability of prediction and the frequency of correct predictions. Many researchers have built successful work on data mining techniques. These techniques include statistics, machine learning, decision trees, hidden Markov models, genetic algorithms, meta learning and so on. Data mining systems depend on databases to supply the raw input, and this raises problems because databases tend to be dynamic, incomplete, noisy and large. Other problems arise from the insufficiency and irrelevance of the information stored. The major issues in data mining can be categorized as noisy or missing data, limited information, user interaction, prior knowledge, uncertainty, size, updates and irrelevant fields.

Medical data mining has high potential in the medical domain for extracting the hidden patterns in a dataset [9]. These patterns are used for medical diagnosis and prognosis. Medical data are globally scattered, heterogeneous and voluminous in nature. To offer a user-oriented approach to the novel and hidden patterns of the data, the data should be collected together [16]. A major problem in health science or bioinformatics research is reaching the correct diagnosis from certain important information. Generally, numerous tests involve the classification or clustering of large-scale data for the purpose of careful examination. These test procedures are assumed to be essential in order to reach the final diagnosis. However, too many tests can complicate the main diagnostic process and make it difficult to obtain the end result, particularly when many tests must be performed to find a disease [12]. This difficulty can be addressed with machine learning, which can be used directly to obtain the end result with the help of several artificial intelligence algorithms that act as classifiers.
Classification is one of the most important techniques in data mining. To perform classification, the data must first be coded and then placed into groups that can be interpreted by a human. This research work describes classification algorithms and analyzes their performance. The performance factors used for analysis are classification accuracy and execution time. The main objective of this research work is to predict kidney diseases (Acute Nephritic Syndrome, Chronic Kidney Disease, Acute Renal Failure, Chronic Glomerulonephritis) using the classification algorithms SVM and Naïve Bayes, and to find the more efficient algorithm.

The remaining portion of the paper is organized as follows. Related works are discussed in Section 2. The proposed methodology is given in Section 3. Section 4 analyzes the experimental results. Section 5 discusses the results and Section 6 gives the conclusion.

2. LITERATURE REVIEW
Giovanni Caocci et al. [7] compared an Artificial Neural Network and Logistic Regression for predicting long-term kidney transplantation outcome. The comparison was based on the sensitivity and specificity of Logistic Regression and the Artificial Neural Network in predicting kidney rejection in ten training and validation datasets of kidney transplant recipients. The experimental results showed that both algorithmic
approaches were complementary and that their combined use can improve the clinical decision-making process and the prognosis of kidney transplantation.

Lakshmi K.R et al. [10] analyzed three supervised machine learning algorithms: Artificial Neural Networks, Decision Tree and Logical Regression. These algorithms were applied to kidney dialysis data. For the classification process they used a data mining tool named Tanagra. 10-fold cross-validation was used to evaluate the classified data, followed by a comparison of the results. From the experimental results they observed that ANN performed better than the Decision Tree and Logical Regression algorithms.

Tommaso Di Noia et al. [14] developed a software tool that exploits the power of artificial neural networks to classify patients' health status potentially leading to End Stage Kidney Disease (ESKD). The classifier leverages the results returned by an ensemble of ten networks trained using data collected over a period of thirty-eight years at the University of Bari. The refined tool has been made available both as an online web application and as an Android mobile app. Its clinical usefulness rests on the largest cohort worldwide.

Anu Chaudhary et al. [2] developed a prediction system using the A-priori and k-means algorithms for heart disease and kidney failure prediction. In their work, the A-priori and k-means algorithms were used to predict kidney failure patients with 42 attributes. They analyzed the data using machine learning tools such as distribution and attribute statistics, followed by the A-priori and k-means algorithms. They evaluated the data using Receiver Operating Characteristic (ROC) plots and calibration plots.

Andrew Kusiak et al. [1] used data preprocessing, data transformations and a data mining approach to elicit knowledge about the interaction between many measured parameters and patient survival. Two different data mining algorithms were employed for extracting knowledge in the form of decision rules.
Those rules were used by a decision-making algorithm, which predicts the survival of new unseen patients. Important parameters identified by data mining were interpreted for their medical significance. The concept introduced in their research work was applied and tested using data collected at four dialysis sites. The approach presented in their paper reduces the cost and effort of selecting patients for clinical studies. Patients can be chosen based on the prediction results and the most important parameters discovered.

3. METHODOLOGY
3.1 Dataset
A synthetic kidney function test (KFT) dataset was created for the analysis of kidney disease. This dataset contains five hundred and eighty-four instances, and six attributes are used in this comparative analysis. The attributes in this KFT dataset are Age, Gender, Urea, Creatinine and Glomerular Filtration Rate (GFR). The dataset covers diseases that affect the kidneys.

Blood Urea Nitrogen (BUN): Urea is a waste product that is eliminated by the kidneys. Nitrogen is a by-product derived from urea and is also eliminated by the kidneys. When kidney function declines, the BUN may be elevated.
Creatinine: This is a waste product of the muscles and is normally eliminated by the kidneys. When kidney function declines, the creatinine level may be elevated.

Glomerular Filtration Rate (GFR): This is an essential measure and it is used to calculate the creatinine clearance. Normally this measure is calculated using the following attributes: the age, body size and sex of the patient, and creatinine. It is considered the best measure of the kidney function level and is expressed as a percentage (e.g. 30%).

Figure 1. System Architecture (flow: Dataset; Classification Algorithms: Naïve Bayes, SVM; Performance Accuracy; SVM)

3.2 Classification
Classification maps data into predefined groups or classes. In classification the classes are determined before examining the data, thus it is often referred to as supervised learning. Classification is the process of sorting a collection of objects, data or ideas into groups whose members have one or more characteristics in common. In this research work, the Naïve Bayes and SVM algorithms are used to classify kidney diseases from the dataset [4].

3.2.1 Naïve Bayes
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naïve) independence assumptions. A more descriptive term
for the underlying probability model would be "independent feature model". This strict independence assumption rarely holds true in real-world applications, hence the characterization as naïve, yet the algorithm tends to perform well and learn rapidly in various supervised classification problems [6]. An advantage of the Naïve Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined and not the entire covariance matrix. Table 1 states Bayes' theorem.

Table 1. Bayes Theorem
1. $P(C \mid X) = P(X \mid C)\, P(C) / P(X)$.
2. $P(X)$ is constant for all classes.
3. $P(C)$ = relative frequency of class C samples. The class C for which $P(C \mid X)$ is maximized is the class C for which $P(X \mid C)\, P(C)$ is maximized.
4. Problem: computing $P(X \mid C)$ directly is infeasible! [15] [17].

3.2.2 Support Vector Machine (SVM)
The support vector machine is a machine learning technique based on statistical learning theory. It constructs a separating hyperplane in the descriptor space of the training data, and data points are classified according to the side of the hyperplane on which they lie. The advantage of the SVM is that, by use of the so-called kernel trick, the distance between a data point and the hyperplane can be calculated in a transformed (nonlinear) feature space without explicitly transforming the original descriptors. The radial basis function kernel (Gaussian kernel), which is the most commonly used, was applied in this study. The kernel function is expressed as follows [8]:

$$K(x_i, x) = \exp\left(-\frac{\lVert x_i - x \rVert^2}{2\sigma^2}\right) \quad (a)$$

In equation (a), the kernel width parameter $\sigma$ controls the amplitude of the Gaussian function and thereby the generalization ability of the SVM. The regularization parameter C controls the trade-off between maximizing the margin and minimizing the training error. In recent times, particular attention has been dedicated to support vector machines (SVMs) for the classification of diseases.
SVMs have frequently been found to provide higher classification accuracies than other widely used pattern recognition techniques, such as the maximum likelihood and the multilayer perceptron neural network classifiers. Table 2 presents the mathematical formulation of the support vector machine.

Table 2: SVM Mathematical Formulation

Step 1: Let us assume a supervised binary classification problem. Let the training set consist of N vectors from the d-dimensional feature space, $x_i \in \mathbb{R}^d$ ($i = 1, 2, \ldots, N$).

Step 2: A target $y_i \in \{-1, +1\}$ is associated with each vector $x_i$.

Step 3: Let us consider that the two classes are linearly separable. This means that it is possible to find at least one hyperplane (linear surface), defined by a vector $w \in \mathbb{R}^d$ (normal to the hyperplane) and a bias $b$, that separates the two classes without errors.

Step 4: The membership decision rule can be based on the function $\mathrm{sgn}[f(x)]$, where $f(x)$ is the discriminant function associated with the hyperplane and defined as

$$f(x) = w \cdot x + b. \quad (1)$$

To find such a hyperplane, one should estimate $w$ and $b$ so that

$$y_i (w \cdot x_i + b) > 0, \quad i = 1, 2, \ldots, N. \quad (2)$$

Step 5: The SVM approach consists of finding the optimal hyperplane that maximizes the distance between the closest training sample and the separating hyperplane. This distance can be expressed as $1/\lVert w \rVert$ with a simple rescaling of the hyperplane parameters $w$ and $b$ such that

$$\min_{i = 1, 2, \ldots, N} y_i (w \cdot x_i + b) \ge 1. \quad (3)$$

Step 6: Consequently, the optimal hyperplane can be determined as the solution of the following convex quadratic programming problem:

$$\text{minimize: } \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to: } y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N. \quad (4)$$

Step 7: This classical linearly constrained optimization problem can be translated (using a Lagrangian formulation) into the following dual problem:

$$\text{maximize: } \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to: } \sum_{i=1}^{N} \alpha_i y_i = 0 \text{ and } \alpha_i \ge 0, \quad i = 1, 2, \ldots, N. \quad (5)$$

Step 8: The Lagrange multipliers $\alpha_i$ ($i = 1, 2, \ldots, N$) in (5) can be estimated using quadratic programming (QP) methods. The discriminant function associated with the optimal hyperplane becomes an equation depending both on the Lagrange multipliers and on the training samples, i.e.,

$$f(x) = \sum_{i \in S} \alpha_i y_i (x_i \cdot x) + b \quad (6)$$

where $S$ is the subset of training samples corresponding to the nonzero Lagrange multipliers $\alpha_i$. It is worth noting that the Lagrange multipliers effectively weight each training sample according to its importance in determining the discriminant function. The training samples associated with nonzero weights are called support vectors. These lie at a distance exactly equal to $1/\lVert w \rVert$ from the optimal separating hyperplane.
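The discriminant function of equation (6), combined with the Gaussian kernel of equation (a), can be sketched in Python as follows. This is a minimal illustration and not the MATLAB implementation used in the experiments. The support vectors, labels, multipliers, bias and sigma below are hypothetical values chosen for the example, and the dot product $(x_i \cdot x)$ of equation (6) is replaced by $K(x_i, x)$, which is the usual way the kernel trick is applied.

```python
import math

def rbf_kernel(x_i, x, sigma):
    """Gaussian (RBF) kernel: exp(-||x_i - x||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def discriminant(x, support_vectors, labels, alphas, b, sigma):
    """Kernelized form of equation (6):
    f(x) = sum over support vectors of alpha_i * y_i * K(x_i, x) + b."""
    total = 0.0
    for sv, y, alpha in zip(support_vectors, labels, alphas):
        total += alpha * y * rbf_kernel(sv, x, sigma)
    return total + b

def classify(x, support_vectors, labels, alphas, b, sigma):
    """Decision rule sgn[f(x)]; a value of exactly 0 is assigned to class +1."""
    f = discriminant(x, support_vectors, labels, alphas, b, sigma)
    return 1 if f >= 0 else -1

# Hypothetical model (not taken from the paper): two support vectors,
# one per class, with multipliers that satisfy sum(alpha_i * y_i) = 0.
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 2.0)]
LABELS = [-1, 1]
ALPHAS = [1.0, 1.0]
BIAS = 0.0
SIGMA = 1.0

if __name__ == "__main__":
    for point in [(1.9, 2.1), (0.1, -0.1)]:
        label = classify(point, SUPPORT_VECTORS, LABELS, ALPHAS, BIAS, SIGMA)
        print(point, "->", label)
```

In this sketch a point close to the positive support vector is assigned to class +1 and a point close to the negative one to class -1. In practice the multipliers and bias come from solving the dual problem (5) with a QP solver rather than being set by hand.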
4. EXPERIMENTAL RESULTS
This work is implemented in MATLAB. MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment and fourth-generation programming language. Developed by MathWorks, MATLAB allows matrix manipulations, implementation of algorithms, creation of user interfaces, plotting of functions and data, and interfacing with programs written in other languages, including C, C++, Java, Fortran and Python. The experimental comparison of the classification algorithms is based on the performance measures of classification accuracy, error rate and execution time.

4.1 Classification Accuracy

Accuracy
Accuracy is defined as the number of correctly classified instances divided by the total number of instances present in the dataset:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is True Positive, FP is False Positive, TN is True Negative and FN is False Negative.

TP Rate
The TP rate is the proportion of actual positive instances that are correctly identified. The true-positive rate is also called sensitivity.

Precision
Precision is the ratio of the number of instances correctly classified as belonging to a class to the total number of instances classified as belonging to that class.
International Journal on Cybernetics & Informatics (IJCI) Vol. 4, No. 4, August 2015

F-Measure
The F-measure combines precision and recall into a single score. In the field of Information Retrieval, the F-measure is commonly used to estimate query classification performance. Table 5 presents the classification accuracy measures obtained on the dataset using the classification algorithms SVM and Naïve Bayes.

Table 5: Accuracy Measure for Classifier Algorithms

Algorithms     Correctly Classified    Incorrectly Classified    TP Rate    Precision    F Measure    Recall
               Instances (%)           Instances (%)
Naïve Bayes    70.96                   29.04                     0.709      0.809        0.192        0.109
SVM            76.32                   23.68                     0.763      0.820        0.213        0.173

Figure 2 represents the accuracy measure and Figure 3 represents the performance measure for the classification algorithms Naïve Bayes and SVM. From the experimental results, SVM performs better than the Naïve Bayes algorithm in the classification process. The charts are plotted from the values given in Table 5.
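The measures defined in Section 4.1 can all be derived from the four confusion-matrix counts. The sketch below shows the standard binary-class formulas in Python. The counts used in the example are hypothetical and are not the paper's data, and the function name `classification_metrics` is introduced only for illustration. For a multi-class problem such as the one studied here, these values would typically be computed per class and then averaged, which is an assumption about the evaluation procedure rather than something the paper states.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, TP rate (recall), precision and F-measure
    for one class, given confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # TP rate, sensitivity and recall are the same quantity.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "tp_rate": recall,
        "precision": precision,
        "f_measure": f_measure,
    }

if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the formulas.
    metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

With these hypothetical counts the accuracy is 0.85 and the precision is 0.80. The error rate mentioned in Section 4 is simply one minus the accuracy.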
Figure 2: Accuracy measure for Classification Algorithms

Figure 3: Performance measure for Classification Algorithms (TP Rate, Precision, F Measure and Recall for Naïve Bayes and SVM)

4.2 Execution Time
Table 6 presents the execution time of the classification algorithms.
Table 6: Execution Time Analysis

Algorithms     Execution Time in Seconds
Naïve Bayes    1.29
SVM            3.22

Figure 4: Execution Time of Classification Algorithms

Figure 4 represents the time taken for execution. Naïve Bayes has a shorter execution time than SVM. The chart is plotted from the values given in Table 6. Table 7 presents the classification of kidney diseases, as given below.

Table 7. Classification of Kidney Diseases
Kidney Disease                  Naïve Bayes    SVM
Normal                          428            435
Acute Nephritic Syndrome        49             45
Chronic Kidney Disease          35             42
Acute Renal Failure             19             19
Chronic Glomerulonephritis      52             42

Figure 5: Classification of Kidney Diseases

Figure 5 represents the kidney diseases as classified by the Naïve Bayes and SVM algorithms. Based on the chart, SVM gives the better overall classification result of the two algorithms.
5. RESULT AND DISCUSSION
The algorithm with the higher accuracy and the minimum execution time is chosen as the best algorithm. In this classification, each classifier shows a different accuracy rate. SVM has the highest classification accuracy and is considered the best classification algorithm, whereas Naïve Bayes has the shortest execution time.

6. CONCLUSION
In this research work, classification is used to classify four types of kidney disease. The Support Vector Machine (SVM) and Naïve Bayes classification algorithms are compared on two performance factors, classification accuracy and execution time. From the results, it can be concluded that SVM achieves higher classification performance and yields more accurate results, hence it is considered the better classifier when compared with the Naïve Bayes classifier. However, the Naïve Bayes classifier classifies the data in the shortest execution time.

REFERENCE
[1] Andrew Kusiak, Bradley Dixon, Shital Shah, (2005) Predicting survival time for kidney dialysis patients: a data mining approach, Elsevier Publication, Computers in Biology and Medicine 35, page no 311-327
[2] Anu Chaudhary, Puneet Garg, (2014) Detecting and Diagnosing a Disease by Patient Monitoring System, International Journal of Mechanical Engineering And Information Technology, Vol. 2, Issue 6, June, Page No: 493-499.
[3] Approaches, Knowledge-Oriented Applications in Data Mining, Prof. Kimito Funatsu (Ed.), ISBN: 978-953-307-154-1, InTech, http://www.intechopen.com/books/knowledge-oriented-applications-in-data-mining/mining-enrollment-data-using-descriptive-and-predictive-approaches
[4] Cristóbal Romero, Data Mining Algorithms to Classify Students, http://sci2s.ugr.es/keel/pdf/specific/congreso/data%20mining%20algorithms%20to%20classify%20Students.pdf
[5] Fadzilah Siraj, Mansour Ali Abdoulha, (2011).
Mining Enrollment Data Using Descriptive and Predictive
[6] George Dimitoglou, Comparison of the C4.5 and a Naive Bayes Classifier for the Prediction of Lung Cancer Survivability
[7] Giovanni Caocci, Roberto Baccoli, Roberto Littera, Sandro Orrù, Carlo Carcassi and Giorgio La Nasa, Comparison Between an Artificial Neural Network and Logistic Regression in Predicting Long Term Kidney Transplantation Outcome, Chapter 5, an open access article distributed under the terms of the Creative Commons Attribution License, http://dx.doi.org/10.5772/53104
[8] Gualtieri J. A., Chettri S. R., Cromp R. F. and Johnson L. F., (1999) Support vector machine classifiers as applied to AVIRIS data, in Summaries 8th JPL Airborne Earth Science Workshop, JPL Pub. 99-17, pp. 217-227.
[9] Ian H. Witten and Eibe Frank, (2005) Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2nd edition
[10] Lakshmi K.R, Nagesh Y and Veera Krishna M, (2014) Performance Comparison Of Three Data Mining Techniques For Predicting Kidney Dialysis Survivability, International Journal of Advances in Engineering & Technology, Mar., Vol. 7, Issue 1, pg no. 242-254.
[11] Mahesh Mudhol, Purushothama Gowda, (2004) Data Mining in the Process of Knowledge Discovery in Digital Libraries, 2nd Convention PLANNER, Manipur Uni., Imphal, 4-5 November, 2004, page no 164-167
[12] Ruben D. Canlas Jr, (2009) Data Mining In Healthcare: Current Applications And Issues, August
[13] Tadjudin S. and Landgrebe D.A., (1999) Covariance estimation with limited training samples, IEEE Trans. Geosci. Remote Sensing, vol. 37, pp. 2113-2118, July
[14] Tommaso Di Noia, Vito Claudio Ostuni, Francesco Pesce, Giulio Binetti, David Naso, Francesco Paolo Schena, Eugenio Di Sciascio, (2013) An end stage kidney disease predictor based on an artificial neural networks ensemble, Elsevier Publication, Expert Systems with Applications 40, page no 4438-4445
[15] Uffe B. Kjærulff, Anders L. Madsen, (2005) Probabilistic Networks: an Introduction to Bayesian Networks and Influence Diagrams, 10 May
[16] Vijayarani S., Sudha S., (2013) Comparative Analysis of Classification Function Techniques for Heart Disease Prediction, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 1, Issue 3, May, page no 735-741
[17] Zhang H., Su J., Naive Bayesian classifiers for ranking. Paper appeared in ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy.

AUTHORS
Dr. S. Vijayarani has completed MCA, M.Phil and Ph.D in Computer Science. She is working as Assistant Professor in the School of Computer Science and Engineering, Bharathiar University, Coimbatore. Her fields of research interest are data mining, privacy and security issues in data mining and data streams. She has published papers in international journals and presented research papers at international and national conferences.

Mr. S. Dhayanand has completed MSc in Software Systems.
He is currently pursuing his M.Phil in Computer Science in the School of Computer Science and Engineering, Bharathiar University, Coimbatore. His fields of research interest are data mining and medical mining. He has presented research papers at international and national conferences and symposiums.