Internatona Journa of Scence and Research (IJSR) ISSN (Onne): 39-7064 An Ensembe Cassfcaton Framework to Evovng Data Streams Naga Chthra Dev. R MCA, (M.Ph), Sr Jayendra Saraswathy Maha Vdyaaya, Coege of Arts and Scence, Combatore, Inda Abstract: Data stream cassfcaton poses many chaenges to the data mnng communty. In ths thess, we address four such maor chaenges, namey, nfnte ength, concept-drft, concept-evouton, and feature-evouton. Snce a data stream s theoretcay nfnte n ength, t s mpractca to store and use a the hstorca data for tranng. Concept-drft s a common phenomenon n data streams, whch occurs as a resut of changes n the underyng concepts. Concept-evouton occurs as a resut of new casses evovng n the stream. Feature-evouton s a frequenty occurrng process n many streams, such as text streams, n whch new features (.e., words or phrases) appear as the stream progresses. Most exstng data stream cassfcaton technques address ony the frst two chaenges, and gnore the atter two. In ths thess, we propose an ensembe cassfcaton framework, where each cassfer s equpped wth a nove cass detector, to address concept-drft and concept-evouton. To address feature-evouton, we propose a feature set homogenzaton technque. We aso enhance the nove cass detecton modue usng the Prncpe component anayss by makng t more adaptve to the evovng Stream and enabng t to detect more than one nove cass at a tme wth heterogeneous technque for nove data s. Comparson wth state-of-the-art data stream cassfcaton technques estabshes the effectveness of the proposed approach. Keywords: Informaton Retreva, Data Cassfcaton, Outer Detecton, Nove Data extracton. Introducton Techncay, bg data anayss s anayss of data mnng and technques.nove mnng s the process of fndng correatons or patterns among dozens of feds n arge reatona databases. Severa types of anaytca software are avaabe: statstca, machne earnng, and neura networks. As Nove contents keeps extendng, the no. of pages crawed by the search engnes s ncreases. Wth such arge amount of data, estmatng the reevant nformaton satsfyng the user query s a chaengng task. Data predcton, Extracton and Agnment of bg data from Nove databases s research area to obtan better mechansm and methodoogy to derve hgh precson and accuracy. Athough many data extracton concepts such as [], [] and [3] have proposed n terature reated to research area but they st ag n some measurement regardng the data mnng propertes ke precson and reca measures etc. Therefore, t s a mandatory to ascertan the sutabe souton for extracton and agnment of the bg data. Another wdespread appcaton of Nove predcton s personazaton, n whch users are categorzed based on ther nterests and tastes [4] [7]. In Nove predcton and Extracton, we face chaenges n preprocessng, custerng, cassfcaton and predcton. In exstng works, [8], [9], predcton mode based on fusng severa predcton modes ke Markov and SVM modes has been utzed, even t fas to reduce the fase postve rate. Ths expotaton has enabed us to consderaby mprove the predcton accuracy. In ths paper, we ntroduce an effcent framework for Nove Data extracton and custerng mechansm to user query obfuscatons to aevate the ssue of scaabty, ambguty and precson n the number of query suggestons (predcton) and Query Resut Records (QRR)[0] as a custers. In addton, the resuts ndcate a dramatc mprovement n predcton tme for our obectve. Moreover, the resuts demonstrate the postve effect of our proposed user specfc custerng mode n reducng the sze of the predcton modes through mut correaton factors Voume 3 Issue, November 04 www.sr.net Lcensed Under Creatve Commons Attrbuton CC BY estmatons wthout compromsng the predcton accuracy. Fnay, we present experments to study the effect of sparsty of pages, tranng parttonng, and rankng on the predcton accuracy. The advantages of ths method nove structure of data resuts through optons for agnng teratve and dsunctve data tems to form Resuts sets of query. The rest of the paper s organzed as foows: Sectons descrbes the reated work of state of art methods about Nove data custerng and extracton wth agnment technque, secton 3 descrbes the overa framework wth Methods and souton to acheve the Nove document custerng. Secton 4 descrbes the expermenta resuts of our method and performance measures wth state-of-the-art methods. Secton 5 concudes the paper and outnes possbe future work.. Reated Works. Data Coaboraton based extracton and Content based Predcton In Bg data Anayss, Coaboratve fterng approaches are the most popuar predcton methods and are wdey adopted n Data coaboraton based extracton []. User-based approaches predct the ratngs of actve users based on the ratngs of ther smar users, and tem-based approaches predct the ratngs of actve users based on the computed nformaton of tems smar to those chosen by the actve user. However, on the Nove, n most of the cases, ratng data are aways unavaabe snce nformaton on the Nove s ess structured and more dverse. Query suggeston s cosey reated to query expanson or query substtuton, whch extends the orgna query wth new search terms to narrow down the scope of the search. But dfferent from query expanson, query suggeston ams to suggest fu queres that have been formuated by prevous users so that query ntegrty and coherence are preserved n the suggested queres [8]. Query refnement s another cosey reated noton, snce the obectve of query refnement s Paper ID: OCT480 0
Internatona Journa of Scence and Research (IJSR) ISSN (Onne): 39-7064 nteractvey recommendng new queres reated to a partcuar query.. Concept based mnng and Cck through Data Anayss Concept based mnng mode [][3] has aso been utzed n bg data communty that anayzes terms on the sentence, document, functon, dependency eve and corpus eves s ntroduced. The concept Incuson dependency custerng agorthm can effectvey dscrmnate between nonmportant terms wth respect to sentence semantcs and terms whch hod the concepts that represent the sentence meanng. The smarty between documents s cacuated based on a new concept ncuson dependency measure. The proposed dependency measure takes fu advantage of measures on the sentence, document, and corpus and functon eves n cacuatng the dependency range between documents by the mportance of dependency dscovery, a method for dscoverng XML functona dependences. Functona and ncuson dependency dscovery s mportant to knowedge dscovery, database semantcs anayss and data quaty assessment. In Cck through data anayss, the most common usage s for optmzng Nove search resuts or rankngs [0], Nove search ogs are utzed to effectvey organze the custers of search resuts by earnng nterestng aspects of a topc and generatng more meanngfu custer abes.. Besdes rankng, cck through data s aso we studed n the query custerng probem []. Query custerng s a process used to dscover frequenty asked questons or most popuar topcs on a search engne. 3. Proposed Methodoogy 3. Estabshng the Indexng of Data Warehouse for Evouton of Data from Dfferent stream A fundamenta too n constructon of text cassfcaton s a st of 'stop' words (stop word st) that s used to dentfy frequent words that are unkey to assst n cassfcaton and hence are deeted durng pre-processng. Currenty, we ony remove Engsh stop words (e.g., and, nto, or w) as source code s amost excusvey wrtten wth Engsh acronyms and comments. T now, many stop word sts have been deveoped for Engsh anguage. Then we use the ceanng fter to remove unnecessary punctuaton characters ke commas or semcoons at the start or end of the token that mght have been nserted at formuas or (for exampe, name= or rech from an expresson ke nt name= rech are changed to name and rech). Speca characters that represent mutpcatons, equas, addtons, subtractons, or dvsons from formuas shoud have been emnated n ths process (e.g., from a =b * c +d ony a, b, c, and d shoud get through). 3. Probem Formuaton The man obectve of the proposed probem s to predct the user specfc Query resuts state through an optmzed custerng for the bg data anayss. The near custerng Suffx tree separates the data, but t maxmzes the dstance between the gven data pont to the nearest data pont of each cass. The tranng data set s gven by D {( x, y),( x, y ),( x, y )}, n x R, y {,} () Voume 3 Issue, November 04 www.sr.net Lcensed Under Creatve Commons Attrbuton CC BY Where, number of tranng data, X Tranng data, y cass abe as or - for x for arge data wth drftng A nonnear functon s adopted to map the orgna nput space R n nto N-dmensona feature space of the arge dataset. x) ( x), ( x),..., ( ) () ( N x The separatng hyper pane s deveoped n ths N- dmensona feature space. Then the custerng functon represented as, y( x) sgn(. ( x) b) (3) Where - weght vector and b- scaar. In order to obtan the optma custerng through ensembe cassfer shoud be mnmzed subect to the foowng constrants y[ ( x ). b], =,. (4) The varabe s the postve sack varabes, necessary for mscassfcaton of data n dfferent custer. 3.3 Determnng a feature Evouton and Feature seecton for data cassfcaton usng ensembe cassfcaton One of the most assumptons of ancent data processng s that knowedge s generated from one, statc and hdden perform from the data evovng n the data streams. However, t s hard to be true for data stream earnng, where unpredctabe changes are key to eventuay happen. Concept drft s sad to occur once the underyng functon that generates nstances changes over tme. The Suffx tree custerng s known to be effcent n custerng arge datasets. Ths custerng s one n a the best and aso the best far-famed unsupervsed earnng agorthms that sove the we-known custerng probem n terms arge data through the steps of bg data communty. The obectve functon s gven n Eq. (5), mn J (, ) = C (5) We have, y [ ( x ) * b] (6) where, C margn parameter, - weght vector, x - tranng data, y - cass abe ( or -) for x, - postve sack varabes; 0,,...,, b scaar, number of tranng data. Obectve functon obeys the prncpe of structura rsk mnmzaton n order to obtan the optma souton wth ess fase postve rate for the data custered. The obectve functon n Eqn (5) can be re-modfed by foowng Legrangan prncpe for the data segmentaton and predcton as, Paper ID: OCT480
L Internatona Journa of Scence and Research (IJSR) ISSN (Onne): 39-7064 (, b,, a, ) C a ( y [ ( x )* b] ) (7) Fgure : Bock Dagram of Framework for Ensembe Cassfcaton Beow equaton expans the smarty assgnment as foows Where, a 0, 0(,,..., ) a, -, On substtutng Eq. (8) n Eq. (7), the dua probem becomes, maxw ( a) a a y y( ( x ), ( x) a maxw ( a),, a a y y k( x, x ) The suffx agorthm ams to partton a group of obects supported ther attrbutes/features, nto no. of feature custers, wherever x may be a predefned or user-defned constant nto x custers Predcton based on the query preferences and query frequency suggestons The predcton of the query reevance s cacuated based on the query preferences and query frequency of the user or communty to the partcuar type of data. Frequency suggeston s empoyed through predcton and equvaence of the system n the data evouton and concept drftng n the data streamng n the network to the server. a Voume 3 Issue, November 04 www.sr.net Lcensed Under Creatve Commons Attrbuton CC BY Determnng tempora probabty and tempora pattern reevance of data to the query Tempora probabty s carred out the densty based custerng technque and ts custer empoyed through the rankng of the document, tempora pattern reevance s aso estmated from the custer n terms of entropy and Eucdean cacuaton. Rankng Based on the Integraton Vaues through foowng process Par wse agnment through Smarty estmaton. Par wse agnment s carred through the rankng based on the anayss and par wse agnment s carred out through the agorthm s based on the observaton that the data vaues beongng to the same attrbute usuay have the same data type and may contan smar strngs, especay snce resuts records of the query for the user query. Hostc agnment based predcton methods. Vertces from the same record are not aowed to be ncuded n the same connected component as they are consdered to come from two dfferent attrbutes of the record. If two vertces from the same record breach ths constrant, a path must exst between the two, whch we ca a breach path. Nested structure Agnment through user specfc custerng Hostc data vaue agnment constrans a data vaue n a Resut set to be agned to at most one data vaue from another Resut set. If a Resut set contans a nested structure such that an attrbute has mutpe vaues, then some of the vaues may not be agned to any other vaues. Therefore, nested structure processng dentfes the data vaues of a Resut set that are generated by nested structures. 4. Expermenta Resuts In ths secton, Expermenta Resuts for query based predcton from bg data wth data evouton and feature evouton were carred out usng Nove data and resuts were performed wth performance system confguratons to perform the data scang and extractng nto the proper custers through suffx tree custerng. Intay extractng the framework has been utzed by tranng, vadaton and testng data for cassfcaton of resuts usng hstorca predcton modes dentfy the resuts set estmaton effcenty and effectvey n arge dataset. The performances of the custerng and cassfcaton are expermented and presented n terms of reatve speed, computatona tme as propertes measure of performance usng the arge data set. 4.Query frequency estmaton and tempora probabty estmaton The tempora predcton states observed from the arge data set are as foows: supervsed data, unsupervsed data and sem-supervsed data. Feature extracton through user query modeng Feature Extracton s empoyed n arge dataset wth data drftng and nformaton retreva wth estmatng varous factors n the query anayss to the arge dataset Paper ID: OCT480
Feature extracton: Internatona Journa of Scence and Research (IJSR) ISSN (Onne): 39-7064 () The data n the bg data s evoved wth severa feature cassfcaton wth nove features estmaton n each sampe such as, y, y, y3, y4 and y5, are extracted by the equaton as foows: y k = k c 5 (9) max c where k=,,..., 5, c k Absoute feature data per one sampe. () The absoute nformaton s cacuated for dfferent sampes gven by, Y 6 =og 0 5 max m m c () Tabe : Parameters of cassfcaton and Predcton of data cassfcaton Parameters Notatons used Vaues Learnng rate Λ 0.0 Scang factor Σ Tabe : Performance Parameters to compute Data Extracton mechansm Parameters Notatons used Vaues Number of teraton I 5000 Order of the poynoma Order 3 Scang factor Σ 4.3 Resut Anayss The proposed framework s mpemented and tested usng dfferent types of datasets usng user specfc custer modeng and mut correaton estmaton. An extensve expermenta study was conducted to evauate the effcency and effectveness of the proposed methodoogy on varous parameters of benchmark nstances and the predcton states are obtaned n the graph User Specfc Custerng has been utzed by the tranng the data through the anaysng the user behavor n the personazaton methods n the teratures. Popuar notons of custers ncude groups wth sma dstances among the custer members, dense areas of the data space, ntervas or partcuar statstca dstrbutons. Custerng can therefore be formuated as a mut-obectve optmzaton probem as t requres to custer based on the dfferent user perspectve. The approprate custerng agorthm and parameter settngs (ncudng vaues such as the dstance functon to use, a densty threshod or the number of expected custers) depend on the ndvdua data set and ntended use of the resuts. Custer anayss as such s not an automatc task, but an teratve process of knowedge dscovery or nteractve mut-obectve optmzaton that nvoves tra and faure. It w often be necessary to modfy data pre-processng and mode parameters unt the resut acheves the desred propertes. Fgure : Estmaton of the proposed framework aganst concept based mnng The foowng parameters are utzed to estmate the performance of the bg data cassfcaton and predcton of data for user queres Mnmze average dameter of custers Ths factor estmates the performance of proposed framework n the cassfyng the data wth concept drft.proposed framework by suffx tree custerng proves the accuracy resuts set wth precson and reca n the custer acheved. Maxmum Lkehood: It s a method of estmatng the parameter of a statstca mode. When apped to a data set and gven a statstca mode, maxmum-kehood estmaton provdes estmates for the mode's parameters. We have proved the performance of system n custerng the query based on the severa factors ncuded n the framework and experment to determne the performance factors wth better resuts. 5. Concuson We have mpemented cassfcaton and nove cass detecton technque for concept-drftng data streams that addresses four maor chaenges, namey, nfnte ength, concept-drft, concept-evouton, and feature-evouton wth ess fase aarm rate and fase detecton rates n many scenaros. We have desgned outer detecton, and nove cass nstances mode, as the prme cause of hgh error rates. We aso propose a better aternatve approach for dentfyng nove cass nstances usng dscrete Gn Coeffcent, and theoretcay estabsh ts usefuness. Fnay, we propose a graph-based approach for dstngushng among mutpe nove casses. However, we adopted some dynamc approach usng drft detecton technque utzng the naïve bayes whch emphaszes many on concept-evouton. In performance evouton, resuts have been obtaned wth mproved effcency n terms of propertes ke probabty determnaton, f measure. As a future work, we pan to enhance the cassfcaton souton for arge data streams n terms optmzaton technques to ensembe cassfer to yed a more accurate resuts wth ess fase postve rate. So we ncorporate prncpe component anayss for cassfer estmaton as an optmzaton to the proposed souton. Voume 3 Issue, November 04 www.sr.net Lcensed Under Creatve Commons Attrbuton CC BY Paper ID: OCT480 3
References Internatona Journa of Scence and Research (IJSR) ISSN (Onne): 39-7064 [] A. Bonnaccors, On the Reatonshp between Frm Sze and Export Intensty, Journa of Internatona Busness Studes, XXIII (4), pp. 605-635, 99. (ourna stye) [] R. Caves, Mutnatona Enterprse and Economc Anayss, Cambrdge Unversty Press, Cambrdge, 98. (book stye) [3] M. Cerc, The Swarm and the Queen: Towards a Determnstc and Adaptve Partce Swarm Optmzaton, In Proceedngs of the IEEE Congress on Evoutonary Computaton (CEC), pp. 95-957, 999. (conference stye) [4] H.H. Croke, Specazaton and Internatona Compettveness, n Managng the Mutnatona Subsdary, H. Etemad and L. S, Suude (eds.), Croom- Hem, London, 986. (book chapter stye) [5] K. Deb, S. Agrawa, A. Pratab, T. Meyarvan, A Fast Etst Non-domnated Sortng Genetc Agorthms for Mutobectve Optmzaton: NSGA II, KanGAL report 0000, Indan Insttute of Technoogy, Kanpur, Inda, 000. (technca report stye) [6] J. Gerads, "Sega Ends Producton of Dreamcast," vnunet.com, para., Jan. 3, 00. [Onne]. Avaabe: http://n.vnunet.com/news/6995. [Accessed: Sept., 004]. (Genera Internet ste) Voume 3 Issue, November 04 www.sr.net Lcensed Under Creatve Commons Attrbuton CC BY Paper ID: OCT480 4