Running head: A DATA MINING-BASED OLAP AGGREGATION

A Data Mining-Based OLAP Aggregation of Complex Data: Application on XML Documents

Riadh Ben Messaoud, Omar Boussaid, Sabine Loudcher Rabaséda
{rbenmessaoud omar.boussaid sabine.loudcher}@univ-lyon2.fr
Laboratory ERIC - University of Lyon 2
5 avenue Pierre Mendès-France, 69676 Bron Cedex, France
http://eric.univ-lyon2.fr/

ABSTRACT

Nowadays, most organizations deal with complex data having different formats and coming from different sources. The XML formalism is evolving and becoming a promising solution for modelling and warehousing these data in decision support systems. Nevertheless, classical OLAP tools are still not capable of analyzing such data. In this paper, we associate OLAP and data mining to support advanced analysis of complex data. We provide a generalized OLAP operator, called OpAC, based on Agglomerative Hierarchical Clustering (AHC). OpAC is adapted for all types of data since it deals with data cubes modelled within XML. Our operator enables significant aggregates of facts expressing semantic similarities. Evaluation criteria of aggregate partitions are proposed in order to assist the choice of the best partition. Furthermore, we developed a Web application for our operator. We also provide performance experiments and conduct a case study on XML documents dealing with the breast cancer research domain.

Keywords: OLAP; data warehouse; data mining; aggregation; agglomerative hierarchical clustering; evaluation of aggregates; XML documents

INTRODUCTION

Data warehouses were introduced to support decision making from huge amounts of data. A data warehouse is an analysis-oriented structure that stores a large collection of subject-oriented, integrated, time-variant and non-volatile data (Kimball, 1996; Inmon, 1996). Online analytical processing (OLAP) is a key feature supported by most data warehouse systems. Based on visualization techniques (Maniatis et al., 2005), OLAP tools enable exploration and navigation into multidimensional data views, commonly called data cubes, in order to present interesting information to end users and decision makers. A data cube is a multidimensional data model used to conceptualize data in a data warehouse (Chaudhuri & Dayal, 1997). The data cube contains facts, or cells, that are measures or values based on a set of dimensions, where each dimension consists of a set of categorical descriptors, called attributes, and may be organized within hierarchical structures. Consider for example a retail sales application where the dimensions of interest may include Customer, Product, Location, and Time. If the measure of interest in this application is sales amount, then an OLAP fact represents the sales measure corresponding to a single attribute in each of these dimensions.

Dimensions often form a hierarchy. For instance, the Time dimension may form a day-month-year hierarchy, and the Location dimension may form a city-state-region hierarchy. Dimensions allow different levels of granularity in the warehouse. For example, a region corresponds to a high level of granularity whereas a city corresponds to a low level of granularity. Classical aggregation in OLAP is considered the process of consolidating data values into a single summarized one by moving from a hierarchical level of a dimension to a higher one. Typically, additive data are well suited to be aggregated by elementary operations (Sum, Average, Max, Min and Count) in a simple computation of
measures. For example, a user wants to observe the sum of sales amount of products according to years and regions. This aggregation should use attributes to describe the targeted facts and make computation over their measures.

In recent years, as more organizations see the web as an integral part of their communication and business, we have been dealing with a proliferation of new data formats. These data are complex, quite different from classical ones, and harder to treat. They need new methodologies to be warehoused first, and then to be analyzed. XML (eXtensible Markup Language) is providing some promising solutions for integrating complex information from different sources and warehousing them. Many recent works have proposed modelling approaches for XML data warehouses (Golfarelli, Rizzi & Vrdoljak, 2001; Trujillo, Mora & Song, 2004; Pokorný, 2001; Baril & Bellahsène, 2000; Hümmer, Bauer & Harde, 2003; Rusu, Rahayu & Taniar, 2005; Nassis, Rajagopalapillai, Dillon & Rahayu, 2005). The general purpose of these approaches is to design or to feed warehouses through the XML formalism. For instance, Golfarelli et al. (2001) affirm that the use of XML will become a standard for warehousing heterogeneous and complex data in the next few years.

This evolution in the way of warehousing complex data has some drawbacks for modelling and analysis tasks. In fact, classical OLAP tools are unsuitable for complex data. For example, when treating images, sounds, videos, texts or even XML documents, aggregating information with classical OLAP does not make sense. Indeed, we are not able to compute a sum or an average over such kinds of data. However, when users analyze complex data, they need more expressive aggregates than those created from elementary computation of additive measures. We think that OLAP facts representing complex objects need appropriate tools and new ways of aggregation since we wish to analyze them.
To summarize information about complex data, we should rather gather their similar facts into a single group and separate dissimilar facts into different groups.
In this case, it is necessary to consider an aggregation by computing both descriptors and measures. Instead of grouping facts only by computing their measures, we also take their descriptors into account to obtain aggregates expressing semantic similarities. In order to do so, we intend to couple OLAP with data mining to create a new type of online aggregation of complex data.

OLAP and data mining can be viewed as two complementary fields. Associating them can be a solution to cope with their respective defects. In fact, on the one hand, when supported by database systems, OLAP has a powerful ability to organize views and structure data adapted to analysis, but it is restricted to simple navigation and exploration of data, which weakens its analysis power. On the other hand, data mining is not very powerful for organizing data, but it is known for its descriptive and predictive power, which can discover knowledge from both simple and complex data. The general issue of coupling data mining with database systems was already discussed and motivated by Imielinski and Mannila (1996). The authors argue that data mining sets new challenges to database technology. Their combination will lead to a second generation of database systems able to manage KDD (Knowledge Discovery in Databases) applications just as classical ones manage business applications. Furthermore, a data cube structure can provide a suitable context for applying data mining methods. More generally, the association of OLAP and data mining allows elaborated analysis tasks exceeding the simple exploration of a data cube. Our idea is to take advantage of OLAP as well as data mining techniques and to integrate them into the same analysis framework in order to analyze complex objects.

Although OLAP and data mining were long considered two separate fields, several recent works proved the capability of their association to provide interesting analysis processes. In addition to these works, we have already proposed in (Messaoud, Boussaid & Rabaséda, 2004) a new OLAP
operator, called OpAC (Operator for Aggregation by Clustering), that combines OLAP with an automatic clustering technique. We use the Agglomerative Hierarchical Clustering (AHC) as an aggregation strategy for complex data. We proved the interest of this new operator and its efficiency in creating semantic aggregates over an image data cube. More generally, the aggregates provided by OpAC give interesting knowledge about the analyzed domain. In this paper, we propose a generalization of our operator which deals with all types of data by handling a data cube modeled and fed directly by XML sources. In fact, since XML is able to represent and structure complex objects collected from different sources and which have different formats (Darmont, Boussaid, Bentayeb, Rabaséda & Zellouf, 2003), adapting OpAC to XML will lead to a considerable generalization of its analysis capability. In order to validate this generalization on a real world domain, we base our current study on screening mammography data taken from breast cancer research. We have structured these data as XML documents and have modeled them on a multidimensional data cube. Furthermore, we also propose some evaluation criteria that support the results of our operator. These criteria aim at helping the user choose the partition of aggregates that best fits his/her analysis requirements.

This paper is organized as follows. In the second section, we present the state of the art of works that combine OLAP and data mining. In the third section, we present an overview of our approach. We also introduce the general context, the XML screening mammography data cube, and the objectives of our operator. In the fourth section, we develop a formal background of our approach. The fifth section is a presentation of the criteria we propose to evaluate the results of our approach. In the sixth section, we describe the architecture of a Web platform, called MiningCubes, which we have developed to validate our generalized approach.
We also report some experiments concerning the performance and the processing time of this Web application. In the seventh section, we
propose a case study on the XML documents that represent a screening mammography data cube. Finally, in the eighth section, we draw conclusions from this work and propose some future research directions.

RELATED WORK TO COUPLING OLAP AND DATA MINING

The major difficulty of combining OLAP and data mining is that traditional data mining algorithms are mostly designed for tabular datasets organized in individuals-variables form (Fayyad, Shapiro, Smyth & Uthurusamy, 1996). Therefore, multidimensional data are not suited for these algorithms. Nevertheless, a lot of previous works motivated and proved an interest of coupling OLAP with data mining methods. We distinguish three major approaches in this field.

The first approach tries to extend the query language of decision support systems in order to achieve data mining tasks. The DBMiner system, proposed by Han (1998), summarizes this approach. Some extended OLAP operators perform data mining methods such as association, classification, prediction, clustering and sequencing. Han defines OLAP Mining as a new concept that integrates OLAP technology with data mining techniques and allows analysis to be performed on different portions and levels of abstraction of a data cube. He also introduces OLAM (On-Line Analytical Mining) as a process of extracting knowledge from multidimensional databases. He expects that, in the future, OLAM will be a natural addition to OLAP technology that enhances the power of multidimensional data analysis. Chen, Dayal and Hsu (2000) discover behavior patterns by mining association rules about customers from transactional e-commerce data. They extend OLAP functions and use a distributed OLAP server with a data mining infrastructure, and the resulting association rules
are represented in particular cubes called Association Rule Cubes. Goil and Choudhary (1998) think that dimension hierarchies can be used to provide interesting information at multiple concept levels. Their approach summarizes information in a data cube, extends OLAP operators and mines association rules. Some other works consist in integrating mining functions in the database system using SQL. Chaudhuri (1998) argues that data mining promises a giant leap over OLAP. He proposes a data mining system based on extending SQL and constructs data mining methods over relational databases. Chaudhuri, Fayyad and Bernhardt (1997) developed a client-server middleware that performs a decision tree classifier over MS SQL Server 7.0. Meo, Psaila and Ceri (1996) propose a model that enables a uniform description for the problem of discovering association rules. The model also extends SQL and provides an operator called MINE RULE.

The second approach consists in adapting multidimensional data inside or outside the database system and applying classical data mining algorithms on the resulting datasets. This approach can be viewed according to two strategies. The first one consists in taking advantage of a multidimensional database management system (MDBMS) in order to help the construction of learning models. In (Laurent, Bouchon-Meunier, Doucet, Gançarski & Marsala, 2000), the authors propose a cooperation between Oracle Express and a fuzzy decision tree software (Salammbô). This cooperation allows transferring learning tasks, storage constraints and data handling to the MDBMS. The second strategy transforms multidimensional data and makes them usable by data mining methods. For instance, Pinto et al. (2001) integrate multidimensional information in data sequences and apply on them the discovery of frequent patterns. In order to apply decision trees on multidimensional data, Goil and Choudhary (2001) flatten data cubes and extract a contingency matrix for each dimension at each construction step of the tree. Chen, Zhu and Chen (2001) think that OLAP should be
adopted as a pre-processing step in the knowledge discovery process. In the same context, Maedche, Hotho and Wiese (2000) combine databases with classical data mining systems by using an OLAP engine as interface and treat telecommunication data. In this interface, OLAP tools create a target data set to generate new hypotheses by applying data mining methods. Tjioe and Taniar (2005) propose a method for mining association rules in data warehouses. Based on the multidimensional data organization, this method is capable of extracting associations from multiple dimensions at multiple levels of abstraction by focusing on measurements of summarized data. In order to do this, the authors propose to prepare multidimensional data for the mining process according to four algorithms: VAvg, HAvg, WMAvg, and ModusFilter. These algorithms prune all rows in the fact table which have less than the average quantity and provide an initialized table. The latter table is used next for mining both non-hybrid (non-repeatable predicate) and hybrid (repeatable predicate) association rules. Fu (2005) proposes an algorithm, called CubeDT, for constructing decision tree classifiers based on data cubes. This algorithm works on statistic trees, which are representations of multidimensional data especially suitable for the construction of decision trees.

The third approach is rather based on adapting data mining methods and applying them directly on multidimensional data. Palpanas (2000) thinks that adapting data mining algorithms is an interesting solution to provide elaborated analysis and precious knowledge. Parsaye (1997) claims that decision-support applications must consider data mining within multiple dimensions. He proposes a theoretical OLAP Data Mining System that integrates a multidimensional discovery engine in order to perform discovery along multiple dimensions. Sarawagi, Agrawal and Megiddo (1998) propose to integrate a multidimensional regression module, called Discovery-driven, in OLAP servers. This module guides the
user to detect relevant areas at various hierarchical levels of a cube. In (Sarawagi, 2001), the author proposes another tool called iDiff. It detects both relevant areas in a data cube and the reasons of their presence. The same approach was adopted by Favero and Robin (2001) to generate quantitative analysis reports from data cubes. They integrate in a platform, called HYSSOP, a content determination component based on data mining methods. Imielinski, Khachiyan and Abdulghani (2002) propose a generalized version of association rules called Cubegrades. The authors claim that association rules can be viewed as the change of an aggregate's measure due to a change in the cube's structure. They also introduce the CGQL language for querying the Cubegrades. Dong, Han, Lam, Pei and Wang (2001) enhanced the Cubegrades and introduced constrained gradient analysis. Their proposition focuses on extracting pairs of cube cells that are quite different in aggregates and similar in dimensions. Instead of dealing with the whole cube, constraints on significance, probability, and gradient are added to limit the search range.

These previous works have proved that associating data mining to OLAP is a promising way to involve elaborated analysis tasks. They affirm that data mining methods are able to extend OLAP analysis power. In addition to these works, we have proposed in (Messaoud et al., 2004) another contribution to this field by developing an Operator for Aggregation by Clustering called OpAC. Besides enhancing classic OLAP with a clustering method, this operator also couples OLAP and data mining in order to deal with complex data in a multidimensional context. We have shown in (Messaoud et al., 2004) the interest of applying our approach on a cube of image files, and we have proven the semantic significance of its facts' aggregates. In this paper, we propose to generalize our operator and to adapt it in order to handle XML data cubes and apply it on the breast cancer domain.

OVERVIEW AND OBJECTIVES OF OUR APPROACH

Nowadays, in almost any area of scientific research or business application domain, there is an increasing availability of data. These data are growing not only in size but also in complexity. Data have different types, come from heterogeneous sources, and are supported by different formats. Analyzing and extracting features from these data is therefore a complex task. To learn from these data, we need analysis tools that can make sense of them.

OLAP is a powerful means of exploring and extracting pertinent information from data through multidimensional analysis. In this context, data are organized in multidimensional views, commonly called data cubes. The construction of a data cube targets a precise analysis context and describes real world facts. For instance, these facts can be viewed according to several dimensions such as Customer, Product, Location, and Time. The choice of these dimensions closely depends on the user and the way (s)he would like to treat the facts analysis. In addition to dimensions, an OLAP fact is also evaluated by a set of quantitative measures such as revenue, profitability, and customer retention. By organizing information into dimensions and measures, OLAP allows us to follow trends in a customer realm, spot anomalies across products, and compare annual sales in a region by product line or customer type. Furthermore, a dimension is usually organized according to several hierarchies defining various levels of data granularity. Each hierarchical level contains a set of attributes (also called members), and each attribute may conceptually include other attributes from the hierarchical level immediately below. For example, as the Location dimension may form the hierarchy city-state-region, the attribute California from the state level could include Los Angeles, Long Beach, Oakland, San Diego, and Santa Monica as attributes from the city level. Therefore,
by moving from a hierarchical level to a higher one, attributes are gathered together into aggregates. In consequence, measures related to the attributes are computed and so information is summarized to a small number of sets.

In many application domains, a user sometimes has to take critical decisions. Analysis tools should be efficient. For instance, aggregated measures need to reflect significant values of a set of facts sharing a relation deeper than a simple order of membership. In the medical domain, experts need to see aggregates of objects, like tumors or any other pathology, that have a maximum number of common medical properties. For example, in the breast cancer research field, associating malignant and benign patients in the same aggregate can have dramatic consequences.

In recent years, clinical data were widely treated by data mining techniques in medical outcome analysis (Chen & Lu, 2005; Hu et al., 2005). In fact, medicine is one of the most important application domains where a lot of effort is needed for structuring and analysing data in order to enhance sound medical research. We also propose to refer our study to an XML data cube which describes suspicious regions of tumors detected on mammography screens. We constructed this cube from the Digital Database for Screening Mammography (DDSM). In the following, we present the DDSM and the XML data cube of the screening mammography data.

Presentation of the DDSM

The DDSM is basically a resource used by the mammographic image analysis research community in order to facilitate sound research in the development of analysis and learning algorithms (Heath, Bowyer, Kopans, Moore & Jr, 2000). The database contains approximately 2,600 studies, where each study corresponds to a patient case.

Figure 1. An example of a patient case study from the DDSM

A patient case is a collection of image and text files containing medical information collected along a screening mammography exam. The DDSM contains four types of patient cases: Normal, Benign without callback, Benign, and Cancer. Normal cases are mammograms from screening exams that were read as normal and had a normal screening exam. Benign without callback cases are exams that had an abnormality that was noteworthy but did not require the patient to be recalled for any additional workup. In Benign cases, something suspicious was found and the patient was recalled for some additional workup that resulted in a benign finding. Cancer cases are those in which a proven cancer was found.

As Figure 1 shows, a case consists of a set of text and image files. There are an ics file (ASCII format) which describes general information about a patient, four LJPEG scanner files (images compressed with lossless JPEG encoding), and zero to four OVERLAY files. Only cases having suspicious regions in their scanner images are associated with overlay files; Normal cases are not. An overlay file contains information about the location, the
subtlety value, and a spatial description of the marked suspicious regions. This information is specified by an expert mammography radiologist.

The XML Cube of the Screening Mammography Data

Since a patient study is composed of several data formats and presented on heterogeneous supports, we consider it a complex object. To warehouse and analyze such complex objects, we first need to structure them and make them as homogeneous as possible. In order to do so, we use XML to represent these complex data of screening mammography and model them in a data cube.

Basically, XML is considered a standard syntax for the exchange of semistructured data. The structure of XML, composed of nested custom-defined tags, can describe the meaning of the content itself. XML documents can also be associated with and validated against either a Document Type Definition (DTD) or an XML Schema. Both of them allow describing the structure of an XML document and constraining its content. Nowadays, many works address methodologies based on XML for multidimensional design of data warehouses in order to integrate information from different sources (Golfarelli et al., 2001; Trujillo et al., 2004; Pokorný, 2001; Baril & Bellahsène, 2000; Hümmer et al., 2003; Rusu et al., 2005; Nassis et al., 2005). Since a large amount of complex data is needed in a decision making process, the importance of integrating XML in data warehousing environments is becoming increasingly high. According to Golfarelli et al. (2001), using XML sources for designing and feeding data warehouse systems will become a standard in the next few years. Furthermore, as XML sources are becoming widely employed, we naturally expect important evolutions of query languages to extract knowledge from them for decision support (Termier, Rousset & Sebag, 2002; Braga, Campi, Ceri, Klemettinen & Lanzi, 2003; Feng & Dillon, 2005).

In the case of the screening mammography data, an OLAP fact corresponds to a suspicious region (abnormality) detected by an expert. The set of collected facts concerns only Benign, Benign without callback, and Cancer patient cases. Normal cases are not concerned since they do not contain suspicious regions. As the conceptual model in Figure 2 shows, a suspicious region can be analyzed according to several axes: the lesion type, the assessment code, the subtlety, the pathology, the date of study, the digitizer, the patient age, etc. A suspicious region is measured by the length of its boundary. We have also added the number of regions having the same abnormality per patient as a derived measure to the data cube model.

Figure 2. Conceptual model of the screening mammography data cube

The conceptual model of the screening mammography data cube is described with an XML Schema. An instance of this XML Schema is presented by the XML document of Figure 3. The fact is associated with the root element of the XML Schema, whereas its dimensions correspond to sub-elements. The measures of a fact are attributes in the root element, and the attribute value of each dimension is an attribute in the element corresponding
to that dimension. The screening mammography data cube contains a collection of 4,686 XML documents, where each document corresponds to an OLAP fact.

Figure 3. Example of an XML document from the screening mammography data cube

Objectives of our Approach

In the OLAP context, the hierarchical structure of a dimension induces sets of attributes organized according to the logical order of membership. Through a dimension, a classical OLAP aggregation computes measures of facts and gathers these facts into groups according to the hierarchical order of their attributes in that dimension. For example, in the screening mammography data cube, according to the age class of patients, we can build aggregates of suspicious regions as those of Figure 4. In this example, we can note that, in a single aggregate, detected regions do not have relevant common medical properties. They have different forms and lengths of boundaries. We also note that regions of a single aggregate can
have different types of lesion. Some of them can represent benign tumors while some others are cancer. For example, according to expert annotations, suspected regions (c), (e) and (g) of 40 to 49 years old patients represent cancer tumors whereas the rest of the regions are benign. In the aggregate of 50 to 59 years old patients, an expert declares that only regions (b) and (c) are cancer. This classical aggregation is fully established in the conceptual step of the data cube. Therefore, it does not provide breast cancer experts with significant relations between suspicious regions.

Figure 4. Example of classical OLAP aggregation

We wish to build aggregates of objects having similar medical properties. In the case of the screening mammography data cube, we would like to construct more homogeneous aggregates of suspected regions of tumors. These aggregates should reflect relations between objects and help experts to extract knowledge from their common properties. The main idea of our operator OpAC is to exploit the cube's facts describing complex objects in order to provide over them a more significant aggregation. In order to do so, we use a clustering method and automatically highlight aggregates semantically richer than those provided by the current OLAP operators. So the clustering method provides a new OLAP aggregation concept. This aggregation provides hierarchical groups of objects summarizing information and enables navigation through levels of these groups. Existing OLAP tools, like
the Slicing operator, can create new restricted aggregates in a cube dimension, too. However, these tools always need manual assistance, whereas our operator is based on a clustering algorithm that automatically provides relevant aggregates. Furthermore, with classical OLAP tools, aggregates are created in an intuitive way in order to compare some measure values, whereas OpAC creates significant aggregates that express deep relations with the cube's measures. Thus, the construction of such aggregates is interesting to establish a more elaborated online analysis context.

According to the above objectives, we choose the AHC as an aggregation method. Our choice is motivated by the fact that the hierarchical aspect constitutes a relevant analogy between AHC results and hierarchical structures of dimensions. The objectives and the results expected for OpAC match the AHC strategy well. Furthermore, AHC adopts an agglomerative strategy that starts from the finest partition where each individual is considered a cluster. Therefore, OpAC results include the finest attributes of a dimension. Moreover, AHC is compatible with the exploratory aspect of OLAP. Its results can also be reused by classical OLAP operators. In fact, AHC provides several hierarchical partitions. By moving from a partition level to a higher one, two aggregates are joined together. Conversely, by moving from a partition level to a lower one, an aggregate is divided into two new ones. These operations are strongly similar to the classical operators Roll-up and Drill-down. AHC is a well suited clustering method to summarize information into OLAP aggregates from complex facts.
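
The analogy between partition levels and the Roll-up and Drill-down operators can be illustrated with a minimal Python sketch. It is not part of OpAC or MiningCubes: the function name `partition_levels`, the merge-history format and the attribute labels are assumptions made for illustration only.

```python
def partition_levels(individuals, merges):
    """Return the list of hierarchical partitions induced by a merge history.

    individuals : iterable of attribute labels (hypothetical values)
    merges      : list of (a, b) pairs of frozensets, in the order in which
                  a clustering run joined them (assumed format)
    levels[0] is the finest partition; each following level joins two aggregates.
    """
    current = {frozenset([x]) for x in individuals}
    levels = [set(current)]
    for a, b in merges:
        # Moving one level up joins two aggregates into one (Roll-up analogy).
        current = (current - {a, b}) | {a | b}
        levels.append(set(current))
    return levels


# Hypothetical lesion-type attributes and merge order.
levels = partition_levels(
    ["mass", "calcification", "other"],
    [
        (frozenset(["mass"]), frozenset(["calcification"])),
        (frozenset(["mass", "calcification"]), frozenset(["other"])),
    ],
)
roll_up = levels[1]     # one level higher than the finest partition
drill_down = levels[0]  # back to the finest partition
```

Reading the list forwards corresponds to successive roll-ups, and reading it backwards to successive drill-downs, each step splitting exactly one aggregate into two.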

FORMAL BACKGROUND OF OUR APPROACH

Individuals and Variables of the Clustering Algorithm

This formalization defines domains of individuals and variables of the clustering problem. Note that these domains are extracted from a multidimensional environment. Thus, we should respect some constraints to ensure the statistical and logical validity of the extracted data. Let $\Omega$ be the set of individuals, and $\Sigma$ be the set of variables. We also assume that:

$C$ is a data cube having $d$ dimensions and $m$ measures. According to Figure 2, the XML screening mammography data cube consists of nine dimensions and two measures; in this case $d = 9$ and $m = 2$;
$D_1, \dots, D_i, \dots, D_d$ are the dimensions of $C$. For example, in Figure 2, the Subtlety dimension corresponds to $D_3$;
$M_1, \dots, M_q, \dots, M_m$ are the measures of $C$. For example, in Figure 2, Region length corresponds to $M_1$, and Boundary length corresponds to $M_2$;
$\forall i \in \{1, \dots, d\}$, the dimension $D_i$ contains $n_i$ hierarchical levels. For instance, the Patient dimension ($D_8$) of Figure 2 is composed of two hierarchical levels. So, we note $n_8 = 2$;
$h_{ij}$ is the $j$-th hierarchical level of $D_i$, where $j \in \{1, \dots, n_i\}$;
$\forall j \in \{1, \dots, n_i\}$, the hierarchical level $h_{ij}$ contains $l_{ij}$ attributes (or members);
$g_{ijt}$ is the $t$-th attribute of $h_{ij}$, where $t \in \{1, \dots, l_{ij}\}$;
$G(h_{ij})$ is the set of attributes of $h_{ij}$.

Let us suppose that we intend to aggregate attributes from level $h_{ij}$. So the user may choose the dimension $D_i$, the hierarchical level $h_{ij}$ in $D_i$, and even select individuals in $G(h_{ij})$. We assume that selected attributes are elements of $\Omega$. Therefore, we define the set of individuals as follows:

$$\Omega \subseteq G(h_{ij}) = \{g_{ij1}, \dots, g_{ijt}, \dots, g_{ijl_{ij}}\} \quad (1)$$

Now, we adopt the following notations:

$*$ is a meta-symbol indicating the total aggregate of a dimension;
$\forall q \in \{1, \dots, m\}$, we define the measure $M_q$ as the function $M_q : G \rightarrow \mathbb{R}$.

As Formula (2) shows, $G$ is the set of $d$-tuples of all the hierarchical levels' attributes of the cube $C$, including the total aggregates of dimensions:

$$G = \prod_{i=1}^{d} \Big( \bigcup_{j \in \{1, \dots, n_i\}} G(h_{ij}) \cup \{*\} \Big) = \Big( \bigcup_{j \in \{1, \dots, n_1\}} G(h_{1j}) \cup \{*\} \Big) \times \dots \times \Big( \bigcup_{j \in \{1, \dots, n_i\}} G(h_{ij}) \cup \{*\} \Big) \times \dots \times \Big( \bigcup_{j \in \{1, \dots, n_d\}} G(h_{dj}) \cup \{*\} \Big) \quad (2)$$

For example, for the data cube of Figure 2, by using the above notations, we can say that:

$M_1(\text{calcification}, 2, *, \dots, *)$ points out the aggregated value of the length of all suspicious regions having calcification as lesion type and 2 as subtlety code;
$M_2(*, \dots, *, \text{Patient between 50 and 59 years old}, \text{lumisys laser})$ points out the number of suspicious regions of patients between 50 and 59 years old, scanned by a lumisys laser digitizer.

Recall that the objective of OpAC is to establish a semantic aggregation via a clustering technique on real data cube facts. In order to do so, we adopt the cube measures as
quantitative variables describing the individuals of $\Omega$. However, it is necessary to satisfy two fundamental constraints on variables:

First constraint. Hierarchical levels belonging to the dimension $D_i$ which is retained for the individuals cannot generate variables. In fact, describing an individual by a property which contains it does not make logical sense. Conversely, a variable which specifies a property of an individual would only describe this one;

Second constraint. In a dimension, only one hierarchical level should be selected to generate variables. This constraint ensures the independence of variables. In fact, a value taken by an attribute from a hierarchical level can be calculated from attributes' values belonging to the lower level.

Since $\Omega$ is selected, we formulate the possible extracted set of variables $\Sigma$ as defined in Formula (3):

$$\Sigma \subseteq \Big\{ V \;/\; \forall t \in \{1, \dots, l_{ij}\},\; V(g_{ijt}) = M_q(*, \dots, *, g_{ijt}, *, \dots, *, g_{srv}, *, \dots, *) \text{ with } s \neq i,\; r \text{ is unique for each } s,\; v \in \{1, \dots, l_{sr}\}, \text{ and } q \in \{1, \dots, m\} \Big\} \quad (3)$$

A user can define the set of variables by selecting dimensions $D_s$, hierarchical levels $h_{sr}$, and measures $M_q$. In order to achieve precise analysis tasks, a user may also select precise attributes $g_{srv}$ in $h_{sr}$. The selection of $g_{srv}$ depends naturally on the objectives carried out by the user.
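
To make Formula (3) concrete, the following Python sketch builds an individuals-variables table from a flat list of facts. It only illustrates the two constraints under stated assumptions and is not the OpAC implementation: the function name, the record layout, the dimension and level names, the sample values, and the use of a sum as the aggregated measure are all hypothetical.

```python
from collections import defaultdict


def individuals_variables_table(facts, individual, variable_levels, measure):
    """Build X with one row per individual g_ijt and one column per attribute g_srv.

    facts           : list of dicts mapping level names and measure names to values
    individual      : (dimension, level) pair retained for the individuals
    variable_levels : dict {dimension: level}; a dict holds a single level per
                      dimension, which mirrors the second constraint
    measure         : name of the measure M_q, aggregated here by a sum (assumption)
    """
    ind_dimension, ind_level = individual
    # First constraint: the individuals' dimension cannot generate variables.
    if ind_dimension in variable_levels:
        raise ValueError("the dimension of the individuals cannot generate variables")

    cells = defaultdict(float)
    individuals, variables = set(), set()
    for fact in facts:
        g_ijt = fact[ind_level]
        individuals.add(g_ijt)
        for level in variable_levels.values():
            column = (level, fact[level])  # plays the role of g_srv
            variables.add(column)
            cells[(g_ijt, column)] += fact[measure]

    rows, columns = sorted(individuals), sorted(variables)
    X = [[cells[(r, c)] for c in columns] for r in rows]
    return rows, columns, X


# Hypothetical facts: lesion type as individuals, subtlety codes as variables.
facts = [
    {"lesion_type": "calcification", "subtlety": 2, "boundary_length": 10.0},
    {"lesion_type": "calcification", "subtlety": 3, "boundary_length": 4.0},
    {"lesion_type": "mass", "subtlety": 2, "boundary_length": 7.0},
    {"lesion_type": "calcification", "subtlety": 2, "boundary_length": 1.5},
]
rows, columns, X = individuals_variables_table(
    facts, ("Lesion", "lesion_type"), {"Subtlety": "subtlety"}, "boundary_length"
)
```

Each cell of `X` plays the role of $M_q(*, \dots, g_{ijt}, \dots, g_{srv}, \dots, *)$: the measure aggregated over all facts sharing the individual's attribute and the variable's attribute, with every other dimension left at its total aggregate.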
The Agglomerative Hierarchical Clustering Algorithm

Once individuals and variables are selected, we can run the AHC algorithm. We note $X$ the individuals-variables table. $X$ is an $(n, p)$ matrix. Its rows represent the individuals of $\Omega$, and its columns represent the variables of $\Sigma$. We suppose that $n$ is the number of individuals, and $p$ is the number of variables. Dissimilarities between all pairs of individuals are pre-computed. Thus an $(n, n)$ dissimilarity matrix $S$ is constructed. The dissimilarity of two individuals is computed according to a distance function. Many distance functions can be used, such as the Euclidean distance. The general term of $S$ is $s_{ij}$, which corresponds to the distance between the individuals $i$ and $j$. The greater $s_{ij}$ is, the less similar individuals $i$ and $j$ are. We sum up the AHC algorithm by the following steps:

Step 1. The $n$ individuals of $X$ are assigned to $n$ distinct clusters indexed by $\{A_1, A_2, \ldots, A_n\}$;

Step 2. Two distinct clusters $A_i$ and $A_j$ are picked such that their dissimilarity measure is the smallest;

Step 3. The two clusters $A_i$ and $A_j$ are merged into a new cluster $A_{n+1}$. At each step two clusters are merged to form a new cluster. Therefore, the number of clusters is reduced by one;

Step 4. Steps 2 and 3 are repeated until the number of obtained clusters is reduced to a required number $n_c$, or the smallest dissimilarity value between clusters reaches a given threshold.
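The four steps above can be sketched in a few lines. This is an illustrative, unoptimized version under stated assumptions: it uses the Euclidean distance and the nearest-neighbor aggregation criterion, recomputes dissimilarities at each iteration instead of maintaining the matrix $S$, and the function names and sample points are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def single_link(X, A, B):
    """Nearest-neighbor dissimilarity between clusters A and B (lists of row indices)."""
    return min(euclidean(X[i], X[j]) for i in A for j in B)

def ahc(X, n_c=1, linkage=single_link):
    """Merge clusters until n_c remain; return the final clusters and the merge history."""
    clusters = [[i] for i in range(len(X))]          # Step 1: one cluster per individual
    merges = []
    while len(clusters) > n_c:                       # Step 4: repeat until n_c clusters
        best = None
        for a in range(len(clusters)):               # Step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = linkage(X, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]           # Step 3: merge the pair
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges

points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
clusters, merges = ahc(points, n_c=3)
print(clusters)
```

Passing another `linkage` function is enough to switch the aggregation criterion; the default `n_c=1` reproduces the behaviour of running the algorithm down to a single cluster.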
In the specific context of our operator OpAC, it is up to the user to choose the number $n_c$ of clusters he requires to see at the end of the AHC algorithm. Otherwise, by default, the AHC algorithm stops when it reaches a single cluster.

EVALUATION OF AGGREGATES

Recall that we propose to use AHC as an aggregation operator over the attributes of a cube dimension. For $n$ individuals to classify, the AHC generates $n$ hierarchical partitions. Like almost all unsupervised mining methods, the main defect of AHC is that it does not provide any built-in evaluation of its results. In particular, we do not have any indicator about the provided partitions of clusters. Therefore, it is quite tedious to choose the partition best suited to the analysis objectives. Furthermore, the choice of the best partition is more difficult when we deal with a great number $n$ of individuals. Usually, it is the expert who decides on the number of aggregates that corresponds both to the context and to the goal of his analysis. In the data mining literature, many efforts have provided a set of statistical measures for cluster quality evaluation. We emphasize that in our current study, the terms cluster and class refer to an OLAP aggregate provided by our operator.

Note that unsupervised clustering methods lack a universal criterion of cluster quality. Any measure of cluster quality in this field closely depends on the way it is computed. It also depends on the orientations of the user's analysis (Lamirel, François, Shehabi & Hoffmann, 2004). Hence, for our operator, we propose to use more than one quality criterion. The comparison of several criteria seems mandatory in order to study the quality of the resulting aggregates and to decide on the best partition according to the user's requirements. In the following, we present the intra and inter-clusters inertias (Lebart, Morineau & Fénelon, 1982) and the Ward's method (Ward, 1963) that we used as criteria to measure the quality of aggregates obtained by OpAC. In addition to these two criteria, we also propose a
new criterion based on the separability of classes (Zighed, Lallich & Muhlenbach, 2002). In order to formulate these criteria, we assume the following notations:

$\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$ is the set of individuals to cluster;

each individual $\omega_i$ takes the weight $P(\omega_i)$, and it is described by $p$ numerical variables $V_1, V_2, \ldots, V_p$;

let $k \in \{0, \ldots, n-1\}$ be the index of AHC iterations (or partitions). $k = 0$ corresponds to the initial AHC partition where each individual represents a single cluster. In general, an iteration $k$ corresponds to a partition with $(n - k)$ clusters;

in iteration $k$, clusters $A_i$ and $A_j$ are merged together, and we move from the partition $k - 1$ to the partition $k$. $A_1, A_2, \ldots, A_{n-k}$ represents the current partition of $\Omega$;

$n_i$ is the size of the cluster $A_i$, i.e. the number of individuals in $A_i$;

$\forall i \in \{1, \ldots, n-k\}$, the cluster $A_i$ takes the weight $P(A_i) = \sum_{\omega \in A_i} P(\omega)$;

$G(A_i) = \frac{1}{P(A_i)} \sum_{\omega \in A_i} P(\omega) V(\omega)$ is the gravity center of $A_i$;

$G = \sum_{\omega \in \Omega} P(\omega) V(\omega)$ is the gravity center of $\Omega$;

$d$ is the Euclidean distance, and $d^2$ is the squared Euclidean distance.

Intra and Inter-Clusters Inertias

These criteria derive from the classical measures of inertia (Lebart et al., 1982). They consist of:
minimizing the intra-cluster distances, i.e. the distance between individuals within a cluster;

maximizing the inter-cluster distances, i.e. the distance between the gravity centers of the clusters.

For a given subset of individuals $A_i$, the intra-cluster inertia is defined as:

$$I(A_i) = \sum_{\omega \in A_i} P(\omega)\, d(V(\omega), G(A_i)) \tag{4}$$

The total intra-clusters inertia of a partition $k$ is defined by the sum of the inertias of its $(n - k)$ subsets:

$$I_{intra}(k) = \sum_{i=1}^{n-k} I(A_i) \tag{5}$$

The inter-clusters inertia is defined by the weighted sum of distances between the gravity center of $\Omega$ and the gravity centers of all the subsets $A_i$ of the partition $k$:

$$I_{inter}(k) = \sum_{i=1}^{n-k} P(A_i)\, d(G(A_i), G) \tag{6}$$

According to the theorem of Huygens, for each partition, the sum of the two inertias is constant and equal to the inertia of $\Omega$:

$$\forall k \in \{0, \ldots, n-1\}, \quad I_{intra}(k) + I_{inter}(k) = I(\Omega) \tag{7}$$

The intra-cluster inertia (respectively inter-clusters inertia) is an increasing (respectively decreasing) function of the index of partitions $k$. Remember that the iteration $k$ corresponds to a partition with $(n - k)$ aggregates. Therefore, the intra-cluster inertia is a decreasing function of the number of aggregates. While moving from one partition to another, a remarkable break point of the intra or inter-clusters inertia is an indicator for the choice of the number of aggregates. Through these criteria, we help the user to reach a better compromise between the minimization of the intra-clusters inertia, the
maximization of the inter-clusters inertia, the number of aggregates, the significance of the aggregates, and the analysis objectives.

The intra and inter-clusters inertias may present some limits since they have a monotonic general trend. We also propose to use the Ward's method, which is another way of evaluating the AHC results by measuring the merging cost when moving from one partition to another.

Ward's Method

The Ward's method, proposed in (Ward, 1963), constructs a criterion that considers what happens to the sum of squared deviations from the gravity centers of two merged clusters $A_i$ and $A_j$. This merging cost amounts to calculating the squared Euclidean distance between the gravity centers of the merged clusters, weighted according to their respective sizes, at each AHC iteration. The formula of this criterion is written as follows:

$$W(A_i, A_j) = \frac{n_i\, n_j}{n_i + n_j}\, d^2(G(A_i), G(A_j)) \tag{8}$$

At each AHC iteration, this criterion measures the variation of internal inertia when two clusters are merged together. Recall that the aim is to find a partition whose clusters are as homogeneous as possible. This leads to minimizing the internal inertia of clusters. Therefore, when the Ward's method provides a high criterion value at iteration $k$, it implies a great variation of internal inertia when moving from a partition $k - 1$ to a partition $k$. This variation is an indicator that helps users to prefer the previous partition $k - 1$, which corresponds to $(n - k + 1)$ aggregates. In general, the Ward's method provides more than one relevant variation in a hierarchical clustering. Once again, it is up to users to choose the partition that provides the best solution to the analysis objectives.
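The inertia criteria and the Ward merging cost can be checked numerically on a toy partition. The sketch below makes three assumptions that are not stated in the text: inertias are computed with the squared Euclidean distance $d^2$ (the decomposition of Formula (7) holds exactly for squared distances), gravity centers are normalized by the total weight of the set they summarize, and all individuals carry a unit weight. The sample points and function names are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points, weights):
    """Weighted gravity center, normalized by the total weight (assumption)."""
    total = sum(weights)
    return [sum(w * p[c] for p, w in zip(points, weights)) / total
            for c in range(len(points[0]))]

def inertias(X, P, partition):
    """Return (intra, inter) inertia of a partition given as lists of row indices."""
    g_all = centroid(X, P)
    intra = inter = 0.0
    for cluster in partition:
        pts = [X[i] for i in cluster]
        w = [P[i] for i in cluster]
        g = centroid(pts, w)
        intra += sum(wi * sq_dist(p, g) for p, wi in zip(pts, w))
        inter += sum(w) * sq_dist(g, g_all)
    return intra, inter

def ward_cost(X, A, B):
    """Merging cost of Formula (8) for clusters A and B with unit weights."""
    g_a = centroid([X[i] for i in A], [1.0] * len(A))
    g_b = centroid([X[i] for i in B], [1.0] * len(B))
    return len(A) * len(B) / (len(A) + len(B)) * sq_dist(g_a, g_b)

X = [[0, 0], [0, 2], [4, 0], [4, 2], [10, 1]]
P = [1.0] * len(X)
before = [[0, 1], [2, 3], [4]]
after = [[0, 1, 2, 3], [4]]

intra_b, inter_b = inertias(X, P, before)
intra_a, inter_a = inertias(X, P, after)
cost = ward_cost(X, [0, 1], [2, 3])
print(intra_b + inter_b, intra_a + inter_a)   # same total inertia for both partitions
print(intra_a - intra_b, cost)                # the Ward cost is the increase of intra inertia
```

Under these assumptions the Ward cost of a merge coincides with the increase of the intra-clusters inertia, which is why a high value at iteration $k$ points back to partition $k - 1$.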
Note that the two previous criteria are mainly related to the principle of inertia, which measures the homogeneity of clusters. In order to provide a complementary way of evaluating aggregates, we propose a new alternative criterion that rather measures the quality of aggregates according to the property of separability of classes (Zighed et al., 2002).

Separability Based Criterion

This criterion is derived from the method of separability of classes originally introduced by Zighed et al. (2002). It starts by constructing a neighborhood graph for the whole set of objects to aggregate. A neighborhood graph, also called a proximity graph, is a visual presentation which displays the overall arrangement of individuals in their representation space. In such a graph, individuals are represented by points, and two points are connected by an edge if they are, by a certain measure, close together. Specifically, two points are linked together if there are no other points in a certain forbidden region defined by these two points.

The Gabriel graph is a particular case of neighborhood graphs proposed in (Gabriel & Sokal, 1969). It has been studied in the field of classification as a way to edit and condense large data sets. In the Gabriel graph, two points A and B are connected if their diametral sphere (i.e. the sphere such that AB is its diameter) does not contain any other points. Figure 5 (a) shows a plane representation of a Gabriel graph constructed on a set of objects described by two variables $X_1$ and $X_2$.
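The diametral sphere test translates directly into a comparison of squared distances: a point C lies strictly inside the sphere of diameter AB exactly when $d^2(A, C) + d^2(B, C) < d^2(A, B)$. The following brute-force sketch applies this test to every pair of points. It is an illustration rather than the construction used by the authors, the sample points are hypothetical, and points lying exactly on the sphere are assumed not to block an edge.

```python
from itertools import combinations

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def gabriel_edges(points):
    """Return the edges (i, j), i < j, whose diametral sphere contains no other point."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # Point c is strictly inside the sphere of diameter (i, j)
        # iff d2(i, c) + d2(j, c) < d2(i, j).
        blocked = any(
            sq_dist(points[i], points[c]) + sq_dist(points[j], points[c]) < d_ij
            for c in range(len(points)) if c not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges

points = [(0, 0), (2, 0), (1, 0.2), (1, 3)]
print(gabriel_edges(points))
```

In this example the point of index 2 sits between points 0 and 1, so the edge (0, 1) is rejected while both points stay connected through point 2.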
Figure 5: Principle of the separability based criterion

We assume that $g_\Omega$ is the Gabriel graph constructed on the whole set $\Omega$ of individuals. At each AHC iteration $k \in \{0, \ldots, n-1\}$, our criterion consists in building for each constructed cluster $A_i$ ($i \in \{1, \ldots, n-k\}$) its own Gabriel graph, noted $g_{A_i}$. Remark that:

$$\bigcup_{i=1}^{n-k} \{ g_{A_i} \} \neq g_\Omega$$

In fact, in a partition of individuals, the union of the sub-graphs of its clusters $A_i$ ($i \in \{1, \ldots, n-k\}$) does not correspond to the whole graph of $\Omega$.

Let $e_{ij}$, also noted $\{\omega_i - \omega_j\}$, be the edge that connects two individuals $\omega_i$ and $\omega_j$ in a neighborhood graph. Each edge $e_{ij}$ can be associated with a weight $P(e_{ij})$ given by the inverse of the Euclidean distance that separates its connected points $\omega_i$ and $\omega_j$.
$$P(e_{ij}) = P(\{\omega_i - \omega_j\}) = \frac{1}{d(\omega_i, \omega_j)} \tag{9}$$

The weight associated with edges allows us to quantify the importance of each connection in a neighborhood graph. In fact, two points separated by a large distance are easily separable, so their connection is relatively weak. Conversely, two close points are less separable, and their connection is quite strong. In a simple case, we can also consider that all connections in a neighborhood graph have the same separability level. Hence, we associate the same weight ($P(e_{ij}) = 1$) with all the edges of the graph.

For each AHC iteration, the separability based criterion consists in computing the sum of the newly built connections for the Gabriel graphs of the clusters $A_i$ ($i \in \{1, \ldots, n-k\}$). Let $\xi^k$ be the set of the newly built edges at iteration $k$ of the AHC. For example, according to Figure 5, at the iteration $k = 3$, the cluster 2 is merged with the cluster {3, 4}. The newly built connections in this case are $\{2 - 3\}$ and $\{2 - 4\}$. Therefore, we note $\xi^3 = \{\{2 - 3\}, \{2 - 4\}\}$. Let $J(k)$ be the sum of the new connections of Gabriel graphs built at iteration $k$. $J(k)$ is written according to the following formula:

$$J(k) = \sum_{e \in \xi^k} P(e) \tag{10}$$

Our criterion aims at evaluating the separability of clusters for each AHC partition. Two clusters are more separable when they are connected via a small number of edges with weak connections. Nevertheless, the importance of newly built edges at each iteration should also take into account the current number of clusters. Thus, the formula of our separability based criterion is written as follows:

$$S(k) = \frac{J(k)}{n - k} = \frac{\sum_{e \in \xi^k} P(e)}{n - k} \tag{11}$$
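Formulas (9) to (11) can be sketched for a single merge as follows. This is an interpretation, not the authors' code: it assumes that the Gabriel graph of a cluster is built on the points of that cluster only, that the new edges $\xi^k$ are the edges of the merged cluster's graph that belong to neither of the two former graphs, and that points lying exactly on a diametral sphere do not block an edge. Function names and sample points are hypothetical.

```python
import math
from itertools import combinations

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def gabriel_edges(X, members):
    """Gabriel graph restricted to the points of one cluster (lists of row indices)."""
    edges = set()
    for i, j in combinations(sorted(members), 2):
        d_ij = sq_dist(X[i], X[j])
        if not any(sq_dist(X[i], X[c]) + sq_dist(X[j], X[c]) < d_ij
                   for c in members if c not in (i, j)):
            edges.add((i, j))
    return edges

def separability(X, A, B, clusters_after, weighted=False):
    """Return (J, S) for the merge of clusters A and B, following Formulas (10) and (11)."""
    new_edges = gabriel_edges(X, A + B) - gabriel_edges(X, A) - gabriel_edges(X, B)
    if weighted:
        # Formula (9): weight of an edge is the inverse of its Euclidean length.
        j_value = sum(1.0 / math.sqrt(sq_dist(X[i], X[j])) for i, j in new_edges)
    else:
        j_value = float(len(new_edges))   # equal weights P(e) = 1
    return j_value, j_value / clusters_after

# Four individuals; clusters {0}, {1, 2}, {3} before the merge of {0} with {1, 2}.
X = [(0, 0), (2, 0), (1, 2), (10, 10)]
J, S = separability(X, [0], [1, 2], clusters_after=2)
Jw, Sw = separability(X, [0], [1, 2], clusters_after=2, weighted=True)
print(J, S, Jw, Sw)
```

Here the merge creates the two edges linking individual 0 to individuals 1 and 2, which mirrors the example given for iteration $k = 3$ of Figure 5.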
$S(k)$ computes, per cluster, the ratio of newly built edges when AHC merges two clusters by moving from partition $(k - 1)$ to $k$. In the criterion formula, we divide $J(k)$ by $(n - k)$ in order to get a relative evaluation of separability according to the current number of clusters. When $J(k)$ has a relatively low value compared to other partitions, it means that, by moving from the partition $(k - 1)$ to the partition $k$, weak connections are built, and therefore the merged clusters are quite separable. So, the user may prefer to select the partition $(k - 1)$ rather than the partition $k$.

For example, Figure 5 (c) displays the process of building edges of the Gabriel graph at each iteration of the AHC provided in Figure 5 (b). We suppose in this example that all connections have the same weight ($P(e_{ij}) = 1$). This example also provides the number of built edges $J(k)$ and the criterion value $S(k)$ at each step. We note that $S(k)$ marks a relatively low value for the partition $k = 5$. This can help the user to select the previous partition ($k = 4$) with six separable clusters.

IMPLEMENTATION AND EXPERIMENTAL RESULTS

To validate our approach, we have developed a Web based platform called MiningCubes. We have included in this platform an implementation of OpAC. In the following, we detail the architecture of this Web application and present some performance experiments that we have conducted on it.

Architecture of the Web Application

MiningCubes contains a set of OLAP modules such as a connection to classical data cubes via MS SQL Server 2000/Analysis Services, a connection to XML data cubes, and an exploration of multidimensional data. In addition to these OLAP tools, we have also integrated analysis modules based on data mining methods. Among these, we developed a
module for our operator OpAC which is composed of four components: a Data loader component, which loads data from Analysis Services of MS SQL Server 2000 or directly from XML documents, a Parameter setting interface, a Clustering component that provides aggregates of objects, and an Aggregates evaluation component that measures the pertinence of partitions of aggregates according to the criteria presented in the previous section. Figure 6 shows the general architecture of the OpAC module. In the following, we detail the functions of each component.

Figure 6: General architecture of the OpAC module

The data loader component. This component connects and loads information about the structure (labels of dimensions, hierarchical levels and measures),
and the content of a data cube. It can work either on a data cube stored in the Analysis Services of MS SQL Server 2000 or directly on XML data cubes. To connect to a data cube on Analysis Services, the data loader component uses MDX (Multidimensional Expressions) queries to import information about the cube's structure. In the case of a connection to an XML data cube, the component uses the MSXML DOM (Document Object Model) to parse the XML schema that represents the conceptual model of the data cube. The DOM is also used to load the data of the cube from its corresponding XML documents. As the application is based on Web technology, a user should enter, in a Web form, a cube name, its XML schema and its corresponding XML documents (see Figure 7). The application automatically loads the XML schema and the XML documents on the Web server.

The parameter setting interface. This component assists the user in extracting both individuals and variables from a data cube. It enables navigation into hierarchical levels of dimensions, selection of attributes $g_{ijt}$ for individuals, and selection of attributes $g_{srv}$ and measures $M_q$ for the variables of the clustering problem. It also assists the user in respecting the constraints which we have defined in the previous formalization.

The clustering component. The clustering component enables the selection of the dissimilarity measure and the aggregation criterion. We implemented four dissimilarity measures (the Euclidean Distance, the Squared Euclidean Distance, the Manhattan Distance, and the Chebyshev Distance), and seven aggregation criteria (the Ward's criterion, the Nearest Neighbor criterion, the Furthest Neighbor criterion, the Average Distance criterion, the McQueen's
criterion, the Median Clustering criterion, and the Centroid Clustering criterion). Once the user selects the dissimilarity measure and the aggregation criterion, the clustering component constructs the AHC model, and plots its results within a dendrogram.

The aggregates evaluation component. This component computes at each step of the AHC the criteria presented in the previous section. For each constructed partition, this component calculates the inter and intra-clusters inertias, and the separability based criterion. When AHC moves from one partition to the next one, this component also calculates the sum of squared deviations according to the Ward's method. At the end of the AHC, the aggregates evaluation component plots the results of these criteria within graphs. Each graph presents the curve of a criterion according to partitions. This component gives an idea of the quality of AHC partitions. It also helps the user to decide on the best number of aggregates he wants to consider.

Figure 7: An XML data cube loaded by MiningCubes
Performances of the Web Application

We have experimentally evaluated the performance of our Web application on datasets of XML documents. We have constructed these datasets by random sampling from the whole collection of OLAP facts of the screening mammography data cube presented in the third section (see footnote 2). Recall that this data cube contains 4 686 OLAP facts, where each fact is represented by an XML document, as shown in the example of Figure 3. The current experiments measure processing times for different situations of input data and parameters of our operator OpAC supported by the Web application MiningCubes. We conducted these series of experiments under Windows XP on a 1.60 GHz PC with 480 MB of RAM and an Intel Pentium 4 processor.

Figure 8: (a) Effect of the number of XML documents on DOM parsing time. (b) Effect of the number of XML documents on AHC processing time

We have measured the running time of the data loader component for loading XML documents and for constructing an XML data cube. The running time of the DOM parser is summarized by the curve of Figure 8 (a). The general trend of the curve shows that the parsing time increases linearly with the number of XML documents. Note that these experiments were carried out on localhost, so in a real client/server architecture, in addition to the parsing time, we should also take into account the communication time of the network used.
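A comparable measurement can be reproduced with any DOM implementation. The sketch below substitutes Python's standard `xml.dom.minidom` parser for MSXML and generates its own documents; the element names `fact`, `dimension` and `measure` are hypothetical and do not reproduce the schema of the screening mammography cube.

```python
import time
from xml.dom.minidom import parseString

def make_fact(i):
    """Build one hypothetical XML fact document (layout assumed for illustration)."""
    return (
        f"<fact id='{i}'>"
        "<dimension name='Lesion type'>calcification</dimension>"
        f"<measure name='Region length'>{i}</measure>"
        "</fact>"
    )

def parse_all(documents):
    """Parse each document with the DOM and extract its measure value."""
    values = []
    for doc in documents:
        dom = parseString(doc)
        node = dom.getElementsByTagName("measure")[0]
        values.append(float(node.firstChild.data))
    return values

def timed_parse(n_docs):
    """Return the elapsed parsing time in seconds and the extracted values."""
    docs = [make_fact(i) for i in range(n_docs)]
    start = time.perf_counter()
    values = parse_all(docs)
    return time.perf_counter() - start, values

for n_docs in (100, 200, 400):
    elapsed, values = timed_parse(n_docs)
    print(n_docs, len(values), round(elapsed, 4))
```

Since each document is parsed independently, the elapsed time is expected to grow roughly in proportion to the number of documents, which is the trend reported for Figure 8 (a).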
We also evaluated the processing time of the clustering component. According to Figure 8 (b), the processing time of AHC increases polynomially with the number of documents. Indeed, there are two main expensive steps in agglomerative clustering. The first one corresponds to the computation of the pairwise dissimilarity between all the documents. Let $n$ be the number of XML documents to cluster; the complexity of this step is $O(n^2)$. The second step is the repeated selection of the pair of most similar clusters. During the iteration $k$, the AHC algorithm requires $O((n-1)^2)$ time. This leads to an overall complexity of $O(n^3)$. Nevertheless, in the OLAP context, we should note that we usually deal with data cube dimensions with a relatively small number of attributes. In addition, in the context of our operator, this complexity is mitigated since a user focuses on a targeted analysis with a precise, and small, number of facts to aggregate. In the next section, we introduce a real case study on the XML screening mammography data cube.

APPLICATION ON THE XML SCREENING MAMMOGRAPHY DATA CUBE

To illustrate the results of our operator, we propose to run it on the screening mammography data cube presented in Figure 2. We suppose that a user needs to create aggregates from the attributes of the Scanner name level ($h_{71}$) of the Scanner image dimension ($D_7$). We suppose that (s)he selects from $G(h_{71})$ a set of 36 mammogram scanners. Figure 9 shows the set of the selected individuals $\Omega$.
Figure 9: The set $\Omega$ of mammogram scanners selected as individuals

In order to generate variables, suppose that the user selects the attributes of the level Lesion type name ($h_{12}$) from the Lesion type dimension ($D_1$) and the measure Region length ($M_1$). According to Formula (3), the set of variables is:

$$\Sigma = \{ V \;/\; \forall \omega \in \Omega,\; V(\omega) = M_1(a, \ast, \ldots, \ast, \omega, \ast, \ast) \} \quad \text{where } a \in h_{12} \text{ and } \omega \in \Omega \tag{12}$$

In more explicit terms, according to the available data in this case study, the set $\Sigma$ contains the following 11 variables:

$V_1$: Boundary length of suspicious region with calcification type amorphous
$V_2$: Boundary length of suspicious region with calcification type lucent center
$V_3$: Boundary length of suspicious region with calcification type pleomorphic
$V_4$: Boundary length of suspicious region with calcification type punctate
$V_5$: Boundary length of suspicious region with calcification type skin
$V_6$: Boundary length of suspicious region with calcification type vascular
$V_7$: Boundary length of suspicious region with mass shape asymmetric breast tissue
$V_8$: Boundary length of suspicious region with mass shape irregular
$V_9$: Boundary length of suspicious region with mass shape lobulated
$V_{10}$: Boundary length of suspicious region with mass shape oval
$V_{11}$: Boundary length of suspicious region with mass shape round

Now, suppose that the user wants to construct aggregates. If (s)he selects the Euclidean distance as a dissimilarity metric and the Ward's criterion as an aggregation strategy, (s)he will obtain the dendrogram of Figure 10. The user will also obtain evaluation curves of partitions according to the criteria proposed in the fifth section.

Note that the obtained dendrogram is not easy to interpret. A breast cancer expert would not be able to decide on the best number of aggregates that fits with his/her analysis objectives. In this case, the evaluation criteria included in our operator may provide additional help for the analysis process. Figure 11 shows the resulting graphs of the evaluation criteria according to the number of AHC iterations $k$.
Figure 10: Dendrogram generated by OpAC

We notice that all criteria show remarkable leaps for small numbers of aggregates. Generally, it is not meaningful to choose partitions with a very high or a very small number of aggregates. In fact, a single cluster (including the whole set of individuals) or $n$ clusters (where each cluster contains a single individual) are two insignificant partitions for the analysis.
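Reading a criterion curve for leaps while ignoring the extreme partitions can be automated in a simple way. The following sketch is only one possible heuristic and is not part of OpAC as described here: it returns the iteration with the largest increase of a criterion among the partitions whose number of aggregates lies within user-chosen bounds. The bounds, the function name and the sample curve are hypothetical.

```python
def largest_leap(criterion, min_aggregates=2, max_aggregates=None):
    """Return (k, jump) for the largest increase criterion[k] - criterion[k - 1].

    criterion[k] is the criterion value at iteration k = 0 .. n - 1, and
    iteration k corresponds to a partition with n - k aggregates.
    """
    n = len(criterion)
    if max_aggregates is None:
        max_aggregates = n - 1
    best_k, best_jump = None, float("-inf")
    for k in range(1, n):
        aggregates = n - k
        if not (min_aggregates <= aggregates <= max_aggregates):
            continue  # skip partitions with too few or too many aggregates
        jump = criterion[k] - criterion[k - 1]
        if jump > best_jump:
            best_k, best_jump = k, jump
    return best_k, best_jump

# Hypothetical criterion curve for n = 8 individuals.
curve = [0.0, 0.25, 0.5, 0.5, 1.5, 1.75, 4.0, 9.0]
k, jump = largest_leap(curve, min_aggregates=3, max_aggregates=6)
print(k, jump, len(curve) - k)
```

Such a helper only points at candidate iterations; the final choice still depends on the criterion considered and on the analysis objectives of the user.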
Figure 11: (a) Intra and inter-clusters inertia criteria (b) Ward's criterion (c) Separability based criterion (equal edge weights) (d) Separability based criterion (weighted edges) according to the growth of AHC iterations

Recall that the choice of the best partition depends on the evaluation criterion. It also depends on the analysis objectives of the user. In this case study, suppose that a breast cancer expert needs to define aggregates of similar suspicious regions. Remember again that an iteration $k$ corresponds to a number of aggregates equal to $(n - k)$. By taking into account the previous analysis objective, an expert can choose the iteration $k = 30$, which corresponds to 6 aggregates (36 - 30). In fact, in Figure 11 (a), the intra-cluster inertia marks a relevant increase when it moves from the iteration $k = 29$ to $k = 30$. The Ward's method, in Figure 11 (b), shows an increasing tendency when we move from iteration $k = 30$ to $k = 31$. As we need to minimize the Ward's indicator, the user can prefer the partition of iteration $k = 30$. We also see that the two separability based criteria of Figures 11 (c) and 11 (d) have relevant peaks in iteration $k = 30$, which come after low values in iteration $k = 29$. Furthermore, knowing
that in this analysis context the clustered suspicious regions have three types of pathologies (Benign, Benign without callback, and Cancer), a breast cancer expert may need to get aggregates that are as homogeneous as possible according to the type of pathology. Table 1 shows the 6 aggregates of suspicious regions that correspond to iteration $k = 30$. We can see in this table that we obtain some aggregates with homogeneous pathologies. For example, Aggregate 1 and Aggregate 6 consist of 100% Benign suspicious regions, whereas Aggregate 3 and Aggregate 4 consist of 100% Cancer suspicious regions. These results may provide knowledge about similarities of suspicious regions inside each aggregate. Of course, the choice of variables is quite important for the final results. The more the variables are correlated with the semantic similarity we need to see in the final aggregates, the more homogeneous the aggregates we obtain according to that similarity.

Table 1: Aggregates of suspicious regions of the AHC iteration k = 30

Aggregate 1: 5 suspicious regions; 100% Benign
    A_1548_1.RIGHT_CC.LJPEG, A_1497_1.RIGHT_CC.LJPEG, A_1471_1.RIGHT_CC.LJPEG,
    A_1491_1.RIGHT_CC.LJPEG, A_1472_1.RIGHT_MLO.LJPEG

Aggregate 2: 2 suspicious regions; 50% Benign, 50% Benign without callback
    A_1513_1.LEFT_CC.LJPEG, B_3216_1.RIGHT_CC.LJPEG

Aggregate 3: 4 suspicious regions; 100% Cancer
    C_0085_1.RIGHT_CC.LJPEG, C_0128_1.RIGHT_MLO.LJPEG, C_0126_1.RIGHT_MLO.LJPEG,
    C_0170_1.LEFT_MLO.LJPEG

Aggregate 4: 1 suspicious region; 100% Cancer
    C_0218_1.LEFT_MLO.LJPEG

Aggregate 5: 20 suspicious regions; 30% Benign, 35% Benign without callback, 35% Cancer
    C_0126_1.RIGHT_MLO.LJPEG, A_1512_1.LEFT_CC.LJPEG, A_1482_1.RIGHT_CC.LJPEG,
    A_1546_1.RIGHT_CC.LJPEG, A_1496_1.LEFT_CC.LJPEG, B_3183_1.RIGHT_CC.LJPEG,
    A_1542_1.RIGHT_MLO.LJPEG, B_3159_1.LEFT_CC.LJPEG, B_3167_1.LEFT_CC.LJPEG,
    B_3170_1.LEFT_CC.LJPEG, A_1479_1.LEFT_CC.LJPEG, C_0031_1.RIGHT_CC.LJPEG,
    B_3232_1.LEFT_CC.LJPEG, B_3176_1.RIGHT_CC.LJPEG, C_0158_1.LEFT_MLO.LJPEG,
    C_0139_1.RIGHT_MLO.LJPEG, C_0180_1.LEFT_CC.LJPEG, C_0187_1.LEFT_CC.LJPEG,
    C_0156_1.RIGHT_CC.LJPEG, B_3245_1.LEFT_MLO.LJPEG

Aggregate 6: 4 suspicious regions; 100% Benign
    B_3185_1.LEFT_MLO.LJPEG, B_3228_1.RIGHT_MLO.LJPEG, B_3160_1.RIGHT_CC.LJPEG,
    B_3186_1.LEFT_MLO.LJPEG
CONCLUSION

In this work, we propose a new online analysis approach for complex data such as texts, images, sounds and videos. Our approach is based on coupling OLAP with data mining. The association of the two fields can be a solution to cope with their respective defects. We have created OpAC, which is a new OLAP aggregation operator based on an automatic clustering method. Unlike classical OLAP operators, our proposal enables precise analysis and provides semantic aggregates of complex objects. In this paper, we have generalized this approach and adapted OpAC to XML data cubes. Nowadays, XML is becoming a promising solution for warehousing complex data. We provide a formalization of our operator and define the set of individuals and variables that a user can select from a data cube. We propose criteria to evaluate the results of our operator. These criteria help users to select the best partition of aggregates that fits with their analysis requirements.

Our approach is developed under a Web environment according to a client/server architecture that takes into account data cubes modelled according to XML sources. OpAC is implemented in a general OLAP platform called MiningCubes. We have conducted some experiments on the developed application to evaluate the performance and the complexity of our operator. These experiments showed the efficiency of our approach and its capability to handle XML sources. We have also validated our approach through a case study on an XML data cube taken from the breast cancer domain. This application has shown the interest of OpAC in a real world domain where decisions are quite important and sometimes critical.

This work has shown the interest of associating OLAP and data mining in order to enhance the power of online analysis. We believe that, in the future, this association will provide a new generation of efficient OLAP operators adapted to complex data. For future work, many issues need to be addressed. First, we need to think about an automatic approach to
warehouse complex data within XML format. This warehousing step should not simply transform complex objects into XML documents and store them in a repository. It should also prepare data and represent them in an interesting way, suitable for analysis and adapted to user requirements. The second issue is to provide OLAP with a predictive power by associating it with a suitable data mining technique such as decision trees or association rules. The third deals with the formalization of an algebra that defines a general framework of new operators that couple OLAP and data mining. This algebra should establish a generic formal background adapted to both classical and new OLAP operators.

REFERENCES

Baril, X., & Bellahsène, Z. (2000). A View Model for XML Documents. In Proceedings of the 6th International Conference on Object Oriented Information Systems (OOIS 2000), London, UK, December, (pp. 429-411).

Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P. (2003). Discovering Interesting Information in XML Data with Association Rules. In Proceedings of the 18th Symposium on Applied Computing, Florida, USA, March, (pp. 450-454).

Chaudhuri, S. (1998). Data Mining and Database Systems: Where is the Intersection? Data Engineering Bulletin, 21(1), 4-8.

Chaudhuri, S., & Dayal, U. (1997). An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, 26(1), 65-74.

Chaudhuri, S., Fayyad, U., & Bernhardt, J. (1999). Scalable Classification over SQL Databases. In Proceedings of the 15th International Conference on Data Engineering (ICDE 1999), Sydney, Australia, March, (pp. 470-479).
Chen, M., Zhu, Q., & Chen, Z. (2001). An Integrated Interactive Environment for Knowledge Discovery from Heterogeneous Data Resources. Information and Software Technology, 43, 487-496.

Chen, S. Y., & Liu, X. (2005). Data mining from 1994 to 2004: an application-orientated review. International Journal of Business Intelligence and Data Mining, Inderscience Publishers, 1(1), 4-21.

Chen, Q., Dayal, U., & Hsu, M. (2000). An OLAP-based Scalable Web Access Analysis Engine. In Proceedings of the 2nd International Conference on Data Warehousing and Knowledge Discovery (DAWAK'2000), London, UK, September, (pp. 210-223).

Darmont, J., Boussaid, O., Bentayeb, F., & Zellouf, Y. (2003). Web Multiform Data Structuring for Warehousing. Multimedia Systems and Applications, 22, 179-194.

Dong, G., Han, J., Lam, J.M.W., Pei, J., & Wang, K. (2001). Mining Multi-Dimensional Constrained Gradients in Data Cubes. In Proceedings of 27th Very Large Data Bases Conference (VLDB 2001), Rome, Italy, September, (pp. 321-330).

Fayyad, U.M., Shapiro, G.P., Smyth, P., & Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.

Feng, L., & Dillon, T. (2005). An XML-enabled data mining query language: XML-DMQL. International Journal of Business Intelligence and Data Mining, Inderscience Publishers, 1(1), 22-41.

Fu, L. (2005). Novel Efficient Classifiers Based on Data Cube. International Journal of Data Warehousing and Mining, Idea Group Inc., 1(3), 15-27.

Gabriel, K.R., & Sokal, R.R. (1969). A New Statistical Approach to Geographic Variation Analysis. Systematic Zoology, 18, 259-278.
Goil, S., & Choudhary, A. (1998). High Performance Multidimensional Analysis and Data Mining. In Proceedings of High Performance Networking and Computing Conference (SC'98), Orlando, USA, November.

Goil, S., & Choudhary, A. (2001). PARSIMONY: An Infrastructure for parallel Multidimensional Analysis and Data Mining. Journal of Parallel and Distributed Computing, 61, 285-321.

Golfarelli, M., Rizzi, S., & Vrdoljak, B. (2001). Data Warehouse Design from XML Sources. In Proceedings of the 4th ACM International Workshop on Data Warehousing and OLAP (DOLAP 2001), Atlanta, Georgia, USA, November, (pp. 40-47).

Han, J. (1998). Toward On-line Analytical Mining in Large Databases. SIGMOD Record, 27, 97-107.

Hu, X., Song, I-Y., Han, H., Yoo, I., Prestrud, A. A., Brennan, M. F., & Brooks, A. D. (2005). Temporal rule induction for clinical outcome analysis. International Journal of Business Intelligence and Data Mining, Inderscience Publishers, 1(1), 122-136.

Heath, M., Bowyer, K., Kopans, D., Moore, R., & Jr, P.K. (2000). The Digital Database for Screening Mammography. In Proceedings of the 5th International Workshop on Digital Mammography, Toronto, Canada, June.

Hümmer, W., Bauer, A., & Harde, G. (2003). XCube: XML for Data Warehouses. In Proceedings of the 6th ACM International Workshop on Data Warehousing and OLAP (DOLAP 2003), New Orleans, Louisiana, USA, (pp. 33-40).

Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing Association Rules. Data Mining and Knowledge Discovery, 6(3), 219-257.

Imielinski, T., & Mannila, H. (1996). A Database Perspective on Knowledge Discovery. Communications of the ACM, 39, 58-64.

Inmon, W.H. (1996). Building the Data Warehouse. John Wiley & Sons.
Kimball, R. (1996). The Data Warehouse Toolkit. John Wiley & Sons.

Lamirel, J.C., François, C., Shehabi, A.S., & Hoffmann, M. (2004). New Classification Quality Estimators for Analysis of Documentary Information: Application to Patent Analysis and Web Mapping. Scientometrics, 60(3), 445-562.

Laurent, A., Bouchon-Meunier, B., Doucet, A., Gançarski, S., & Marsala, C. (2000). Fuzzy Data Mining from Multidimensional Databases. In Proceedings of the International Symposium on Computational Intelligence (ISCI 2000), Kosice, Slovakia, (pp. 278-283).

Lebart, L., Morineau, A., & Fénelon, J.P. (1982). Statistical Data Processing (French edition). Dunod, Paris.

Maedche, A., Hotho, A., & Wiese, M. (2000). Enhancing Preprocessing in Data-Intensive Domains using Online-Analytical Processing. In Proceedings of the 2nd International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2000), London, UK, September, (pp. 258-264).

Maniatis, A., Vassiliadis, P., Skiadopoulos, S., Vassiliou, Y., Mavrogonatos, G., & Michalarias, I. (2005). A Presentation Model & Non-Traditional Visualization for OLAP. International Journal of Data Warehousing and Mining, Idea Group Inc., 1(1), 1-36.

Meo, R., Psaila, G., & Ceri, S. (1996). A New SQL-like Operator for Mining Association Rules. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB 1996), Bombay, India, September, (pp. 122-133).

Messaoud, R. B., Boussaid, O., & Rabaséda, S. (2004). A New OLAP Aggregation Based on the AHC Technique. In Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP (DOLAP 2004), Washington D.C., USA, November, (pp. 65-72).
Nassis, V., Rajagopalapillai, R., Dillon, T.S., & Rahayu, J.W. (2005). Conceptual and Systematic Design Approach for XML Document Warehouses. International Journal of Data Warehousing and Mining, Idea Group Inc., 1(3), 63-87.
Palpanas, T. (2000). Knowledge Discovery in Data Warehouses. SIGMOD Record, 29, 88-100.
Parsaye, K. (1997). OLAP and Data Mining: Bridging the Gap. Database Programming and Design, 10, 30-37.
Pinto, H., Han, J., Pei, J., Wang, K., Chen, Q., & Dayal, U. (2001). Multi-dimensional Sequential Pattern Mining. In Proceedings of the 10th ACM International Conference on Information and Knowledge Management (CIKM 01), Atlanta, USA, November.
Pokorný, J. (2001). Modelling Stars Using XML. In Proceedings of the 4th ACM International Workshop on Data Warehousing and OLAP (DOLAP 2001), Atlanta, Georgia, USA, November, (pp 24-31).
Robin, J., & Favero, E. (2001). HYSSOP: Natural Language Generation Meets Knowledge Discovery in Databases. In Proceedings of the 3rd International Conference on Information Integration and Web-based Applications and Services (iiWAS 2001), Linz, Austria, September.
Rusu, L. I., Rahayu, J. W., & Taniar, D. (2005). A Methodology for Building XML Data Warehouses. International Journal of Data Warehousing and Mining, Idea Group Inc., 1(2), 23-48.
Sarawagi, S. (2001). iDiff: Informative Summarization of Differences in Multidimensional Aggregates. Data Mining and Knowledge Discovery, 5, 213-246.
Sarawagi, S., Agrawal, R., & Megiddo, N. (1998). Discovery-driven Exploration of OLAP Data Cubes. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT 1998), Valencia, Spain, March.
Termier, A., Rousset, M., & Sebag, M. (2002). TreeFinder: a First Step towards XML Data Mining. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM 02), Maebashi City, Japan, December, (pp 450-457).
Tjioe, H.C., & Taniar, D. (2005). Mining Association Rules in Data Warehouses. International Journal of Data Warehousing and Mining, Idea Group Inc., 1(3), 28-62.
Trujillo, J., Luján-Mora, S., & Song, I.Y. (2004). Applying UML and XML for Designing and Interchanging Information for Data Warehouses and OLAP Applications. Journal of Database Management, 15(1), 41-72.
Ward, J. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58, 236-244.
Zighed, D., Lallich, S., & Muhlenbach, F. (2002). A Statistical Approach for Separability of Classes. In Statistical Learning, Theory and Applications, Paris, France, November.

1 http://marathon.csee.usf.edu/mammography/database.html
2 XML documents of the screening mammography data cube are available at: http://eric.univ-lyon2.fr/~rbenmessaoud/?page=donnees&section=3