Mining Multiple Large Data Sources




The International Arab Journal of Information Technology, Vol. 7, No. 3, July 2010

Animesh Adhikari 1, Pralhad Ramachandrarao 2, Bhanu Prasad 3, and Jhimli Adhikari 4
1 Department of Computer Science, S. P. Chowgule College, India
2 Department of Computer Science and Technology, Goa University, India
3 Department of Computer and Information Sciences, Florida A&M University, USA
4 Department of Computer Science, Narayan Zantye College, India

Abstract: Effective data analysis using multiple databases requires highly accurate patterns. Local pattern analysis might extract low quality patterns from multiple large databases. Thus, it is necessary to improve mining multiple databases using local pattern analysis. We present existing specialized as well as generalized techniques for mining multiple large databases. We formalize the idea of multi-database mining using local pattern analysis and propose a new generalized technique for mining multiple large databases. It improves the quality of synthesized global patterns significantly. We conduct experiments on both real and synthetic databases to judge the effectiveness of the proposed technique.

Keywords: Multi-database mining, pipelined feedback technique, synthesis of patterns.

Received December 2, 2008; accepted February 8, 2009

1. Introduction
Due to a liberal economic policy adopted by many countries across the globe, the number of branches of a multi-national company as well as the number of multi-national companies is increasing over time. Moreover, the economies of many countries are growing at a faster rate. As a result, the number of multi-branch companies within a country is also increasing. Many of these companies collect a huge amount of data through different branches. Thus, many of them possess multiple databases. Most of the previous pieces of data mining work are based on a single database. Thus, it is necessary to study data mining on multiple databases. Many large companies operate from a number of branches located at different geographical regions.
Each branch collects data continuously, and local data get stored locally. Thus, the collection of all branch databases might be large. Many decisions of a multi-branch company are based on data stored over the branches. The challenge lies in making good quality decisions based on a large volume of data distributed over the branches. This creates risks but also offers opportunities. One of the risks is a significant investment in hardware and software to deal with multiple large databases. The goal of this paper is to improve mining multiple large databases.

Based on the number of data sources, patterns in multiple databases could be classified into three categories: local patterns, global patterns, and patterns that are neither local nor global. A pattern based on a single database is called a local pattern. Local patterns are useful for local data analysis and decision making problems [, ]. On the other hand, global patterns are based on all the databases under consideration. They are useful for global data analyses [2, 2] and global decision making problems.

In this paper, we propose a new multi-database mining technique, called Pipelined Feedback Technique (PFT), for mining / synthesizing global patterns in multiple databases. The rest of the paper is organized as follows. We formalize the idea of multi-database mining using local pattern analysis in section 2. In section 3, we discuss existing generalized multi-database mining techniques. Also, we discuss existing specialized multi-database mining techniques in section 4. We propose a new multi-database mining technique for mining multiple databases in section 5. We define error of an experiment in section 6. In section 7, we provide experimental results using both synthetic and real databases.

2. Multi-Database Mining Using Local Pattern Analysis
Consider a large company that deals with multiple large databases. For mining multiple databases, there are three situations, viz.:
a. Each of the local databases is small, so that a Single Database Mining Technique (SDMT) could mine the union of all databases.
b. At least one of the local databases is large, so that a SDMT could mine every local database, but fail to mine the union of all local databases.
c. At least one of the local databases is very large, so that a SDMT fails to mine every local database.

We face challenges in handling cases (b) and (c). These challenges are due to the large size of some local databases. The first question that comes to mind is whether a traditional data mining technique [4, 6] could provide a good solution in dealing with multiple large databases. To apply a traditional data mining technique one needs to amass all the branch databases together. A traditional data mining technique might not provide a good solution due to the following reasons.
- It might not be suitable, as one might have to invest heavily in hardware and software to deal with a large volume of data.
- A single computer might take an unreasonable amount of time to mine a huge amount of data.
- It is difficult to identify local patterns if a traditional data mining technique is applied on the collection of local databases.
Thus, a traditional data mining technique might not be suitable in this situation. It is a different problem, and it needs to be dealt with in a different way.

Zhang et al. [4] designed a Multi-Database Mining Technique (MDMT) using local pattern analysis. Multi-database mining using local pattern analysis could be classified into two categories, viz., the techniques that analyze local patterns and the techniques that analyze approximate local patterns. A multi-database mining technique using local pattern analysis could be viewed as a two-step process τ + ξ, explained as follows:
- Mine each local database using a SDMT by applying a technique τ (Step 1).
- Synthesize patterns using an algorithm ξ (Step 2).
We use the notation MDMT: τ + ξ to represent a multi-database mining technique using a technique of mining τ and a synthesizing algorithm ξ.

We can apply sampling techniques [] for taming a large volume of data. If an itemset is frequent in a large dataset then it is likely to be frequent in the sampled dataset. Thus, we can mine patterns approximately in a large dataset by analyzing patterns in a representative sampled dataset. There are two categories of multi-database mining techniques, viz., specialized and generalized multi-database mining techniques.

3. Generalized Multi-database Mining Techniques
In this section, we discuss existing generalized multi-database mining techniques.
These techniques could be used in a variety of multi-database mining applications.

3.1. Local Pattern Analysis
Under this model of mining multiple databases, each branch is required to mine its database using a traditional data mining technique. Afterwards, each branch is required to forward the pattern base to the central office. Then the central office could process the pattern bases collected from different branches for synthesizing the global patterns or making some global decisions. Adhikari and Rao [2] have proposed an extended model of local pattern analysis. The proposed extended model has a set of interfaces and a set of layers. Each interface is a set of operations that produces dataset(s) (or knowledge) based on the dataset(s) at the next lower layer. The functions of the interfaces are described below.

Interface 2/1 applies different operations on data at the lowest layer. By applying these operations, we get a processed database from a local (original) database. These operations are performed on each branch database. Interface 3/2 applies a filtering algorithm on each processed database to separate relevant data from outlier data. In particular, if we are interested in studying the durable items then the transactions containing only non-durable items could be treated as outlier transactions. Interface 4/3 mines local patterns in each local data warehouse. There are two types of local patterns: local patterns and suggested local patterns. A suggested local pattern comes close to satisfying the requisite interestingness criteria but fails to do so. The reasons for considering suggested patterns are as follows. Firstly, one could synthesize patterns more accurately. Secondly, due to the stochastic nature of transactions, the number of suggested patterns could be significant in some databases. Thirdly, a suggested pattern of one database tends to become a local pattern in another database. Thus, the correctness of synthesizing global patterns would increase as the number of local patterns increases.
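As a rough illustration of the local pattern analysis model described above, the following Python sketch shows a branch mining its own transactions and a central office collecting the reported supports. It is a minimal sketch under stated assumptions, not the authors' implementation: the brute-force miner stands in for a traditional data mining technique, support is the only interestingness measure, and the names `mine_branch`, `collect_reports` and the parameter `suggest_margin` are hypothetical. In particular, the paper does not say how "close" a suggested local pattern must be to the threshold; the margin used here is an assumption.

```python
from collections import Counter
from itertools import combinations


def mine_branch(transactions, min_support, suggest_margin, max_size=2):
    """Return (LPB, SPB) for one branch as dicts mapping itemset -> support.

    Brute-force counting of itemsets with at most max_size items; this is a
    stand-in for a real single database mining technique.
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    lpb, spb = {}, {}
    for itemset, c in counts.items():
        supp = c / n
        if supp >= min_support:
            lpb[itemset] = supp          # local pattern
        elif supp >= min_support - suggest_margin:
            spb[itemset] = supp          # suggested local pattern (assumed rule)
    return lpb, spb


def collect_reports(pattern_bases):
    """Central office: for each itemset, list (branch index, support) reports."""
    reports = {}
    for i, lpb in enumerate(pattern_bases):
        for itemset, supp in lpb.items():
            reports.setdefault(itemset, []).append((i, supp))
    return reports


if __name__ == "__main__":
    branch_1 = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
    branch_2 = [["a"], ["b", "c"]]
    lpb_1, spb_1 = mine_branch(branch_1, min_support=0.6, suggest_margin=0.3)
    lpb_2, spb_2 = mine_branch(branch_2, min_support=0.6, suggest_margin=0.3)
    print(collect_reports([lpb_1, lpb_2]))
```

In this toy run the second branch reports no local patterns at all, which is the situation where a synthesizing algorithm has to estimate or ignore the missing supports.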
Let there be n databases of a multi-branch company. Also, let $LPB_i$ and $SPB_i$ be the local pattern base and suggested local pattern base for the i-th branch, respectively, for i = 1, 2, ..., n. Interface 5/4 synthesizes global patterns or analyses local patterns to meet real life challenges.

Various data preparation techniques [8] like data cleaning, data transformation, data integration, and data reduction are applied to data in the local databases. We get the processed database $PD_i$ corresponding to the original database $D_i$, for i = 1, 2, ..., n. Then we retain all the data that are relevant to the data mining applications. Using a relevance analysis, one can detect outlier data [7] in a processed database. A relevance analysis is dependent on the context and varies from one application to another. Let $OD_i$ be the outlier database corresponding to the i-th branch, for i = 1, 2, ..., n. After removing outlier data from the processed database we get the desired data warehouse, and the data in a data warehouse become ready for the data mining task. Let $W_i$ be the data warehouse corresponding to the i-th branch, for i = 1, 2, ..., n. Local patterns for the i-th branch are extracted from $W_i$, for i = 1, 2, ..., n. Finally, the local patterns are forwarded to the central office for synthesizing global patterns, or analysis of local patterns. Figure 1 illustrates a model of synthesizing global patterns from local patterns in different databases.

In particular, if we are interested in synthesizing global frequent itemsets then an itemset may not get extracted from all the databases. It is required to estimate or ignore the support of an itemset in a database that fails to report it. Thus, a global frequent itemset synthesized from local frequent itemsets is approximate in nature. If any one of the local databases is too large to apply a traditional data mining technique then this model would fail. In this situation, we can apply an appropriate sampling technique to reduce the size of a large local database. Otherwise, the database can be partitioned into sub-databases. As a result, the error of synthesizing a pattern would increase.

Figure 1. A model of synthesizing global patterns from local patterns in different databases.

Though the above model introduces many layers and interfaces for synthesizing global patterns, many of these layers and interfaces might be absent in a real life application. The patterns returned by local pattern analysis are approximate. They might differ considerably from exact global patterns.

3.2. Partition Algorithm
For the purpose of mining multiple databases, one can apply the Partition Algorithm (PA) proposed by Savasere et al. [9]. The algorithm is designed for mining a very large database by partitioning. The algorithm works as follows. It scans a database twice. The database is divided into disjoint partitions, where each partition is small enough to fit in memory. In the first scan, the algorithm reads each partition and computes locally frequent itemsets in each partition using the apriori algorithm [4]. In the second scan, the algorithm counts the supports of all locally frequent itemsets toward the complete database. In this case, each local database can be considered as a partition. Though the partition algorithm mines frequent itemsets in a database exactly, it is an expensive solution to mining multiple large databases, since each database has to be scanned twice.

3.3. IdentifyExPattern Algorithm
Zhang et al. [3] have proposed the algorithm IdentifyExPattern (IEP) for identifying global exceptional patterns in multi-databases. Every local database is mined separately at Random Order (RO) using a SDMT for synthesizing global exceptional patterns. For identifying global exceptional patterns in multiple databases, the following pattern synthesizing algorithm has been proposed. The support of a pattern in a local database is assumed to be zero if the pattern does not get reported. Let $supp_a(p, DB)$ and $supp_s(p, DB)$ be the actual (i.e., apriori) support and synthesized support of pattern p in database DB, respectively. Let D be the union of all local databases. Then the support of pattern p has been synthesized in D based on the following formula:

$$ supp_s(p, D) = \frac{1}{num(p)} \sum_{i=1}^{num(p)} \frac{supp_a(p, D_i) - \alpha}{1 - \alpha} \tag{1} $$

where num(p) is the number of databases that report p at a given minimum support level (α). The size (i.e., the number of transactions) of a local database and the support of an itemset in a local database seem to be important parameters for determining the presence of an itemset in a database, since the number of transactions containing the itemset X in a database $D_i$ is equal to $supp(X, D_i) \times size(D_i)$. The major concern is that the algorithm IEP does not consider the size of a local database to synthesize the global support of a pattern.

3.4. Rule Synthesizing Algorithm
Wu and Zhang [2] have proposed the Rule Synthesizing (RS) algorithm for synthesizing high-frequent association rules in multiple databases. Using this technique, every local database is mined separately at Random Order (RO) using a SDMT for synthesizing high-frequent association rules. The support of a rule in a local database is assumed to be zero if the rule does not get reported. Based on the association rules in different databases, the authors have estimated weights of different databases. Let $w_i$ be the weight of the i-th database, for i = 1, 2, ..., n. Without any loss of generality, let the association rule r be extracted from the first m databases, for 1 ≤ m ≤ n. $supp_a(r, D_i)$ has been assumed as 0, for i = m + 1, m + 2, ..., n.
Then the support of r in D has been synthesized as follows:

$$ supp_s(r, D) = w_1 \times supp_a(r, D_1) + \dots + w_m \times supp_a(r, D_m) \tag{2} $$

Algorithm RS is an indirect approach for synthesizing association rules in multiple databases. Thus, the time complexity of the algorithm is reasonably high. The algorithm executes in $O(n^4 \times maxnosrules \times totalrules^2)$ time, where n, maxnosrules, and totalrules are the number of data sources, the maximum among the numbers of association rules extracted from different databases, and the total number of association rules in different databases, respectively.

4. Specialized Multi-Database Mining Techniques
For finding a solution to a specific application, it might be possible to devise a better multi-database mining technique. In this section, we present two specific multi-database mining techniques.

4.1. Mining Multiple Real Databases
Adhikari and Rao [2] have proposed the Association-Rule-Synthesis (ARS) algorithm for synthesizing association rules in multiple real databases. The algorithm uses the model in Figure 1, but with a specific rule synthesizing process explained as follows. For real databases, the trend of the customers' behaviour exhibited in one database is usually present in other databases. In particular, a frequent itemset in one database is usually present in some transactions of other databases even if it does not get extracted. The estimation procedure captures such a trend and estimates the support of a missing association rule. Without any loss of generality, let an itemset X be extracted from the first m databases, for 1 ≤ m ≤ n. Then the trend of X in the first m databases could be expressed as follows.

$$ trend_{1,m}(X \mid \alpha) = \frac{1}{\sum_{i=1}^{m} |D_i|} \times \sum_{i=1}^{m} \left( supp_a(X, D_i) \times |D_i| \right) \tag{3} $$

We can use the trend of X in the first m databases for synthesizing the support of X in D. We estimate the support of X in database $D_j$ by $\alpha \times trend_{1,m}(X \mid \alpha)$, for j = m + 1, m + 2, ..., n. Then the synthesized support of X could be computed as follows.

$$ supp_s(X, D) = \frac{trend_{1,m}(X \mid \alpha)}{\sum_{i=1}^{n} |D_i|} \times \left[ (1 - \alpha) \times \sum_{i=1}^{m} |D_i| + \alpha \times \sum_{i=1}^{n} |D_i| \right] \tag{4} $$

The Association-Rule-Synthesis algorithm might return approximate global patterns.

4.2. Mining Multiple Databases for the Purpose of Studying a Set of Items
Adhikari and Rao [3] have proposed a technique for mining patterns of a set of specific items in multiple databases. Many important decisions are based on a set of specific items called the select items. A large section of a local database is irrelevant in providing a solution to this problem, since it involves studying select items in multiple databases.
Thus, we divide database $D_i$ into $FD_i$ and $RD_i$, where $FD_i$ and $RD_i$ are called the Forwarded Database and Remaining Database corresponding to the i-th branch respectively, for i = 1, 2, ..., n. We are interested in the forwarded databases, since every transaction in a forwarded database contains at least one select item. The database $FD_i$ is forwarded to the central office for mining global patterns of the select items under consideration, for i = 1, 2, ..., n. All the local forwarded databases are amassed into a single database FD for the purpose of the mining task. The model of mining global patterns of select items could be explained using the following steps:
1. Each branch office constructs the forwarded database and sends it to the central office.
2. Also, each branch extracts patterns from its local database.
3. The central office clubs these forwarded databases into a single database FD.
4. A traditional data mining technique could be applied to extract patterns from FD.
5. The global patterns of select items could be extracted effectively from local patterns and the patterns extracted from FD.

At interface 3/2, we apply an algorithm to partition a local database into two parts, viz., forwarded database and remaining database. In the following paragraph, we discuss how to construct $FD_i$, for i = 1, 2, ..., n.

Initially, $FD_i$ is kept empty. Let $T_{ij}$ be the j-th transaction of $D_i$, for j = 1, 2, ..., $|D_i|$. For $D_i$, a for-loop on j would run $|D_i|$ times. At the j-th iteration, the transaction $T_{ij}$ is tested. If $T_{ij}$ contains at least one of the select items then $FD_i$ is updated by $FD_i \cup \{T_{ij}\}$. At the end of the for-loop on j, $FD_i$ gets constructed. A traditional data mining algorithm could be applied at the interface 5/4 to extract patterns in FD. Let PB be the pattern base returned by a traditional data mining algorithm. Since the database FD is not large, one can further lower the values of user-defined inputs, like minimum support and minimum confidence, so that PB could contain more patterns of select items. Therefore, we get a better analysis of select items.
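The construction of the forwarded databases described above amounts to a single pass over each branch database. A minimal Python sketch, assuming each transaction is represented as a set of item labels; the function names `split_forwarded` and `amass` are hypothetical and not taken from the paper:

```python
def split_forwarded(transactions, select_items):
    """Partition one branch database D_i into (FD_i, RD_i).

    A transaction goes to the forwarded database if it contains at least
    one select item; otherwise it goes to the remaining database.
    """
    select = set(select_items)
    forwarded, remaining = [], []
    for t in transactions:              # the for-loop on j, |D_i| iterations
        if select & set(t):             # T_ij contains a select item
            forwarded.append(t)
        else:
            remaining.append(t)
    return forwarded, remaining


def amass(forwarded_databases):
    """Central office: club all FD_i into the single database FD."""
    return [t for fd in forwarded_databases for t in fd]


if __name__ == "__main__":
    d1 = [{"tv", "cable"}, {"milk", "bread"}, {"tv"}]
    d2 = [{"bread"}, {"fridge", "milk"}]
    fd1, rd1 = split_forwarded(d1, {"tv", "fridge"})
    fd2, rd2 = split_forwarded(d2, {"tv", "fridge"})
    print(amass([fd1, fd2]))
```

Because only the forwarded transactions leave a branch, the amassed database stays small relative to the union of all branch databases, which is what allows the thresholds to be lowered at the central office.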
If we wish to study the association between a select item and other frequent items then the exact support values of other items might not be available in PB. Then the central office sends a request to each branch office to forward the details (like support values) of some items that would be required to study the select items. Thus, each branch then applies a traditional mining algorithm (at interface 3/2) on its local database and forwards the details of local patterns requested by the central office. Let $LPB_i$ be the details of the i-th local pattern base requested by the central office, for i = 1, 2, ..., n. A global mining application of select items is required to access local patterns and patterns in PB. Thus, a global mining application (interface 6/5) can be developed based on the patterns in PB and $LPB_i$, for i = 1, 2, ..., n. The model of mining global patterns of select items is efficient due to the following reasons:
- We can extract more patterns of select items by further lowering the input parameters like minimum support and minimum confidence, based on the level of data analysis of select items, since FD is reasonably small.
- We get the exact global patterns of select items as there is no need of estimating them.
Thus, the quality of global patterns is high.

Figure 2. A model of mining global patterns of select items from multiple databases.

5. Mining Multiple Databases Using Pipelined Feedback Technique
Before applying the pipelined feedback technique, one needs to prepare data warehouses at different branches of a multi-branch organization. Let $W_i$ be the data warehouse corresponding to the i-th branch, for i = 1, 2, ..., n. Then the local patterns for the i-th branch are extracted from $W_i$, for i = 1, 2, ..., n. We mine each data warehouse using a SDMT. In Figure 3, we propose a new technique of mining multiple databases.

Figure 3. Pipelined feedback technique of mining multiple databases.

In PFT, $W_1$ is mined using a SDMT and the local pattern base $LPB_1$ is extracted. While mining $W_2$, all the patterns in $LPB_1$ are extracted irrespective of their values of interestingness measures like minimum support and minimum confidence. Apart from these patterns, some new patterns that satisfy user-defined threshold values of interestingness measures are also extracted. In general, while mining $W_i$, all the patterns in $LPB_{i-1}$ are mined irrespective of their values of interestingness measures, along with some new patterns that satisfy user-defined threshold values of interestingness measures, for i = 2, 3, ..., n. Due to this nature of mining each data warehouse, PFT is called a feedback technique. Thus, $LPB_{i-1} \subseteq LPB_i$, for i = 2, 3, ..., n. There are n! arrangements of pipelining for n databases. All the arrangements of data warehouses might not produce the same mining result. If the number of local patterns increases, we get more accurate global patterns and a better analysis of local patterns. An arrangement of data warehouses would produce a near optimal result if $|LPB_n|$ is maximal.
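The feedback step described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: brute-force counting over itemsets of at most `max_size` items stands in for the SDMT, support is the only interestingness measure, the warehouses are mined in the order given, and the names `support`, `mine_with_feedback` and `pipelined_feedback` are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions (frozensets) that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def mine_with_feedback(transactions, min_support, previous_patterns, max_size=2):
    """Mine one warehouse W_i given the previous pattern base LPB_(i-1).

    Every pattern of LPB_(i-1) is kept regardless of its support in W_i;
    new patterns are kept only if they reach min_support.
    """
    tsets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*tsets))
    candidates = {
        frozenset(c)
        for k in range(1, max_size + 1)
        for c in combinations(items, k)
    }
    lpb = {}
    for x in candidates | set(previous_patterns):
        s = support(x, tsets)
        if x in previous_patterns or s >= min_support:
            lpb[x] = s
    return lpb


def pipelined_feedback(warehouses, min_support, max_size=2):
    """Return [LPB_1, ..., LPB_n], each a dict mapping itemset -> local support."""
    bases, previous = [], {}
    for w in warehouses:
        previous = mine_with_feedback(w, min_support, previous, max_size)
        bases.append(previous)
    return bases


if __name__ == "__main__":
    w1 = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
    w2 = [{"a", "b"}, {"c"}, {"c"}, {"c"}]
    for lpb in pipelined_feedback([w1, w2], min_support=0.5):
        print(sorted((sorted(x), s) for x, s in lpb.items()))
```

In the toy run, the itemsets found in the first warehouse are reported again by the second warehouse with their actual (low) supports, so the pattern bases are nested and nothing has to be estimated for those itemsets later.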
Let sze(w ) be the sze of W (n bytes), for =, 2,, n. We shall follow the followng rule of thumb regardng the arrangements of data warehouses for the purpose of mnng. The number of patterns n W s greater than or equal to the number of patterns n W -, f sze(w ) sze(w - ), for = 2, 3,, n. For the purpose of ncreasng number of local patterns, W precedes W - n the ppelned arrangement of mnng data warehouses f sze(w ) sze(w - ), for = 2, 3,, n. Fnally, we analyze the patterns n LPB, LPB 2,, and LPB n for syntheszng global patterns, or analyzng local patterns. Let W be the collecton of all branch data warehouses. For syntheszng global patterns n W we dscuss here a smple pattern syntheszng (SPS) algorthm. Wthout any loss of generalty, let the temset X be extracted from frst m databases, for m n. Then syntheszed support of X n W could be obtaned as follows: m supps ( X, W ) = [ suppa ( X, W ) W ] n (5) = W = In the followng, we propose a new algorthm for mnng multple databases. The algorthm s based on the ppelned feedback technque presented n Fgure 4..25.2.5..5.5.52.54.56.58.6.62 Mnmum support.64.66.68 Fgure 4. vs. α for experments usng dataset T. Algorthm : mne multple data warehouses usng ppelned feedback technque. procedure PpelnedFeedbackTechnque (W, W 2,, W n ) Input: W, W 2,, W n Output: local pattern bases for = to n do 2 f W does not ft n memory then 3 partton W nto W, W 2,, and Wp for an nteger p ; 4 else W = W ; 5 end f 6 end for 7 sort data warehouses on sze n non-ncreasng order and the data warehouses are renamed as DW, DW 2,, DW N, where N = n = p ; 8 let LPB = φ; 9 for = to N do mne DW usng a SDMT wth nput LPB - ;

246 The Internatonal Arab Journal of Informaton Technology, Vol. 7, No. 3, July 2 end for 2 return LPB, LPB 2,, LPB N ; In above algorthm, the usage of LPB - durng mnng DW has been explaned above. Once a pattern s extracted from a data warehouse, then t also gets extracted from the remanng data warehouses. Thus, the algorthm PpelnedFeedbackTechnque mproves syntheszed patterns as well as an analyss of local patterns sgnfcantly. 6. Error of an Experment To evaluate MDMT:, one needs to measure the amount of error of the experments. An experment mnes frequent temsets n multple databases usng PFT, and then syntheszes global patterns usng SPS algorthm. One needs to fnd how the global syntheszed support dffers from the exact (apror) support of an temset. In PFT, we have LPB - LPB, for = 2, 3,, n. Then, patterns n LPB - LPB - are generated from databases D, D +,, D n. We assume supp a (X, D j ) =, for each X LPB - LPB -, for = 2, 3,. Thus, the error of mnng X could be defned as follows. E( X PFT, SPS) n = suppa( X, D) - n j= D for X LPB - LPB j= - j [ supp ( X, D ) D ] and = 2, 3,..., n. Also, E(X PFT,SPS)=, for X LPB. (6) There are several ways one could defne error of an experment. We have defned followng two types of error of an experment.. Average Error () ( D, α) = LPB+ n (LPB-LPB = - ) 2 X [LPB+ n = 2 (LPB -LPB - )] E(X PFT, SPS) 2. Maxmum Error (ME) ME(D, α) = maxmum{ E(X PFT,SPS), for X {LPB n + (LPB = 2 a - LBP - )}} j j, (7) (8) supp a (X, D) s obtaned by mnng D usng a tradtonal data mnng technque, for =, 2,, m. supp s (X, D) s obtaned by SPS, for =, 2,, m. 7. Experments We have carred out several experments to study the effectveness of the proposed technque. All the experments have been mplemented on a 2.8 GHz Pentum D dual core processor wth 52 MB of memory usng vsual C++ (verson 6.) software. We present expermental results usng synthetc database TI4DK (T) [5] and two real databases Retal (R) [5] and BMS-Web-Wew- (B) [5]. 
The databases random500 (R1) and random1000 (R2) are generated synthetically for the purpose of conducting experiments. We present some characteristics of these databases in Table 1.

Table 1. Database characteristics.
D NT ALT AFI NI
T,,.228 276.243 87 R 88,62.3575 99.6738 B,49,639 2. 55.776 922 R, 6.47 9.4 5 R2, 2.4856.85785

Let NT, ALT, AFI, and NI denote the number of transactions, average length of a transaction, average frequency of an item, and number of items in a database, respectively. The error of synthesizing an itemset in multiple databases is relative to the following parameters: the number of transactions, the number of items, and the length of transactions in the given databases. If the number of transactions in a database increases, the error of synthesizing itemsets increases, provided the other two parameters remain constant. If the length of transactions of a database increases, the error of synthesizing itemsets is likely to increase, provided the other two parameters remain constant. Lastly, if the number of items increases, the error of synthesizing itemsets is likely to decrease, provided the other two parameters remain constant.

Each of the above databases is divided into 10 databases for the purpose of carrying out experiments. The databases obtained from T, R, B, R1, R2 are named as $T_i$, $R_i$, $B_i$, $R1_i$, $R2_i$ respectively, for i = 0, 1, ..., 9. The databases $T_i$, $R_i$, $B_i$, $R1_i$, $R2_i$ are called input DataBases (DBs), for i = 0, 1, ..., 9. Some characteristics of these input databases are presented in Table 2. In Table 3, we present some outputs for the purpose of showing that the proposed technique significantly improves the mining results. Also, we have performed experiments using other MDMTs on these databases for the purpose of comparing with MDMT: PFT + SPS. Each of the Figures 4, 5, 6, 7 and 8 shows average error against different values of α. From these figures, one could conclude that AE normally increases as α increases. The number of databases reporting a pattern decreases as α increases. Thus, the AE of synthesizing patterns normally increases as α increases.
Fgures 5 to 8 show that MDMT: produces more accurate mnng result among all the technques that scan each database only once.
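The database characteristics NT, ALT, AFI, and NI, and the division of a database into 10 input databases, can be sketched in Python as follows. This is only an illustrative sketch: the function names `characteristics` and `partition`, the representation of a transaction as a set of items, and the choice of contiguous equal-sized blocks (with the last block absorbing the remainder) are assumptions, not details given in the paper.

```python
from collections import Counter


def characteristics(transactions):
    """Return (NT, ALT, AFI, NI) for a list of transactions (each a set of items)."""
    nt = len(transactions)                                  # NT: number of transactions
    item_counts = Counter(item for t in transactions for item in t)
    ni = len(item_counts)                                   # NI: number of distinct items
    total = sum(len(t) for t in transactions)               # total item occurrences
    alt = total / nt if nt else 0.0                         # ALT: average transaction length
    afi = total / ni if ni else 0.0                         # AFI: average frequency of an item
    return nt, alt, afi, ni


def partition(transactions, parts=10):
    """Split into `parts` contiguous blocks (assumed scheme); the last block takes the remainder."""
    size = len(transactions) // parts
    blocks = [transactions[k * size:(k + 1) * size] for k in range(parts - 1)]
    blocks.append(transactions[(parts - 1) * size:])
    return blocks


# Hypothetical toy database, used only to exercise the two functions.
db = [{"a", "b"}, {"a"}, {"b", "c", "d"}, {"a", "c"}]
nt, alt, afi, ni = characteristics(db)
inputs = partition(list(range(25)), parts=10)
```

Each block returned by `partition` plays the role of one input database, and `characteristics` can be applied to it to obtain a row in the style of Table 2.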

Figure 5. AE vs. α for experiments using dataset R (x-axis: minimum support).
Figure 6. AE vs. α for experiments using dataset B (x-axis: minimum support).
Figure 7. AE vs. α for experiments using dataset R1 (x-axis: minimum support).
Figure 8. AE vs. α for experiments using dataset R2 (x-axis: minimum support).

Table 2. Input database characteristics.
DB NT ALT AFI NI DB NT ALT AFI NI
T.55 27.6559 866 T5.39 28.627 866 T.333 28.48 867 T6.78 28.5625 864 T2.67 T7 27.647 867.984 28.4538 864 T3.226 T8 28.4365 866.85 28.5557 862 T4.367 T9 28.748 865.84 28.87 865
R 9.2439 R5 2.7 8384 9.8558 6.798 5847 R 9.292 R6 2.2654 8225 9.2 7.455 5788 R2 9.3367 R7 4.59657 699 9.55 7.3455 5788 R3 9.4898 R8 6.66259 626 9.997 8.693 5777 R4 9.9568 R9 6.3953 648 762.692 5.3479 5456
B 4 2. 4.943 874 B5 4 2. 28. B 4 2. 28. B6 4 2. 28. B2 4 2. 28. B7 4 2. 28. B3 4 2. 28. B8 4 2. 28. B4 4 2. 28. B9 23639 2. 472.78
R 6.367.734 5 R5 6.338.676 5 R 6.52.4 5 R6 6.624.248 5 R2 6.42.84 5 R7 6.45.83 5 R3 6.523.46 5 R8 6.579.58 5 R4 6.298.596 5 R9 6.652.34 5
R2 6.42 5.4347 996 R25 6.444 5.4553 997 R2 6.44 5.435 995 R26 6.477 5.4949 996 R22 6.556 5.5798 995 R27 6.477 5.4884 997 R23 6.529 5.537 998 R28 6.538 5.5572 996 R24 6.5 5.548 99 R29 6.5 5.5597 988

Table 3. Error of the experiments at given α.
Database: T10I4D100K, retail, BMS-Web-View-1, random500, random1000
α.5..9.5.4
Error type: AE ME AE ME AE ME AE ME AE ME
MDMT:.22.373.52.583.52.583.74.95.57.76
MDMT:.2.36.5.576.5.576.56.78.4.5
MDMT:.72.36.49.573.49.573.58.7.49.62
MDMT: PFM+SPS.32.359.48.573.48.573.26.45.27.43
MDMT:
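The error measures of Section 6, which Table 3 reports as AE and ME, can be illustrated with a short Python sketch. It assumes that the exact global supports and the local supports of each reported itemset are already available. Following the assumption stated with equation (6), the support in any database mined before the itemset first entered the local pattern base is entered as 0. All function names, the dictionary layout, and the toy numbers are hypothetical and serve only to show the arithmetic.

```python
def synthesized_support(local_supports, sizes):
    """Size-weighted support: sum_j supp(X, D_j) * |D_j| / sum_j |D_j|."""
    total = sum(sizes)
    weighted = sum(s * n for s, n in zip(local_supports, sizes))
    return weighted / total


def itemset_error(exact_support, local_supports, sizes):
    """Absolute gap between the exact and the synthesized support, as in equation (6)."""
    return abs(exact_support - synthesized_support(local_supports, sizes))


def average_and_maximum_error(exact, local, sizes):
    """Return (AE, ME) over all reported itemsets, as in equations (7) and (8)."""
    errors = [itemset_error(exact[x], local[x], sizes) for x in local]
    return sum(errors) / len(errors), max(errors)


# Hypothetical example with three databases of 100, 100, and 200 transactions.
sizes = [100, 100, 200]
local = {
    frozenset({"a"}): [0.5, 0.5, 0.5],  # reported from the first database onward
    frozenset({"b"}): [0.0, 0.3, 0.4],  # first reported in the second database
}
exact = {
    frozenset({"a"}): 0.5,
    frozenset({"b"}): 0.3,              # true support in D_1 was 0.1 but went unreported
}
ae, me = average_and_maximum_error(exact, local, sizes)
```

In this toy case the itemset {a} has no error, because it is known in every database, while {b} loses the contribution of the first database. This is the effect that the feedback of earlier local patterns is meant to reduce.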

8. Conclusions

In this paper, we discuss existing generalized as well as specialized multi-database mining techniques. For a particular problem, one technique may be more suitable than the others. Thus, one needs to study the details of each multi-database mining technique in order to select the right technique for solving a particular problem. We formalize the idea of multi-database mining using local pattern analysis by considering it as a two-step process. We propose here a new technique for mining multiple large databases. It significantly improves the accuracy of mining multiple databases as compared to the existing techniques that scan each database only once. The proposed technique is effective and promising. It could also be used for mining a large database by dividing it into sub-databases.

References

[1] Adhikari A. and Rao P., Efficient Clustering of Databases Induced by Local Patterns, Decision Support Systems, vol. 44, no. 4, pp. 925-943, 2008.
[2] Adhikari A. and Rao P., Synthesizing Heavy Association Rules from Different Real Data Sources, Pattern Recognition Letters, vol. 29, no. 1, pp. 59-71, 2008.
[3] Adhikari A. and Rao P., Study of Select Items in Multiple Databases by Grouping, in Proceedings of 3rd Indian International Conference on Artificial Intelligence, pp. 1699-1718, 2007.
[4] Agrawal R. and Srikant R., Fast Algorithms for Mining Association Rules, in Proceedings of Very Large Data Bases, pp. 487-499, Santiago, Chile, 1994.
[5] Frequent itemset mining dataset repository, http://fimi.cs.helsinki.fi/data/.
[6] Han J., Pei J., and Yin Y., Mining Frequent Patterns Without Candidate Generation, in Proceedings of SIGMOD, pp. 1-12, Dallas, Texas, USA, 2000.
[7] Last M. and Kandel A., Automated Detection of Outliers in Real-World Data, in Proceedings of the Second International Conference on Intelligent Technologies, pp. 292-301, 2001.
[8] Pyle D., Data Preparation for Data Mining, Morgan Kaufmann, San Francisco, 1999.
[9] Savasere A., Omiecinski E., and Navathe S., An Efficient Algorithm for Mining Association Rules in Large Databases, in Proceedings of Very Large Data Bases, pp. 432-443, 1995.
[10] Toivonen H., Sampling Large Databases for Association Rules, in Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 134-145, San Francisco, CA, USA, 1996.
[11] Wu X., Zhang C., and Zhang S., Database Classification for Multi-Database Mining, Information Systems, vol. 30, no. 1, pp. 71-88, 2005.
[12] Wu X. and Zhang S., Synthesizing High-Frequency Rules from Different Data Sources, IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 2, pp. 353-367, 2003.
[13] Zhang C., Liu M., Nie W., and Zhang S., Identifying Global Exceptional Patterns in Multi-Database Mining, IEEE Computational Intelligence Bulletin, vol. 3, no. 1, pp. 19-24, 2004.
[14] Zhang S., Wu X., and Zhang C., Multi-Database Mining, IEEE Computational Intelligence Bulletin, vol. 2, no. 1, pp. 5-13, 2003.

Animesh Adhikari is a lecturer in the Department of Computer Science, S P Chowgule College, India. In June 2008, he submitted his doctoral dissertation to the Department of Computer Science and Technology, Goa University, India. He received a Master of Technology in computer science from the Indian Statistical Institute, India. His areas of interest include data mining and knowledge discovery, decision support systems, database systems, and artificial intelligence.

Pralhad Ramachandrarao is a professor in the Department of Computer Science and Technology, Goa University, India. He received his PhD degree from the Indian Institute of Technology, Mumbai, India. His areas of interest include graph theory, data mining and knowledge discovery, and data warehousing.

Bhanu Prasad received Master of Technology and PhD degrees in computer science from Andhra University and the Indian Institute of Technology Madras, respectively. Currently, he is working as a faculty member in the Department of Computer and Information Sciences at Florida A&M University in Tallahassee, Florida, USA. His research interests include artificial intelligence with a special focus on knowledge representation, reasoning, and product recommending systems.

Jhimli Adhikari is a lecturer in the Department of Computer Science at Narayan Zantye College, Bicholim, Goa. She received a Master of Computer Application from Jadavpur University, Kolkata, India. Currently, she is a PhD student in the Department of Computer Science and Technology, Goa University, India.