UNIVERSITY OF TUNIS EL MANAR - FACULTY OF SCIENCES OF TUNIS
Mining Association Rules on Grid Platforms
Raja Tlili raja_tlili@yahoo.fr
Yahya Slimani yahya.slimani@fst.rnu.tn
CoreGrid 11
Plan
- Introduction
- Association rules
- The need for parallel computing
- Workload balancing: problem description
  - Workload balancing in association rule mining algorithms
  - Workload balancing in Grid computing
- The proposed load balancing model
- The dynamic load balancing strategy
- Experimental results
Introduction (1)
Data vs. knowledge:
- Databases contain data; knowledge is hidden in it
- Knowledge is more important than data: it drives decision making, to increase revenues and reduce costs
- Data mining is the process that extracts this knowledge
Introduction (2)
What is data mining?
Extracting, from a large volume of data, knowledge that is:
- Non-trivial
- Implicit
- Previously unknown
- Potentially useful
Association rules (1)
The use of the extracted knowledge:
- catalog design
- advertising strategies
Association rules (2)
Finding the rules A => B with support >= minsup and confidence >= minconf:
- support, s: probability that a transaction contains {A, B}
- confidence, c: conditional probability that a transaction containing A also contains B
  confidence(A => B) = support(A, B) / support(A)
[Figure: a transactional database (transactions T1..T4 over items A..I), showing clients buying milk, clients buying sugar, and clients buying both.]
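The two measures above can be sketched in a few lines; this is a minimal illustration over a toy database, not the algorithm evaluated later (function names and the example transactions are our own):

```python
def support(db, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(db, antecedent, consequent):
    """P(consequent | antecedent) = support(A u B) / support(A)."""
    return support(db, antecedent | consequent) / support(db, antecedent)

# Toy transactional database: each transaction is a set of items.
db = [{"milk", "sugar"}, {"milk", "bread"},
      {"milk", "sugar", "bread"}, {"bread"}]

s = support(db, {"milk", "sugar"})       # 2 of 4 transactions -> 0.5
c = confidence(db, {"milk"}, {"sugar"})  # 0.5 / 0.75 = 2/3
```

The rule milk => sugar thus holds with support 0.5 and confidence about 0.67 in this toy database.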
Extracting association rules: how?
The support and confidence thresholds, MinSup and MinConf, are fixed by the user.
Objective: finding all association rules that satisfy MinSup and MinConf.
Problem decomposition:
1. Finding all frequent itemsets (support >= MinSup)
2. Generating association rules from them (confidence >= MinConf)
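Step 1, the expensive one, is typically solved level-wise (Apriori style). A minimal sequential sketch of frequent-itemset mining, assuming transactions are sets of hashable items (the candidate generation here is deliberately coarse, without full subset pruning):

```python
from itertools import combinations

def frequent_itemsets(db, minsup):
    """Return {itemset: support} for all itemsets with support >= minsup."""
    n = len(db)
    # Level 1: frequent single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: c / n for k, c in counts.items() if c / n >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Level k: candidates are k-combinations of items seen at level k-1
        # (a correct superset; real Apriori also prunes on infrequent subsets).
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        frequent = {}
        for cand in candidates:
            sup = sum(1 for t in db if cand <= t) / n
            if sup >= minsup:
                frequent[cand] = sup
        result.update(frequent)
        k += 1
    return result
```

Rule generation (step 2) then only reads this table: for each frequent itemset and each split into antecedent/consequent, it keeps the splits whose confidence reaches MinConf.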
The need for parallel computing
- Databases to be mined are often very large (GBs to TBs)
- The transactional database has to be scanned repeatedly (iteratively), so the cost of disk access is high
Hence the need for fast algorithms for discovering association rules.
Main challenges facing parallelism
- Workload balancing
- Synchronization & communication minimization
- Finding a good data layout & data decomposition
- Disk I/O minimization
Load balancing: problem description
Workload balancing is the assignment of work to processors in a way that maximizes application performance, by minimizing processor idle time and inter-processor communication.
Causes of load imbalance
- Homogeneous environments: even if we partition the database equally, imbalance still occurs due to differences in data correlation.
- Heterogeneous platforms: processors have different capacities and network speeds (e.g., heterogeneous clusters, Grid platforms).
Related work
The majority of current approaches use static load balancing, based on finding some intelligent way of partitioning the database before execution [Maarten Altorf 2007].
Proposed load balancing approach: characteristics
Taxonomy of load balancing policies:
- Static vs. Dynamic
- One-time assignment vs. Reassignment
- Centralized vs. Distributed
- Local vs. Global
- Adaptive vs. Non-adaptive
- Cooperative vs. Non-cooperative
Proposed load balancing approach: goals
Improving the efficiency and scalability of ARM algorithms on Grid platforms by:
- exploiting parallelism at various levels;
- considering the particular features of the target platform;
- focusing on adaptiveness: dynamic policies for load balancing and partitioning.
Proposed load balancing model
Let G = (S_1, S_2, ..., S_T) be the Grid, where each site S_i = (M_i, Coord(S_i), Mem_i, Stor_i, Band_i) with:
- S_i: site i
- M_i: total number of clusters in S_i
- Cl_ij: cluster j of S_i, with Coord(Cl_ij) its cluster coordinator
- nd_ijk: node k of Cl_ij
- Coord(S_i): coordinator node of the site S_i
- Mem_i: memory size of the site, Mem_i = sum over the NN_i nodes j of S_i of Mem_ij
- Stor_i: capacity of the storage subsystem, Stor_i = sum over the NN_i nodes j of S_i of Stor_ij
- Band_i: bandwidth of the network
[Figure: sites with their local databases, connected through the network.]
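The model above maps naturally onto nested data structures; this is a hypothetical sketch that mirrors the slide's notation (the class layout and field names are our assumption, not the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mem: float   # Mem_ij: memory size of node nd_ijk
    stor: float  # Stor_ij: storage capacity of node nd_ijk

@dataclass
class Cluster:
    nodes: list           # the nodes nd_ijk of cluster Cl_ij
    coordinator: int = 0  # index of Coord(Cl_ij) in `nodes`

@dataclass
class Site:
    clusters: list        # the clusters Cl_ij; M_i = len(clusters)
    band: float = 1.0     # Band_i: network bandwidth
    coordinator: int = 0  # Coord(S_i)

    @property
    def mem(self):
        # Mem_i: sum of the memories of the NN_i nodes of the site
        return sum(n.mem for c in self.clusters for n in c.nodes)

    @property
    def stor(self):
        # Stor_i: sum of the storage capacities of the site's nodes
        return sum(n.stor for c in self.clusters for n in c.nodes)
```

A Grid is then simply a list of `Site` objects, G = [S_1, ..., S_T].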
Load balancing strategy: (1) before execution
[Figure: the database is split into partitions DB_1, ..., DB_n, one per site S_1, ..., S_n; each site processes its partition and the sites communicate through the network.]
Load balancing strategy: (1) before execution
Steps:
- Partitioning the database D between sites according to their capacities
- Every processor gets its local database partition
- Merging local results at the end of each iteration
[Figure: at iteration k=1, coordinator Coord(S_i) distributes D among processors P_0..P_3 across sites S_1, S_2, S_3.]
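The first step above can be sketched as capacity-proportional partitioning. The slide only says "according to their capacities", so the scalar capacity score below is our assumption (it could combine Mem_i, Stor_i, Band_i and CPU count):

```python
def partition(db, capacities):
    """Split the transaction list `db` into one slice per site,
    with slice sizes proportional to each site's capacity score."""
    total = sum(capacities)
    shares = [int(len(db) * c / total) for c in capacities]
    # Integer truncation can leave a few transactions unassigned;
    # hand the remainder out round-robin.
    for i in range(len(db) - sum(shares)):
        shares[i % len(shares)] += 1
    parts, start = [], 0
    for s in shares:
        parts.append(db[start:start + s])
        start += s
    return parts
```

For instance, two sites with capacity scores 1 and 3 would receive roughly one quarter and three quarters of the transactions, respectively.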
Load balancing strategy: (2) during execution
❶ At the intra-site level, the coordinator updates its global workload vector by acquiring workload information from each local node.
[Figure: nodes send their state vectors to the site coordinator over the network.]
Load balancing strategy: (2) during execution
❶ At the Grid level, the coordinators of the different sites periodically exchange their global state information.
[Figure: site coordinators exchange global state information over the network.]
Load balancing strategy: (2) during execution
❷ Intra-site candidate itemsets migration (e.g., {A, B, C, ...}) between nodes of the same site, triggered when:
EET_ij > Coef_intra * (CCN_ijk + EET_ik)
Load balancing strategy: (2) during execution
❷ Inter-site transactions migration (e.g., T: A,B,C,I,J) between sites, triggered when:
EET_ij > Coef_inter * (CCS_ip + EET_pq)
Load balancing strategy: (2) during execution
❸ The coordinator sends the migration plan to all processing nodes and instructs them to reallocate the workload.
This process is invoked periodically: coordinators check the workload imbalance condition at fixed time intervals.
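The migration decision described in steps ❷ and ❸ can be sketched as follows. The inequality mirrors the slide's condition EET_ij > Coef * (cost + EET_ik); the plan-building heuristic (pick the cheapest acceptable target) and all numeric values are our assumptions:

```python
def should_migrate(eet_src, eet_dst, comm_cost, coef):
    """Migrate work only if the overloaded node's estimated execution
    time exceeds the destination's time plus the communication cost,
    scaled by a tunable coefficient (Coef_intra or Coef_inter)."""
    return eet_src > coef * (comm_cost + eet_dst)

def migration_plan(eet, comm, coef):
    """Pair each overloaded node with its cheapest acceptable target.

    eet:  estimated execution time per node
    comm: comm[src][dst] = communication cost between src and dst
    """
    plan = {}
    for src, t in enumerate(eet):
        targets = [d for d in range(len(eet))
                   if d != src and should_migrate(t, eet[d], comm[src][d], coef)]
        if targets:
            # Prefer the destination with the smallest scaled total cost.
            plan[src] = min(targets,
                            key=lambda d: coef * (comm[src][d] + eet[d]))
    return plan
```

The coordinator would evaluate `migration_plan` at each fixed-period check and broadcast the resulting source-to-destination pairs as the migration plan.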
Experimental results
Experimentation under a Grid computing environment: Grid'5000, made up of 5000 CPUs distributed over 9 sites: Lille, Rennes, Orsay, Nancy, Lyon, Bordeaux, Grenoble, Toulouse, Sophia.
Experimental results
Dataset DB100T13M:
- Database size: 100 MB
- Number of transactions: 1,300,000
- Number of items: 4,000
- Average transaction size: 25
Setup: 2 sites, each containing 2 clusters; 16 computational nodes (3 nodes/cluster 1, 2 nodes/cluster 2, 4 nodes/cluster 3, 7 nodes/cluster 4).
[Figure (b), DB100T13M: run time (sec) vs. minimum support (0.5% to 3%) for the sequential version, the parallel version without load balancing, and the parallel version with load balancing.]
Experimental results
There is no fixed optimal number of processors to use for execution: the number of processors should be proportional to the size of the data sets to be mined. The easiest way to determine this optimal number is through experiments.
Conclusion and future work
- Association rule mining algorithms have a simple statement, but they are computationally and I/O intensive (performance problem).
- Parallel & distributed computing is essential for providing scalable mining solutions and can play an important role in improving performance.
- The dynamic nature of association rule mining algorithms causes load imbalance between the processing nodes during execution; dynamic load balancing strategies are needed to solve this problem.
Conclusion and future work
- We developed a distributed dynamic load balancing strategy for a Grid computing environment.
- Experiments showed that our strategy succeeds in reducing the execution time of iterative association rule mining algorithms, through a good distribution of the workload among the processors of the Grid.
- Work migration has long been used in task scheduling; we adapted it to ARM algorithms.
- We executed ARM algorithms on Grid platforms and obtained good results, even with the various synchronization phases.
- The parameters of the strategy are fixed according to the characteristics and specificities of this technique.
UNIVERSITY OF TUNIS EL MANAR FACULTY OF SCIENCES OF TUNIS Raja Tlili raja_tlili@yahoo.fr Yahya Slimani yahya.slimani@fst.rnu.tn