Mining Association Rules on Grid Platforms




UNIVERSITY OF TUNIS EL MANAR
FACULTY OF SCIENCES OF TUNIS
Mining Association Rules on Grid Platforms
Raja Tlili (raja_tlili@yahoo.fr), Yahya Slimani (yahya.slimani@fst.rnu.tn)
CoreGrid 11

Plan
- Introduction
- Association rules
- The need for parallel computing
- Workload balancing: problem description
- Workload balancing in association rule mining algorithms
- Workload balancing in Grid computing
- The proposed load balancing model
- The dynamic load balancing strategy
- Experimental results

Introduction (1): Data vs knowledge
- Databases: the data is what they hold, the knowledge is hidden in it.
- Knowledge is more important than data: it supports decision making, to increase revenues and reduce costs.
- Data mining is the way to reach that knowledge.

Introduction (2): What is data mining?
Extracting knowledge from a large volume of data, where the knowledge is:
- Non-trivial
- Implicit
- Previously unknown
- Potentially useful

Association rules (1)
The use of knowledge:
- catalog design
- advertising strategies

Association rules (2)
Finding the rules A → B with support ≥ minsup and confidence ≥ minconf:
- support, s: probability that a transaction contains {A, B}
- confidence, c: conditional probability that a transaction containing A also contains B
- confidence(A → B) = support(A, B) / support(A)
[Figure: a transactional database (transactions T1, T2, T3, T4, ... over items A to I) next to a diagram of clients buying milk, clients buying sugar, and clients buying both]
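The two measures can be checked on a toy database. The sketch below is only an illustration: the function names and the five sample transactions are assumptions, not material from the slides. It shows that confidence divides by the support of the antecedent, so A → B and B → A generally score differently.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B together) divided by support(A), for the rule A -> B."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    both = support(frozenset(antecedent) | frozenset(consequent), transactions)
    return both / base


# Hypothetical sample data, invented for this example.
transactions = [
    frozenset({"milk", "sugar", "bread"}),
    frozenset({"milk", "sugar"}),
    frozenset({"milk", "coffee"}),
    frozenset({"milk", "tea"}),
    frozenset({"sugar", "tea"}),
]

print(support({"milk", "sugar"}, transactions))
print(confidence({"milk"}, {"sugar"}, transactions))
print(confidence({"sugar"}, {"milk"}, transactions))
```

Here milk → sugar and sugar → milk share the same support but not the same confidence, because milk and sugar occur in a different number of transactions.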

Extracting association rules: how?
The support and confidence thresholds, MinSup and MinConf, are fixed by the user.
Objective: find all association rules that satisfy both MinSup and MinConf.
Problem decomposition:
1. Finding all frequent itemsets (support ≥ MinSup)
2. Generating association rules (confidence ≥ MinConf)
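The two-phase decomposition can be sketched as follows. This is a minimal, sequential Apriori-style illustration and not the implementation used in the talk; the function names and the sample transactions are assumptions. Phase 1 scans the database once per itemset size, which is the repeated scanning that makes the problem I/O intensive.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Phase 1: level-wise search, returns {itemset: support}."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    k = 1
    while candidates:
        # One full database scan per level to count the size-k candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def generate_rules(frequent, min_conf):
    """Phase 2: split each frequent itemset into antecedent -> consequent."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]  # subsets of a frequent set are frequent
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules


# Hypothetical sample data, invented for this example.
transactions = [
    frozenset({"milk", "sugar", "bread"}),
    frozenset({"milk", "sugar"}),
    frozenset({"milk", "coffee"}),
    frozenset({"milk", "tea"}),
    frozenset({"sugar", "tea"}),
]

frequent = frequent_itemsets(transactions, min_sup=0.4)
rules = generate_rules(frequent, min_conf=0.6)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(sup, 2), round(conf, 2))
```

The number of candidates produced at each level depends on how correlated the data is, so the cost of a scan is hard to predict before the run starts.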

The need for parallel computing
- Databases to be mined are often very large (GB to TB).
- The transactional database has to be scanned repeatedly (iteratively).
- Disk access is costly.
- Hence the need for fast algorithms for discovering association rules.

Main challenges facing parallelism
- Workload balancing
- Synchronisation and communication minimization
- Finding a good data layout and data decomposition
- Disk I/O minimization

Load balancing: Problem description
Workload balancing is the assignment of work to processors in a way that maximizes application performance, by minimizing:
- processor idle time
- inter-processor communication

Causes of load imbalance
- Homogeneous environments: even if the database is partitioned equally, imbalance occurs because of differences in data correlation.
- Heterogeneous platforms: processors have different capacities and network links have different speeds (examples: heterogeneous clusters, grid platforms).

Related work
The majority of current approaches use static load balancing, based on finding some intelligent way of partitioning the database before execution [Marteen Altorf 2007].

Proposed Load Balancing Approach: Characteristics
[Figure: taxonomy of load balancing policies, with the branches static / dynamic, centralized / distributed, one-time assignment / dynamic reassignment, local / global, adaptive / non-adaptive, cooperative / non-cooperative]

Proposed Load Balancing Approach: Goals
- Improving the efficiency and the scalability of ARM algorithms on Grid platforms: exploiting parallelism at various levels, and considering the particular features of the target platform.
- Focusing on adaptiveness: dynamic policies for load balancing and partitioning.

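Proposed load balancing model
Let $G = (S_1, S_2, \ldots, S_T)$ with $S_i = (M_i, \mathrm{Coord}(S_i), \mathrm{Mem}_i, \mathrm{Stor}_i, \mathrm{Band}_i)$, where:
- $S_i$: site $i$
- $M_i$: total number of clusters in $S_i$
- $\mathrm{Coord}(S_i)$: coordinator node of the site $S_i$
- $\mathrm{Mem}_i$: memory size, $\mathrm{Mem}_i = \sum_{j=1}^{NN_i} \mathrm{Mem}_{i,j}$
- $\mathrm{Stor}_i$: capacity of the storage subsystem, $\mathrm{Stor}_i = \sum_{j=1}^{NN_i} \mathrm{Stor}_{i,j}$
- $\mathrm{Band}_i$: bandwidth of the network
- $cl_{ij}$: cluster $j$ of $S_i$, with cluster coordinator $\mathrm{Coord}(cl_{ij})$
- $nd_{ijk}$: node $k$ of $cl_{ij}$
[Figure: sites connected through a network, each with a site coordinator, cluster coordinators, nodes, and the databases BD1, BD2, BD3]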
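As a rough illustration of this model, the site tuple can be represented with a few small classes. Everything here is hypothetical: the class and field names, the units, and the choice to sum memory and storage over all nodes of a site (the slide does not define $NN_i$, so that reading is an assumption).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    mem_gb: float   # memory of this node (unit assumed)
    stor_gb: float  # storage of this node (unit assumed)


@dataclass
class Cluster:
    coordinator: str                      # plays the role of Coord(cl_ij)
    nodes: List[Node] = field(default_factory=list)


@dataclass
class Site:
    coordinator: str                      # plays the role of Coord(S_i)
    band_mbps: float                      # plays the role of Band_i (unit assumed)
    clusters: List[Cluster] = field(default_factory=list)

    @property
    def m(self) -> int:
        """M_i: number of clusters in the site."""
        return len(self.clusters)

    @property
    def mem(self) -> float:
        """Mem_i, taken here as the sum over every node of the site."""
        return sum(n.mem_gb for c in self.clusters for n in c.nodes)

    @property
    def stor(self) -> float:
        """Stor_i, taken here as the sum over every node of the site."""
        return sum(n.stor_gb for c in self.clusters for n in c.nodes)

    def as_tuple(self):
        """The tuple (M_i, Coord(S_i), Mem_i, Stor_i, Band_i)."""
        return (self.m, self.coordinator, self.mem, self.stor, self.band_mbps)


# Hypothetical site with two clusters and three nodes.
site = Site(
    coordinator="site-coord",
    band_mbps=1000.0,
    clusters=[
        Cluster("cl-1-coord", [Node("nd-1-1", 8, 100), Node("nd-1-2", 16, 200)]),
        Cluster("cl-2-coord", [Node("nd-2-1", 4, 50)]),
    ],
)
print(site.as_tuple())
```

Keeping the aggregates as computed properties means a coordinator never stores a stale total when nodes join or leave a cluster.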

Load balancing strategy: (1) Before execution
[Figure: the database is split into DB Partition 1, DB Partition 2, ..., DB Partition n, assigned to the sites S1, S2, ..., Sn; each site processes its own partition and the sites are linked by the network]

Load balancing strategy: (1) Before execution
[Figure: Step I, K = 1: Coord(S_i) distributes the database D over the sites S1, S2, S3 and the processors P0 to P3]
Steps:
- The database D is partitioned between the sites according to their capacities.
- Every processor has its local database.
- Local results are merged at the end of each iteration.
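A capacity-proportional split of this kind can be sketched as below. The slides do not say how capacity is measured, so the single numeric score per site is an assumption, as are the function names; leftover transactions are handed out by largest remainder so that the partition sizes always add up to the database size.

```python
def partition_sizes(n_transactions, capacities):
    """Number of transactions per site, proportional to each capacity score."""
    total = sum(capacities)
    if total <= 0:
        raise ValueError("capacities must sum to a positive value")
    exact = [n_transactions * c / total for c in capacities]
    sizes = [int(x) for x in exact]
    leftover = n_transactions - sum(sizes)
    # Give the remaining transactions to the largest fractional parts.
    order = sorted(range(len(capacities)),
                   key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def partition(transactions, capacities):
    """Cut the transaction list into consecutive chunks, one per site."""
    parts, start = [], 0
    for size in partition_sizes(len(transactions), capacities):
        parts.append(transactions[start:start + size])
        start += size
    return parts


# Hypothetical example: ten transactions, three sites with scores 3, 2 and 5.
database = [f"T{i}" for i in range(1, 11)]
for site_id, chunk in enumerate(partition(database, [3, 2, 5]), start=1):
    print(f"S{site_id}: {chunk}")
```

A static split like this only balances the amount of data, not the work it generates, which is why the strategy continues with a dynamic phase during execution.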

Load balancing strategy: (2) During execution
❶ At the intra-site level: the coordinator updates its global workload vector by acquiring workload information from each local node.
[Figure: local nodes sending their state vectors to the coordinator over the network]

Load balancing strategy: (2) During execution
❶ At the Grid level: the coordinators of the different sites periodically exchange their global state information.
[Figure: site coordinators exchanging global state information over the network]

Load balancing strategy: (2) During execution
❷ Intra-site migration of candidates (itemsets such as {A, B, C, ...}), with the condition:
$EET_{i,j} > \mathrm{Coefinter} \cdot (CCN_{i,j,k} + EET_{i,k})$

Load balancing strategy: (2) During execution
❷ Inter-site migration of transactions (for example T: A,B,C,I,J; T: D,E,F,H,I,K; T: D,F,H,I,H,J; ...; T: C,F,J,L,M), with the condition:
$EET_{i,j} > \mathrm{Coefintra} \cdot (CCS_{i,p} + EET_{p,q})$
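Both migration conditions have the same shape: work moves only if the overloaded side's EET exceeds a coefficient times the sum of a communication term and the receiver's EET. The sketch below encodes that test. It reads EET as an estimated execution time and CCN / CCS as communication costs, which the slides do not spell out, and the function names, the sample values, and the rule for choosing among several eligible targets are all assumptions.

```python
def should_migrate(eet_source, eet_target, comm_cost, coef):
    """The slide's test: EET_source > coef * (comm_cost + EET_target)."""
    return eet_source > coef * (comm_cost + eet_target)


def pick_target(eet_source, candidates, coef):
    """Choose a receiver among `candidates`, a dict name -> (comm_cost, eet).

    Only targets that pass the migration test are eligible. Picking the one
    with the smallest comm_cost + eet is an assumption of this sketch.
    Returns None when no target qualifies.
    """
    eligible = {
        name: cost + eet
        for name, (cost, eet) in candidates.items()
        if should_migrate(eet_source, eet, cost, coef)
    }
    if not eligible:
        return None
    return min(eligible, key=eligible.get)


# Hypothetical values, in seconds.
candidates = {"n2": (10.0, 40.0), "n3": (5.0, 110.0)}
print(pick_target(120.0, candidates, coef=1.5))
```

A coefficient above 1 adds a safety margin, so that work is not moved when the expected gain is too small to cover the cost of the transfer.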

Load balancing strategy: (2) During execution
❸ The coordinator sends the migration plan to all processing nodes and instructs them to reallocate the workload.
This process is invoked periodically: the coordinators check the workload imbalance condition at fixed intervals.

Experimental results
Experiments were run in a Grid computing environment: Grid 5000, made up of 5000 CPUs distributed over 9 sites: Lille, Rennes, Orsay, Nancy, Lyon, Bordeaux, Grenoble, Toulouse, Sophia.

Experimental results
Database DB100T13M:
- Database size: 100 MB
- Number of transactions: 1 300 000
- Number of items: 4000
- Average transaction size: 25
Platform: 2 sites, each containing 2 clusters; 16 computational nodes (3 nodes in cluster 1, 2 in cluster 2, 4 in cluster 3, 7 in cluster 4).
[Figure (b), DB100T13M: run time (sec) against minimum support (0.5% to 3%) for three runs: sequential, parallel without load balancing, parallel with load balancing]

Experimental results
- There is no fixed optimal number of processors for an execution.
- The number of processors used should be proportional to the size of the data sets to be mined.
- The easiest way to determine that optimal number is by experiment.

Conclusion and future work
- Association rule mining algorithms are simple to state, but they are computationally and I/O intensive (a performance problem).
- Parallel and distributed computing is essential for providing scalable mining solutions, and can play an important role in improving performance.
- The dynamic nature of association rule mining algorithms causes load imbalance between the processing nodes during execution; dynamic load balancing strategies are needed to solve this problem.

Conclusion and future work
- We developed a distributed dynamic load balancing strategy for a Grid computing environment.
- Experiments showed that the strategy reduces the execution time of iterative association rule mining algorithms, through a good distribution of the workload among the processors of the Grid.
- Work migration has long been known in task scheduling; here it is adapted to ARM algorithms.
- ARM algorithms were executed on Grid platforms with good results, even with their various synchronization phases.
- The parameters of the strategy are fixed according to the characteristics and the specificities of this technique.
