Mining Large Datasets: Case of Mining Graph Data in the Cloud



Similar documents
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

CSE-E5430 Scalable Cloud Computing Lecture 2

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Large-Scale Data Processing

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop Architecture. Part 1

Jeffrey D. Ullman slides. MapReduce for data intensive computing

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Chapter 7. Using Hadoop Cluster and MapReduce

Log Mining Based on Hadoop s Map and Reduce Technique

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Architectures for Big Data Analytics A database perspective

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Approaches for parallel data loading and data querying

Market Basket Analysis Algorithm on Map/Reduce in AWS EC2

Hadoop. Sunday, November 25, 12

Hadoop Ecosystem B Y R A H I M A.

BIG DATA What it is and how to use?

How To Scale Out Of A Nosql Database

A Brief Outline on Bigdata Hadoop

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Advances in Natural and Applied Sciences

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

MapReduce and Hadoop Distributed File System V I J A Y R A O

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Big Data. Lyle Ungar, University of Pennsylvania

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

MapReduce and Hadoop Distributed File System

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Big Systems, Big Data

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Hadoop IST 734 SS CHUNG

Scaling Out With Apache Spark. DTL Meeting Slides based on

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Big Data Systems CS 5965/6965 FALL 2015

Big Data and Apache Hadoop s MapReduce

Energy Efficient MapReduce

Viswanath Nandigam Sriram Krishnan Chaitan Baru

Big Data on Microsoft Platform

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

BIG DATA TRENDS AND TECHNOLOGIES

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Efficient Analysis of Big Data Using Map Reduce Framework

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Map Reduce & Hadoop Recommended Text:

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Map-Reduce for Machine Learning on Multicore

GeoGrid Project and Experiences with Hadoop

Big Data Analytics Hadoop and Spark

Apache Hadoop. Alexandru Costan

Cloud computing doesn t yet have a

A Performance Analysis of Distributed Indexing using Terrier

Cloud Computing at Google. Architecture

NoSQL Data Base Basics

Intro to Map/Reduce a.k.a. Hadoop

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Report: Declarative Machine Learning on MapReduce (SystemML)

Big Data: Tools and Technologies in Big Data

Hadoop Cluster Applications

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Challenges for Data Driven Systems

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

International Journal of Engineering Research ISSN: & Management Technology November-2015 Volume 2, Issue-6

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Implementing Graph Pattern Mining for Big Data in the Cloud

ISSN: Keywords: HDFS, Replication, Map-Reduce I Introduction:

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Big Application Execution on Cloud using Hadoop Distributed File System

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Unified Big Data Analytics Pipeline. 连 城

Unified Big Data Processing with Apache Spark. Matei

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Concept and Project Objectives

Comparison of Different Implementation of Inverted Indexes in Hadoop

Analysis of MapReduce Algorithms

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

HiBench Introduction. Carson Wang Software & Services Group

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

The Internet of Things and Big Data: Intro

Transcription:

Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 1 / 50

Context and motivations Background Application domains Computer networks, Social networks, Bioinformatics, Chemoinformatics. Graph representation Data modeling. Identifying relationship patterns and rules. Protein structure Chemical compound Social network Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 2 / 50

Context and motivations Background Mining graph data Graph mining aims to find patterns, hidden relations and behaviors in data. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 3 / 50

Context and motivations Background Mining graph data Graph mining aims to find patterns, hidden relations and behaviors in data. Mining graph goals Computing graph properties: Density, diameter, radius,... Mining substructures from graph databases. Substructures: paths, trees, subgraphs. Frequent Subgraph Mining (FSM) task. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 3 / 50

Context and motivations Background Availability of graph data Exponential growth in both size and number of graphs in databases. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 4 / 50

Context and motivations Background Availability of graph data Exponential growth in both size and number of graphs in databases. Availability of graph data sources: The protein data bank (PDB) contains 95280 of protein 3D structures. Facebook loads 60 terabytes of new data every day [Thusoo 2010]. Google processes 20 petabytes of data per day [Dean 2008]. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 4 / 50

Context and motivations Background Availability of graph data Exponential growth in both size and number of graphs in databases. Availability of graph data sources: The protein data bank (PDB) contains 95280 of protein 3D structures. Facebook loads 60 terabytes of new data every day [Thusoo 2010]. Google processes 20 petabytes of data per day [Dean 2008]. 3Vs of Big Data (Volume, Velocity and Variety). Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 4 / 50

Context and motivations Background Availability of graph data Exponential growth in both size and number of graphs in databases. Availability of graph data sources: The protein data bank (PDB) contains 95280 of protein 3D structures. Facebook loads 60 terabytes of new data every day [Thusoo 2010]. Google processes 20 petabytes of data per day [Dean 2008]. 3Vs of Big Data (Volume, Velocity and Variety). Availability of cloud computing environments. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 4 / 50

Context and motivations Background In this work We are interested to FSM from graph databases. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 5 / 50

Context and motivations Background In this work We are interested to FSM from graph databases. Frequent subgraph mining algorithms Various approaches of FSM. Existing approaches are mainly: Tested on centralized computing systems. Evaluated on relatively small databases. Few works for FSM in the cloud. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 5 / 50

Goals Questions Distributed FSM from large graph database. Data/computation distribution. Tuning cloud parameters. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 6 / 50

Outline 1 Background 2 3 Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 7 / 50

Outline Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 1 Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 2 3 Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 8 / 50

Outline Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 1 Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 2 Distributed subgraph mining in the cloud 3 Prospects Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 9 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works Graph A graph is denoted as G = (V,E) where V is a set of nodes and E is a set of edges. Subgraph A graph G = (V,E ) is a subgraph of another graph G = (V,E) iff: V V, and E E (V V ). Density The density of a graph G = (V,E) is calculated by density(g) = 2 E ( V ( V 1)). Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 10 / 50

Outline Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 1 Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 2 Distributed subgraph mining in the cloud 3 Prospects Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 11 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works Cloud computing Large number of computers that are connected via Internet. Applications delivered as services. Hardware and system software delivered as services. Pay as you go. Cloud services can be rapidly and elastically provisioned. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 12 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works Service models Software as a Service (SaaS). Platform as a Service (PaaS), Infrastructure as a Service (IaaS), Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 13 / 50

Outline Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 1 Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 2 Distributed subgraph mining in the cloud 3 Prospects Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 14 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works MapReduce framework A framework for processing huge datasets. Large number of computers and task/node failures. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 15 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works MapReduce framework A framework for processing huge datasets. Large number of computers and task/node failures. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 15 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 16 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works SPARK framework A general engine for large-scale data processing. Combine SQL, streaming, and complex analytics. It offers several high-level operators that make it easy to build parallel applications. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 17 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works SHARK framework A distributed SQL query engine for Hadoop. Based on SPARK and uses the existing Hive client and metastore. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 18 / 50

Outline Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 1 Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 2 Distributed subgraph mining in the cloud 3 Prospects Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 19 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works Cloud-based FSM techniques Cloud-based FSM approaches from: 1 Single large graphs (MRPF [Liu 2009] and Wu etal. s approach [Wu 2010]). MRPF [Liu 2009], and Wu etal. s approach [Wu 2010]. 2 Massive graph databases (Hill etal. s [Hill 2012] and Luo etal. s [Luo 2011]). Hill etal. s [Hill 2012], and Luo etal. s [Luo 2011]. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 20 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 21 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 21 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works In this work We focus on distributed FSM techniques from large graph databases. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 22 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works In this work We focus on distributed FSM techniques from large graph databases. Three crucial problems with existing approaches: Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 22 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works In this work We focus on distributed FSM techniques from large graph databases. Three crucial problems with existing approaches: 1 No data partitioning according to data characteristics. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 22 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works In this work We focus on distributed FSM techniques from large graph databases. Three crucial problems with existing approaches: 1 No data partitioning according to data characteristics. 2 Do not include the monetary aspect of cloud computing. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 22 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works In this work We focus on distributed FSM techniques from large graph databases. Three crucial problems with existing approaches: 1 No data partitioning according to data characteristics. 2 Do not include the monetary aspect of cloud computing. 3 Construct the final set of frequent subgraphs iteratively. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 22 / 50

Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works In this work We focus on distributed FSM techniques from large graph databases. Three crucial problems with existing approaches: 1 No data partitioning according to data characteristics. 2 Do not include the monetary aspect of cloud computing. 3 Construct the final set of frequent subgraphs iteratively. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 22 / 50

Distributed subgraph mining in the cloud Outline 1 Background 2 Distributed subgraph mining in the cloud 3 Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 23 / 50

Distributed subgraph mining in the cloud Outline 1 Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 2 Distributed subgraph mining in the cloud 3 Prospects Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 24 / 50

Distributed subgraph mining in the cloud Problem formulation Notations DB ={G 1,...,G K } is a large scale graph database, SM ={M 1,...,M N } is a set of distributed machines, θ [0,1] is a minimum support threshold, Part(DB) ={Part 1 (DB),...,Part N (DB)} is a partitioning of the database over SM such that Part j (DB) DB is a non-empty subset of DB, Ni=1 {Part i (DB)} = DB,and, i j,part i (DB) Part j (DB) = /0. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 25 / 50

Distributed subgraph mining in the cloud Problem formulation Globally frequent subgraph For a given minimum support threshold θ [0,1], G is globally frequent subgraph if Support(G,DB) θ. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 26 / 50

Distributed subgraph mining in the cloud Problem formulation Globally frequent subgraph For a given minimum support threshold θ [0,1], G is globally frequent subgraph if Support(G,DB) θ. Locally frequent subgraph For a given minimum support threshold θ [0,1] and a tolerance rate τ [0,1], G is locally frequent subgraph at site i if Support(G,Part i (DB)) ((1 τ) θ). Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 26 / 50

Distributed subgraph mining in the cloud Problem formulation Globally frequent subgraph For a given minimum support threshold θ [0,1], G is globally frequent subgraph if Support(G,DB) θ. Locally frequent subgraph For a given minimum support threshold θ [0,1] and a tolerance rate τ [0,1], G is locally frequent subgraph at site i if Support(G,Part i (DB)) ((1 τ) θ). Loss rate Given S 1 and S 2 two sets of subgraphs with S 2 S 1 and S 1 /0, we define the loss rate in S 2 compared to S 1 by: LossRate(S 1,S 2 ) = S 1 S 2. S 1 Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 26 / 50

Distributed subgraph mining in the cloud System overview Approach overview Two-step approach: 1 Partitioning step, 2 Mining step. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 27 / 50

Distributed subgraph mining in the cloud Partitioning step Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 28 / 50

Distributed subgraph mining in the cloud Partitioning step Partitioning methods Many partitioning methods are possible. We consider: 1 MRGP: the default MapReduce partitioning method. 2 DGP: a density-based partitioning method. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 29 / 50

Distributed subgraph mining in the cloud Partitioning step Partitioning methods Many partitioning methods are possible. We consider: 1 MRGP: the default MapReduce partitioning method. 2 DGP: a density-based partitioning method. MRGP Based on the size on disk. Map-skew problems (highly variable runtimes). No data characteristics included. DGP Based on graph density. May ensures load balancing among machines. May exploit other data characteristics. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 29 / 50

Distributed subgraph mining in the cloud Map-Skew problems Map-skew Skew: highly variable task runtimes. Origin: Characteristics of the algorithm. Characteristics of the dataset. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 30 / 50

Distributed subgraph mining in the cloud Partitioning step: DGP method DGP overview Two-levels approach: 1 Dividing the graph database into B buckets, 2 Constructing the final list of partitions. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 31 / 50

Distributed subgraph mining in the cloud Distributed FSM step Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 32 / 50

Distributed subgraph mining in the cloud Distributed FSM step Distributed FSM step A single MapReduce job. Input: a set of partitions. Output: the set of globally frequent subgraphs. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 33 / 50

Distributed subgraph mining in the cloud Distributed FSM step Distributed FSM step A single MapReduce job. Input: a set of partitions. Output: the set of globally frequent subgraphs. In the Mapper machine We run a subgraph mining technique on each partition in parallel. Mapper i produces a set of locally frequent subgraphs. Pairs of s,support(s,part i (DB)). Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 33 / 50

Distributed subgraph mining in the cloud Distributed FSM step Distributed FSM step A single MapReduce job. Input: a set of partitions. Output: the set of globally frequent subgraphs. In the Mapper machine We run a subgraph mining technique on each partition in parallel. Mapper i produces a set of locally frequent subgraphs. Pairs of s,support(s,part i (DB)). In the Reducer machine We compute the set of globally frequent subgraphs Pairs of s, Support(s, DB). No false positives generated. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 33 / 50

Distributed subgraph mining in the cloud Distributed FSM step Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 34 / 50

Distributed subgraph mining in the cloud Experiments Implementation platform Hadoop 0.20.1 release, an open source version of MapReduce. A local cluster with five nodes. A Quad-Core AMD Opteron(TM) Processor 6234 2.40 GHz CPU. 4 GB of memory. Three existing subgraph miners: gspan, FSG and Gaston. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 35 / 50

Distributed subgraph mining in the cloud Experiments Implementation platform Hadoop 0.20.1 release, an open source version of MapReduce. A local cluster with five nodes. A Quad-Core AMD Opteron(TM) Processor 6234 2.40 GHz CPU. 4 GB of memory. Three existing subgraph miners: gspan, FSG and Gaston. Datasets Six datasets composed of synthetic and real ones. Different parameters such as: the number of graphs, the average size of graphs in terms of edges and the size on disk. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 35 / 50

Distributed subgraph mining in the cloud Experiments Table: Experimental data. Dataset Type Number of graphs Size on disk Average size DS1 Synthetic 20,000 18 MB [50-100] DS2 Synthetic 100,000 81 MB [50-70] DS3 Real 274,860 97 MB [40-50] DS4 Synthetic 500,000 402 MB [60-70] DS5 Synthetic 1,500,000 1.2 GB [60-70] DS6 Synthetic 100,000,000 69 GB [20-100] Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 36 / 50

Distributed subgraph mining in the cloud Experiments Experimental protocol Three types of experiments: 1 Quality: MRGP vs. DGP. Comparison with random sampling method. 2 Load balancing and execution time: Performance evaluation tests. Scalability tests. 3 Impact of MapReduce parameters. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 37 / 50

Distributed subgraph mining in the cloud Experiments: Quality gspan, θ = 30% gspan, θ = 50% Table: Number of false positives of the sampling method. Dataset DS1 DS2 DS3 Support θ (%) Number of subgraphs gspan FSG Gaston Number of Number of Number of Number of false positives subgraphs false positives subgraphs Number of false positives 30 4421 4078 4401 4078 4401 4078 50 194 155 174 153 174 153 30 164 139 144 58 144 58 50 29 4 12 4 12 4 30 264 195 258 193 258 193 50 62 30 59 30 59 30 Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 38 / 50

Distributed subgraph mining in the cloud Experiments: Quality Result quality Distributed FSM vs. classic one. Low values of loss rate with DGP. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 39 / 50

Distributed subgraph mining in the cloud Experiments: Load balancing and execution time Runtime and workload distribution DGP enhances the performance of our approach. Balanced workload distribution over the distributed machines. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 40 / 50

Distributed subgraph mining in the cloud Experiments: Impact of MapReduce parameters Chunk size and replication factor High runtime values with small chunk size. The runtime is inversely proportional to the replication factor. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 41 / 50

Prospects Outline 1 Background 2 3 Prospects Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 42 / 50

Prospects Outline 1 Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 2 Distributed subgraph mining in the cloud 3 Prospects Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 43 / 50

Prospects At a glance A MapReduce-based framework for distributing FSM in the cloud. Many partitioning techniques of the input graph database. Many subgraph extractors. A data partitioning technique that considers data characteristics. It uses the density of graphs. Balanced computational load over the distributed machines. Experiment validation. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 44 / 50

Prospects Outline 1 Background Graph mining Cloud computing Frameworks for large data processing in the cloud Related works 2 Distributed subgraph mining in the cloud 3 Prospects Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 45 / 50

Prospects Prospects Improvements of the cloud-based FSM approach Different topological graph properties. Relation between database characteristics and the choice of the partitioning technique. Open questions What is the maximum number of buckets and/or partitions? What is the size of chunk to use in the partitioning step and in the distributed subgraph mining step? Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 46 / 50

Prospects Prospects Performance and scalability improvement Runtime improvement with task and node failures. Ensure minimal loss of information in the case of failures. Portability improvement Extension of our approach to SPARK, SHARK, Open Computing Language (OpenCL) and Message Passing Interface (MPI). Deployment of the approach Study the integration of our approach to recent distributed machine learning toolkits such as the Apache Mahout project and SystemML. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 47 / 50

Prospects Work in progress Cost models Cost models for distributing frequent pattern mining in the cloud. Application to distributed frequent subgraphs. Objective functions that consider the needs of customers: Budget limit, Response time limit, and Result quality limit. Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 48 / 50

Prospects Publications Journals S. Aridhi, L. d Orazio, M. Maddouri et E. Mephu Nguifo. Un partitionnement basé sur la densité de graphe pour approcher la fouille distribuée de sous-graphes fréquents. Techniques et Science Informatiques. (Accepted) S. Aridhi, L. d Orazio, M. Maddouri and E. Mephu Nguifo. Density-based data partitioning strategy to approximate large scale subgraph mining. Information Systems, Elsevier, ISSN 0306-4379, http://dx.doi.org/10.1016/j.is.2013.08.005, 2014. (In press) Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 49 / 50

Thank You! Sabeur Aridhi Mining Large Datasets - Big Data Forum - Lyon 50 / 50