Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang




Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion

Overview * What we did * Compared different platforms for processing graph mining algorithms * Evaluated the usability of each platform * Analyzed the results of these experiments

Approaches & Environment * Approaches * Algorithms: Degree Distribution, Weakly Connected Component, PageRank * Dataset: 2.7 GB Kronecker graph produced by the method introduced in [1] * Environment * AWS EC2 (us-east-1 / m3.2xlarge) * 1, 3, 6 nodes * Ubuntu 12.04 LTS 64-bit * Systems tested: HP Vertica, SciDB, Apache Hadoop, PostgreSQL [1] "Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication," Leskovec, Jurij, et al.
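For context on the dataset: a deterministic Kronecker graph grows a small seed adjacency matrix by repeated Kronecker multiplication. The sketch below is a minimal pure-Python illustration of that idea only; the actual 2.7 GB dataset from [1] is produced by the stochastic variant at far larger scale, and all names here are illustrative.

```python
def kron(a, b):
    """Kronecker product of two square 0/1 adjacency matrices."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

def kronecker_graph(seed, k):
    """k-th Kronecker power of the seed: the adjacency matrix of a
    deterministic Kronecker graph with len(seed)**k vertices."""
    g = seed
    for _ in range(k - 1):
        g = kron(g, seed)
    return g

# A 2x2 seed grows into a 4x4 adjacency matrix after one multiplication.
print(kronecker_graph([[1, 1], [0, 1]], 2))
```

Each squaring multiplies the vertex count by the seed size, which is how a tiny seed yields a multi-gigabyte edge list after a handful of iterations.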

Approaches & Environment * How we implemented these algorithms * Vertica: * Used Java as the host language (Vertica's UDFs do not support loops) * PostgreSQL: * Used the PL/pgSQL language to define UDFs (the logic and main SQL statements are the same as in the Vertica version) * Hadoop: * Degree Distribution: one MapReduce job * Weakly Connected Component: three phases [initialization, computation (iterations), final] * PageRank: similar to Weakly Connected Component * SciDB: * Array-based matrix multiplication
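The degree-distribution logic, which each system expresses in its own way (one MapReduce job, a SQL aggregation, an array operation), reduces to two counting passes. A hedged pure-Python sketch, with an illustrative edge list rather than the study's dataset:

```python
from collections import Counter

def degree_distribution(edges):
    """First pass: count each vertex's degree (both endpoints of every
    edge). Second pass: count how many vertices share each degree."""
    degree = Counter()                    # vertex -> degree
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return Counter(degree.values())      # degree -> number of vertices

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(degree_distribution(edges))
```

In the MapReduce version the first pass is the map (emit each endpoint) plus a reduce (sum per vertex), and the second pass is a further aggregation over the per-vertex counts.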

Approaches & Environment * Special notes * Why did we implement only a one-node version on PostgreSQL? * Why didn't we implement a one-node version on Hadoop? * Why did we implement only one-node and three-node versions on Vertica? * Why did we choose a 2.7 GB dataset?

Results * [Bar chart: Loading Time in seconds (0 to 1400) for 1, 3, and 6 nodes on Hadoop, Vertica, SciDB, and PostgreSQL]

Observations * Loading time * Hadoop loads data most efficiently * Vertica also has a large advantage over the others * SciDB and PostgreSQL perform quite poorly * Analysis: * Hadoop can load data without extra work; it simply divides the file into chunks and replicates them. * Vertica has a hybrid storage model (WOS, TM and ROS) in which the WOS is optimized for data updates. * PostgreSQL manages a temporary buffer for data loading that is very small by default, so loading incurs many disk writes. * SciDB supports parallel loading with multiple instances, but it is quite slow; the user guide does not describe in-situ data processing.

Results * [Bar chart: Degree Distribution time in seconds (0 to 250) for 1, 3, and 6 nodes on Hadoop, Vertica, SciDB, and PostgreSQL]

Observations * Degree Distribution * Hadoop performs poorly, and its performance does not scale linearly in this case * Vertica finishes this task within seconds * Although it runs the same logic, PostgreSQL is less efficient than Vertica * SciDB takes minutes to finish the work because of its logical storage layout * Analysis: * Hadoop has a large launching overhead and incurs many disk writes and network communications even for a simple job * Vertica is a read-optimized column store (compression strategy; it only needs to access the first column) * PostgreSQL needs to access the whole row * SciDB has to redimension the data and store the temporary result.

Results * [Bar chart: Weakly Connected Component time in seconds (0 to 10000) for 1, 3, and 6 nodes on Hadoop, Vertica, PostgreSQL, and SciDB]

Observations * Weakly Connected Component * Hadoop takes hours (3 nodes) or nearly an hour (6 nodes) to finish the job * Vertica finishes the job in 200 seconds (1 node) or 140 seconds (3 nodes) * PostgreSQL takes nearly 3 hours to finish the job * SciDB takes about 2 hours to finish the job * Analysis: * Hadoop runs many iterations, and each iteration starts a new MapReduce job; the launching, disk I/O and network communication overhead is huge. * The SQL for this job contains many deletes, inserts and updates. Vertica performs well because of the WOS, a memory-resident data structure; the WOS has no compression or indexing, which enables fast data updates. * PostgreSQL performs poorly because of disk I/O: even if we increase the temp buffer size, it is still not sufficient to hold the huge intermediate temporary table. * In SciDB, we implemented WCC based on an adjacency matrix and perform join operations on dimensions. It performs better than PostgreSQL.
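The three-phase WCC structure used in the Hadoop implementation (initialization, iterated computation, finalization) can be sketched as min-label propagation in plain Python. This mirrors the algorithmic logic only, not any system's actual code; vertex ids and edges are illustrative.

```python
def weakly_connected_components(vertices, edges):
    # Initialization phase: each vertex starts as its own component.
    label = {v: v for v in vertices}
    changed = True
    # Computation phase: propagate the minimum label across every edge
    # until a full pass makes no change (one pass ~ one MapReduce job).
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    # Final phase: labels now identify the weakly connected components.
    return label

print(weakly_connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```

Each `while` pass corresponds to one MapReduce iteration, which is exactly why Hadoop pays the job-launch and disk I/O overhead once per pass.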

Results * [Bar chart: PageRank time in seconds (0 to 60000) for 1, 3, and 6 nodes on Hadoop, Vertica, PostgreSQL, and SciDB]

Observations * PageRank * Hadoop performs poorly but scales well * Vertica finishes the task in 20 minutes (1 node) or 10 minutes (3 nodes) * PostgreSQL takes more than half a day to finish * SciDB takes nearly 3 hours to finish the task * Analysis: * Hadoop scales well because of the independence among Mappers and Reducers * The SQL for this job consists mostly of table creates and drops and join operations. Vertica arranges data not as tables but as projections, so it is less efficient on this task than on Weakly Connected Component; however, pre-join projections speed up the join operations. * PostgreSQL performs poorly for the same reason as before: the temp buffer is too small to hold the whole temporary table, causing many disk I/Os. * Similar to WCC, we implemented PageRank based on an adjacency matrix and perform a matrix-vector multiplication to update the PageRank values in each iteration.
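The matrix-vector iteration behind the SciDB-style PageRank can be sketched as follows. This is a minimal illustration of the update rule, assuming the usual damping factor of 0.85 and an even spread for dangling vertices; it is not the study's implementation.

```python
def pagerank(n, edges, d=0.85, iters=50):
    """Power iteration: rank_{t+1} = (1-d)/n + d * A^T rank_t,
    with dangling mass redistributed uniformly."""
    out = [[] for _ in range(n)]          # out-neighbor lists (sparse A)
    for u, v in edges:
        out[u].append(v)
    rank = [1.0 / n] * n
    for _ in range(iters):                # one matrix-vector product each
        nxt = [(1.0 - d) / n] * n
        for u in range(n):
            if out[u]:
                share = d * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:                          # dangling vertex: spread evenly
                for v in range(n):
                    nxt[v] += d * rank[u] / n
        rank = nxt
    return rank
```

The inner loops are precisely a sparse matrix-vector multiplication, which is why an array database can express the update as a single multiply per iteration, while the SQL versions must emulate it with joins and table rewrites.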

Observations * Usability * Hadoop * Easy to set up and program * Vertica * Has full documentation, though with some minor inaccuracies * Uses almost the same SQL semantics as other traditional relational DBMSs * Supports simple UDFs as well as UDxs, user-defined functions written in other languages through the Vertica SDK * PostgreSQL * Supports complex UDFs * SciDB * Gives the user more freedom, but also more challenges, in designing arrays * The system's components are decoupled, so it is easy to add and remove nodes * It's tricky to implement an efficient graph mining algorithm with an array-based schema.

Notes * We tried Giraph & Hama, but failed * The documentation has bugs and is incomplete * The community is small * It is very hard to start a cluster * Usability is important

Conclusion * Hadoop performs poorly on graph-related problems (iterations) * Vertica can process massive updates efficiently thanks to its hybrid storage model * PostgreSQL, as a traditional relational database, is not suitable for large graph problems (single-node restriction) * SciDB has a design specialized for array-based data, but it is not expressive enough for complex graph mining algorithms.