Management & Analysis of Big Data in Zenith Team



Management & Analysis of Big Data in Zenith Team Zenith Team, INRIA & LIRMM

Outline
Introduction to MapReduce
Dealing with Data Skew in Big Data Processing
Data Partitioning for MapReduce
Frequent Sequence Mining
Frequent Itemset Mining
Conclusion

MapReduce: A Framework for Big Data Processing
Introduced by Google to support parallel processing of highly distributable problems, e.g. PageRank. Hadoop: an open source (Apache) implementation of MR, used by many companies for Big Data processing, e.g. Yahoo, Facebook.
Programming principle:
Map step: the input data is divided into smaller splits, and each split is processed by a map worker to produce a set of intermediate key-value pairs.
Reduce step: all values of each key are transferred to one reduce worker, where a reduce function is applied to them to produce the final results.
Shuffle: the process of sorting, grouping and transferring the intermediate key-value pairs from map to reduce workers.
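As a rough illustration, the three steps above can be sketched in plain Python on a word-count job (a local simulation for exposition, not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(split):
    """Map step: emit a (word, 1) pair for every word in an input split."""
    return [(word, 1) for word in split.split()]

def shuffle(intermediate):
    """Shuffle step: group all intermediate values by key, as happens
    between the map and reduce workers."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: aggregate all values of one key into a final result."""
    return (key, sum(values))

splits = ["big data big", "data processing"]
intermediate = [kv for s in splits for kv in map_phase(s)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
# result == {'big': 2, 'data': 2, 'processing': 1}
```

In real Hadoop the splits, shuffle and reduce tasks run on different machines; here they are just function calls.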

MapReduce Architecture

Dealing With Data Skew in MapReduce
Problem: poor performance in case of skew in the reduce phase. In some applications, a high percentage of the values is processed by one reducer, e.g. Top-k, Skyline, some aggregate queries.

Our Solution: FP-Hadoop
Main idea: an intermediate reduce (IR) phase. An IR function (e.g. a combiner) is executed over blocks of intermediate values, using all workers of the system.
Features of the IR phase: collaborative reducing of each key; hierarchical execution plans; optimized scheduling of distributed blocks.
In contrast to the combiner function, the IR function is applied over distributed intermediate blocks.

Data Flow in FP-Hadoop

Performance Evaluation
Test platform: Grid5000. FP-Hadoop is up to 10x faster than Hadoop (MR) in reduce time, and 5x faster in total execution time.

Data Partitioning for Reducing Data Transfer in MapReduce
The shuffle phase may involve large data transfers, because each mapper sends high volumes of data to each reducer. Result: high response time for some jobs because of a slow shuffle.
Ideal case: the values of each key are produced at one map worker, and are reduced by the same worker.
[Figure: map and reduce tasks placed on two workers, comparing the normal situation with the ideal situation]
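The intuition of assigning each key to the worker that produces most of its values can be sketched like this (a toy model with made-up map outputs and a deterministic stand-in for the default hash partitioner; MR-Part's actual technique is more elaborate):

```python
from collections import defaultdict

# Made-up map outputs: worker id -> (key, value) pairs produced there
map_outputs = {
    0: [("a", 1), ("a", 2), ("b", 3)],
    1: [("b", 4), ("c", 5), ("c", 6)],
}

def transferred(assignment):
    """Count pairs that cross the network: pairs whose assigned reduce
    worker differs from the map worker that produced them."""
    return sum(1 for worker, pairs in map_outputs.items()
               for key, _ in pairs if assignment[key] != worker)

# Locality-oblivious assignment (deterministic stand-in for hash partitioning)
default_assignment = {k: sum(map(ord, k)) % 2 for k in ("a", "b", "c")}

# Locality-aware assignment: send each key to the worker that produces
# most of its values, so most pairs are reduced where they were mapped
counts = defaultdict(lambda: defaultdict(int))
for worker, pairs in map_outputs.items():
    for key, _ in pairs:
        counts[key][worker] += 1
locality_assignment = {k: max(c, key=c.get) for k, c in counts.items()}

assert transferred(locality_assignment) < transferred(default_assignment)
```

On this tiny example the locality-aware assignment ships 1 pair over the network instead of 3.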

Our Contribution MR-Part: A new approach for minimizing data transfers in MapReduce Implemented on top of Hadoop

Experiments
Environment: Grid5000
Comparison: Native Hadoop (NAT); Hadoop + reduce locality-aware scheduling (RLS); MR-Part (MRP)
Benchmark: TPC-H, MapReduce version
Parameters: data size, cluster size, bandwidth
Metrics: transferred data, latency (response time)

Performance Evaluation: percentage of transferred data

Frequent Sequences: TeraBytes of data (ongoing work with Bull)
Bull's clusters (e.g. CURIE) need careful management and real-time monitoring. Cluster nodes write lots of messages to log files that cannot be explored by humans (hundreds of terabytes). Zenith is designing massively distributed data mining methods that scale to analyzing this huge data. Patterns discovered from these logs will feed rule bases that allow monitoring the clusters and triggering alarms in case of a possible anomaly.

Frequent Sequences: Monitoring (ongoing work with Bull)
Message logs (hundreds of terabytes):
47 2013 Jun 30 03:29:07 kay0 daemon info named error 1#53
47 2013 Jun 30 03:29:07 kay0 daemon info named error 2#53
48 2013 Jun 30 03:29:07 kay0 daemon info named error 1#53
49 2013 Jun 30 03:29:09 kay0 daemon info named error 5#53
50 2013 Jun 30 03:29:09 kay0 daemon info named error 5#53
50 2013 Jun 30 03:29:09 kay475 syslog err syslog-ng I/O
50 2013 Jun 30 03:29:09 kay475 syslog notice syslog-ng Error
51 2013 Jun 30 03:29:09 kay475 syslog notice syslog-ng Suspending
52 2013 Jun 30 03:29:10 kay0 daemon info named error 5#53
53 2013 Jun 30 03:29:10 kay0 daemon info named error 5#53
53 2013 Jun 30 03:29:10 kay0 daemon info named error 5#53
53 2013 Jun 30 03:29:11 kay0 daemon info named error f#53
Patterns -> Rules:
Daemon error + syslog error -> Suspending
User notice + IO warning -> syslog error
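A minimal sketch of how frequent sequences in such a log can yield candidate rules (the event names and thresholds here are made up for illustration; the actual Zenith methods are massively distributed, not a single-machine loop):

```python
from collections import Counter

# Simplified event stream, as might be extracted from the raw log lines
events = ["daemon_error", "daemon_error", "io_error", "syslog_error",
          "suspending", "daemon_error", "io_error", "syslog_error"]

def frequent_sequences(stream, length, min_support):
    """Count contiguous subsequences of the given length and keep those
    occurring at least min_support times."""
    counts = Counter(tuple(stream[i:i + length])
                     for i in range(len(stream) - length + 1))
    return {seq: c for seq, c in counts.items() if c >= min_support}

patterns = frequent_sequences(events, 2, 2)
# ('io_error', 'syslog_error') appears twice, a candidate for a rule
# in the spirit of "I/O error -> syslog error" above
```

A frequent sequence whose prefix is seen at runtime can then trigger an alarm before the predicted event occurs.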

Data Partitioning for Frequent Itemset Mining (ongoing work)
Improving frequent itemset mining algorithms based on input data partitioning: the mappers of MapReduce work on partitions. A new data partitioning technique: item-based data partitioning. Objective: mining several terabytes of data (ClueWeb).
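A simplified sketch of what item-based partitioning can look like (the item-to-partition map and the prefix-projection rule are illustrative assumptions, not the actual Zenith technique): each partition is responsible for a group of items and receives only the transaction fragments relevant to that group.

```python
# Toy transaction database and a hypothetical item -> partition map
transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
groups = {"a": 0, "b": 0, "c": 1}

def partition(transactions, groups):
    """Send to each partition the prefix of every transaction up to the
    first item of that partition's group, at most once per group."""
    parts = {}
    for t in transactions:
        seen = set()
        for i, item in enumerate(t):
            g = groups[item]
            if g not in seen:
                seen.add(g)
                parts.setdefault(g, []).append(t[:i + 1])
    return parts

parts = partition(transactions, groups)
# Partition 1 (items {'c'}) receives the three transactions containing 'c';
# each mapper can then mine its partition independently.
```

The benefit is that each mapper mines a smaller, self-contained projection of the database instead of the whole input.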

Hadoop_g5k: a Tool for Easy Hadoop Deployment in Grid5000
Overview: initiated by Miguel Liroz (research engineer in Zenith). Scripts and APIs to deploy Hadoop applications in G5K, based on the execo library. Documentation available in the G5K wiki, and sources on GitHub.
Features: manages the whole life-cycle of Hadoop clusters; supports several versions of Hadoop and hides configuration details; automatic configuration based on best practices (topology, number of slots, memory per task, etc.).

Conclusion
Big Data processing and analysis: dealing with skew in big data processing; data partitioning for reducing network traffic in the MapReduce framework; error pattern detection in Bull supercomputer logs; large-scale frequent itemset mining.
To evaluate all these solutions we use Grid5000. Requirement: more storage capacity.

Questions? Web site: search for "Zenith Team" in Google. Email: Reza.Akbarinia@inria.fr

hg5k: An example
Creation + installation + initialization/start:
hg5k --create $OAR_FILE_NODES
hg5k --bootstrap $LIB_HOME/hadoop-1.2.1.tar.gz
hg5k --initialize start
Job execution:
hg5k --jarjob $LIB_HOME/hadoop-test-1.2.1.jar mrbench
Destruction:
hg5k --delete

hadoop_engine
Features: test automatization for Hadoop-based experiments; optimizes dataset creation and/or deployment; generates dataset/experiment configurations + statistics for all combinations; (to appear) automatic generation of figures from results.
Example (from the wiki):

[test_parameters]
# Where results are stored
test.summary_file = ./test/summary.csv
test.ds_summary_file = ./test/ds-summary.csv
test.stats_path = ./test/stats

[ds_parameters]
# Dataset configuration
ds.class = hadoop_g5k.dataset.staticdataset
ds.class.local_path = datasets/ds1
ds.dest = ${data_dir}
ds.size = 1073741824, 2147483648  # 1 and 2 GB

[xp_parameters]
# Experiments configuration
io.sort.factor = 10, 100
xp.combiner = true, false
xp.job = program.jar ${xp.combiner} other_job_options ${xp.input} ${xp.output}