Main Memory Map Reduce (M3R)




Main Memory Map Reduce (M3R). PL Day, September 23, 2013. Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat, Wei Zhang. In collaboration with: Yan Li*, Dave Grove, Mikio Takeuchi**, Salikh Zakirov**, Juemin Zhang. * IBM China Research Lab; ** IBM Tokyo Research Lab; otherwise IBM TJ Watson Research Lab, New York.

M3R (engine) Hadoop performance low latency resilience scalability X10 Java in-memory out-of-core

X10: a Java-like language designed for performance and productivity at scale. Asynchronous Partitioned Global Address Space programming model: async S runs S as a separate activity; at (P) S runs S at place P; finish S waits for termination of children activities. MPI-style barriers, local atomic synchronization. Advanced type system: reified generics, closures, dependent types. Multiple backends: Java, C++, CUDA. http://x10-lang.org/
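The finish/async pairing can be approximated outside X10. What follows is a rough Python sketch, not X10 and not M3R code: a thread pool stands in for async, and waiting on the futures stands in for finish. The helper name finish_async is hypothetical, and at (P) has no analogue here because everything runs in a single process.

```python
from concurrent.futures import ThreadPoolExecutor


def finish_async(tasks):
    """Run each zero-argument callable as a separate activity and wait for all.

    Hypothetical helper: submit() plays the role of X10's `async S`, and
    collecting every result before returning plays the role of `finish`.
    """
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task) for task in tasks]  # async S
        return [f.result() for f in futures]             # finish: wait for children


# Five independent activities; results come back in submission order.
squares = finish_async([lambda i=i: i * i for i in range(5)])
print(squares)
```

The sketch only captures the "spawn, then wait for all children" shape; X10 activities can also be nested and moved between places, which a single thread pool does not model.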

Main-Memory Map Reduce in X10: sorting (keys), multi-threading, secondary sorting (values), iterative jobs, combiners, debugging, map-only jobs, out-of-core shuffle, profiling, user-controlled serialization.

M3R/Hadoop Architecture (diagram labels): Java Hadoop App; X10 M3R jobs; multiple jobs; multiple jobs; M3R/Hadoop adaptor; Hadoop Map Reduce Engine; M3R Engine; JVM only; JVM/Native; HDFS; HDFS; HDFS data; Java M3R jobs; X10; Java.

Compatibility M3R / Hadoop Performance

Hadoop Job (diagram labels): File System (HDFS). Stages: Input (InputFormat/RecordReader/InputSplit), Map (Mapper), Reduce (Reducer), Output (OutputFormat/RecordWriter/OutputCommitter). Between Map and Reduce: File System, Shuffle, File System. 2013 IBM Corporation.

M3R/Hadoop Job: cache (diagram labels): File System (HDFS), Cache. Stages: Input (InputFormat/RecordReader/InputSplit), Map (Mapper), Reduce (Reducer), Output (OutputFormat/RecordWriter/OutputCommitter). Between Map and Reduce: File System, Shuffle, File System.

M3R/Hadoop Job: in-memory (diagram labels): File System (HDFS), Cache. Stages: Input (InputFormat/RecordReader/InputSplit), Map (Mapper), Reduce (Reducer), Output (OutputFormat/RecordWriter/OutputCommitter). Between Map and Reduce: File System, Shuffle, File System.

M3R/Hadoop Job: co-location (diagram labels): File System (HDFS), Cache. Stages: Input (InputFormat/RecordReader/InputSplit), Map (Mapper), Reduce (Reducer), Output (OutputFormat/RecordWriter/OutputCommitter). Between Map and Reduce: File System, Shuffle, File System.

Iterated Matrix Vector multiplication. Algorithm ("standard HPC"): Row block partition G. Replicate V. In parallel, at each place, multiply each row of G with V. In parallel, each place broadcasts its segment of V to all others; this reassembles V for the next phase. Performance key: read the appropriate part of G once, never communicate it. Reassembly is local.
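The algorithm above can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not M3R or Hadoop code: places are simulated as list entries, G is a small dense matrix rather than the sparse one used in the benchmark, and the function names row_blocks and iterate are hypothetical.

```python
def row_blocks(n_rows, n_places):
    """Split row indices 0..n_rows-1 into contiguous blocks, one per place."""
    base, extra = divmod(n_rows, n_places)
    blocks, start = [], 0
    for p in range(n_places):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def iterate(G, v, n_places, n_iters):
    """Repeatedly compute v = G * v with G row-block partitioned across places."""
    blocks = row_blocks(len(G), n_places)
    # Each place keeps its rows of G for the whole run; G is never communicated.
    local_G = [[G[i] for i in block] for block in blocks]
    for _ in range(n_iters):
        # Each place multiplies its rows of G with its replica of v
        # (run sequentially here; real places would work in parallel).
        segments = [
            [sum(g * x for g, x in zip(row, v)) for row in rows]
            for rows in local_G
        ]
        # "Broadcast": every place receives all segments and reassembles v.
        v = [x for segment in segments for x in segment]
    return v


G = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
result = iterate(G, [1, 1, 1], n_places=2, n_iters=2)
print(result)
```

Only the segments of V move between iterations; the blocks of G are read once when local_G is built, which is the property the slide calls the performance key.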

M3R/Hadoop Job: locality (diagram labels): File System (HDFS), Cache. Stages: Input (InputFormat/RecordReader/InputSplit), Map (Mapper), Reduce (Reducer), Output (OutputFormat/RecordWriter/OutputCommitter). Between Map and Reduce: Partitioner, Shuffle.

Partition Stability in M3R. The reducer associated with a given partition number will always be run at the same place, assuming the number of reducers and the number of places remain the same. The number of reducers is determined by the application. The number of places is fixed for the duration of the M3R server.
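A small Python sketch of what this guarantee means in practice follows. The slide does not say how partitions are assigned to places, so the modulo rule in place_for_partition is an assumption made for illustration, as are the toy partitioner and all names here; none of this is the M3R or Hadoop API.

```python
def place_for_partition(partition, num_places):
    """Map a partition number to a place (assumed rule: modulo, for illustration)."""
    return partition % num_places


def partition_for_key(key, num_reducers):
    """Toy partitioner: a deterministic byte-sum hash modulo the reducer count."""
    return sum(key.encode("utf-8")) % num_reducers


num_reducers, num_places = 8, 4

# Two successive jobs that see overlapping keys.
jobs = [["row-17", "row-3"], ["row-3", "row-17", "row-99"]]
placements = [
    {
        key: place_for_partition(partition_for_key(key, num_reducers), num_places)
        for key in job
    }
    for job in jobs
]

# With the same reducer and place counts, a key lands at the same place in both jobs.
print(placements[0]["row-17"] == placements[1]["row-17"])
```

Because the mapping depends only on the partition number and the fixed counts, data cached at a place by one job is where the next job's reducer for that partition will run, which is what makes the iterated matrix-vector pattern local.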

[Figure: Sparse Matrix Vector Multiplication. Two panels plot Time (s) against Size M (G is an MxM matrix with sparsity 0.001). The first panel shows M3R/Hadoop and Hadoop, each with an exponential trend line; the second shows M3R/Hadoop alone with its exponential trend line. Annotation: 50X.]

Compatibility M3R / Hadoop Performance

JobTracker

Compatibility M3R / Hadoop Performance

DML results (Nov. 2011)

GNNMF               Hadoop    M3R     Speedup
100K                1489s     115s    13x
200K                1492s     185s    8.1x
400K                1481s     300s    4.9x

Linear Regression   Hadoop    M3R     Speedup
1000K               1272s     120s    10.6x
3000K               1438s     185s    7.8x
5000K               1473s     275s    5.4x

PageRank            Hadoop    M3R     Speedup
100K                880s      452s    1.9x
200K                885s      574s    1.5x
400K                872s      530s    1.7x

Pig unit tests

Current Status / Future Work. VLDB '12: Avraham Shinnar, David Cunningham, Vijay Saraswat, and Benjamin Herta. 2012. M3R: increased performance for in-memory Hadoop jobs. Proc. VLDB Endow. 5, 12 (August 2012), 1736-1747. Things generally work quite well. Working on out-of-core shuffle (performance degradation instead of crashing). Working on dynamic class loading.