MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Similar documents

CSE-E5430 Scalable Cloud Computing Lecture 2

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Chapter 7. Using Hadoop Cluster and MapReduce

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop Architecture. Part 1

Hadoop Parallel Data Processing

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Open source Google-style large scale data analysis with Hadoop

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Intro to Map/Reduce a.k.a. Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop

Big Data With Hadoop

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Hadoop 只支援用 Java 開發嘛? Is Hadoop only support Java? 總不能全部都重新設計吧? 如何與舊系統相容? Can Hadoop work with existing software?

Parallel Processing of cluster by Map Reduce

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

NoSQL and Hadoop Technologies On Oracle Cloud

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Big Data and Apache Hadoop s MapReduce

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

A very short Intro to Hadoop

THE HADOOP DISTRIBUTED FILE SYSTEM

MapReduce and Hadoop Distributed File System

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

GraySort and MinuteSort at Yahoo on Hadoop 0.23

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Open source large scale distributed data management with Google s MapReduce and Bigtable

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Internals of Hadoop Application Framework and Distributed File System

Data Mining with Hadoop at TACC

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

How To Use Hadoop

Hadoop IST 734 SS CHUNG

HDFS. Hadoop Distributed File System

Introduction to Hadoop

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Developing MapReduce Programs

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Hadoop Ecosystem B Y R A H I M A.

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Reduction of Data at Namenode in HDFS using harballing Technique

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Apache Hadoop. Alexandru Costan

Architectures for massive data management

MapReduce with Apache Hadoop Analysing Big Data

Map Reduce & Hadoop Recommended Text:

Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster"

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

MapReduce. Tushar B. Kute,

Distributed Computing and Big Data: Hadoop and MapReduce

16.1 MAPREDUCE. For personal use only, not for distribution. 333

HDFS: Hadoop Distributed File System

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Introduction to MapReduce and Hadoop

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Networking in the Hadoop Cluster

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

Understanding Hadoop Performance on Lustre

PassTest. Bessere Qualität, bessere Dienstleistungen!

How To Scale Out Of A Nosql Database

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200

Introduction to Hadoop

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Analysis of MapReduce Algorithms

Big Data and Scripting map/reduce in Hadoop

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

L1: Introduction to Hadoop

Can the Elephants Handle the NoSQL Onslaught?

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

MapReduce and Hadoop Distributed File System V I J A Y R A O

Storage Architectures for Big Data in the Cloud

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Transcription:

MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012

Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte scale Internet search indexes, analysis Google, yahoo facebook Recently: 8000 nodes sort 10PB in 6.5 hours Open source frameworks with different goals Hadoop, phoenix Lots of research in last 5 years Adapt scientific computation algorithms to MapReduce, performance analysis 1/19/2012 2

A programming model with some nice consequences Map(D) list(ki, Vi) Reduce(Ki, list(vi)) list(vf) Map: Apply a function to every member of dataset to produce a list of key-value pairs Dataset: set of values of uniform type D Image blobs, lines of text, individual points, etc Function: transforms each value into a list of zero or more key,value pairs of types Ki, Vi Reduce: Given a key and all associated values, do some processing to produce list of type Vf Execution over data is managed by a MapReduce framework 1/19/2012 3

Canonical example: Word Count D = lines of text Ki = Single Words Vi = Numbers Vf = Word/count pairs Map(D) list(ki, Vi) Reduce(Ki, list(vi)) list(vf) Map(D) = Emit pairs containing each word and the number 1 Reduce(Ki, list(vi)) = Sum all the numbers in the list associated with the given word. Emit the word and the resulting count 1/19/2012 4

Canonical example: Word Count absence of evidence (absence, 1) (of, 1) (absence, 1) (absence, 1) (absence, 2) (evidence, 1) (is, 1) (of, 1) (of, 1) (of, 2) is not evidence of (not, 1) (evidence, 1) (evidence, 1) (evidence, 1) (evidence, 2) (of, 1) (is, 1) (is, 1) absence (absence, 1) (not, 1) (not, 1) Map(D) list(ki, Vi) Reduce(Ki, list(vi)) list(vf) Somehow need to group by keys so Reduce can be given all associated values! 1/19/2012 5

Opportunities for Parallelism? absence of evidence (absence, 1) (of, 1) (absence, 1) (absence, 1) (absence, 2) (evidence, 1) (is, 1) (of, 1) (of, 1) (of, 2) is not evidence of (not, 1) (evidence, 1) (evidence, 1) (evidence, 1) (evidence, 2) (of, 1) (is, 1) (is, 1) absence (absence, 1) (not, 1) (not, 1) Promising Worrisome Promising 1/19/2012 6

Opportunities for Parallelism Map and Reduce functions are independent No explicit communication between them Grouping phase between Map and Reduce is the only point of data exchange Individual Map, Reduce results depend only on input value. Order of data, execution does not matter in the end. Input data read in parallel Output data written in parallel 1/19/2012 7

Parallel, Distributed execution absence of evidence is not evidence of absence absence of evidence is not evidence of absence (absence, 1) (of, 1) (evidence, 1) (absence, 1) (absence, 1) (is, 1) (not, 1) (evidence, 1) (of, 1) (absence, 1) (not, 1) (of, 1) (is, 1) (evidence, 1) (of, 1) (evidence, 1) (absence, 2) (not, 1) (of, 2) (is, 1) (evidence, 2) Map Reduce (absence, 2) (is, 1) (not, 1) (evidence, 2) 1/19/2012 (of, 2) 8

Full Parallel Pipeline Read Map Group Reduce Write Split (Combine) Partition 1/19/2012 9

Full Parallel Pipeline Split Divide data into parallel streams Use features of underlying storage technology File sharding, locality information, parallel data formats 1/19/2012 10

Full Parallel Pipeline Read Chop data into iterable units Most common in MapReduce world Lines of Text Can be arbitrary simple or complex integer arrays, pdf documents, mesh fragments, etc. 1/19/2012 11

Full Parallel Pipeline Map Apply a function, return a list of keys/values 1/19/2012 12

Full Parallel Pipeline Combine (optional) execute a mini-reduce on some set of map output For optimization purposes May not be possible for every algorithm 1/19/2012 13

Full Parallel Pipeline Group Group all results by key, collapse into a list of values for each key Need all intermediate values before this can complete Automatically performed by MapReduce framework 1/19/2012 14

Full Parallel Pipeline Partition Send grouped data to reduce processes Typically, just a dumb hash to evenly distribute Opportunities for balancing or other optimization. 1/19/2012 15

Full Parallel Pipeline Reduce Run a computation over each aggregated result, produce a final list of values 1/19/2012 16

Full Parallel Pipeline Write Move Reduce results to their final destination Could be storage, or another MapReduce process! 1/19/2012 17

Programming considerations You must provide: Map, Reduce functions You may provide: Combine, if it helps Partition function, if it matters Framework must provide: Grouping and data shuffling Framework may provide: Read, Write For simple data such as lines of text Split For parallel storage or data formats it knows about 1/19/2012 18

Benefits Presents an easy-to-use programming model No synchronization, communication by individual components. Ugly details hidden by framework. Execution managed by a framework Failure recovery (Maps/Reduces can always be re-run if necessary) Speculative execution (Several processes operate on same data, whoever finishes first wins) Load balancing Adapt and optimize for different storage paradigms 1/19/2012 19

Drawbacks Grouping/partitioning is serial! Need to wait for all map tasks to complete before any reduce tasks can be run Some algorithms may be hard to conceptualize in MapReduce. Some algorithms may be inefficient to express in terms of Map Reduce 1/19/2012 20

Hadoop Open Source MapReduce framework in Java Spinoff from Nuch web crawler project HDFS Hadoop Distributed Filesystem Distributed, fault-tolerant, sharding Many sub-projects Pig: Data-flow and execution language. Scripting for MapReduce Hive: SQL-like language for analyzing data Mahout: Machine learning and data mining libraries K-means clustering, Singular Value Decomposition, Bayesian classification 1/19/2012 21

Hadoop User provides java classes for Map, Reduce functions Can subclass or implement virtually every aspect of MapReduce pipeline or scheduling Streaming mode to STDIN, STDOUT of external map, reduce processes (can be implemented in any language) Lots of scientific data that goes beyond lines of text Lots of existing/legacy code that can be adapted/wrapped into a Map or Reduce stage. stream -input /datadir/datafile -file mymapper.sh -mapper mymapper.sh" -file myreducer.sh -reducer myreducer.sh" -output /datadir/myresults 1/19/2012 22

HDFS Data distributed among compute nodes Sharding: 64MB chunks Redundancy Small number of large files Not quite POSIX file semantics No random write, append Write-once read many Favor throughput over latency Streaming/sequential access to files 1/19/2012 23

HDFS NameNode DataNode 1 3 4 1 2 3 DataNode DataNode 1 2 3 3 Replication 4 2 4 Sharding DataNode 4 1/19/2012 24 1 2

HDFS + MapReduce 1 2 3 4 NameNode Locality metadata Split fn JobTracker 3 1 2 4 DataNode Map/Red 3 DataNode Map/Red 1 DataNode Map/Red 2 3 1 2 1 2 3 4 3 4 DataNode 4 Map/Red 1 4 1/19/2012 25 2

HDFS + MapReduce Assume failure-prone nodes Data and computation recovery through redundancy Move computation to data Data is local to computation, direct-attached storage to each node Sequential reads on large blocks Minimal contention Simultaneous maps/reduces on a node can be controlled by configuration 1/19/2012 26

Hadoop + HDFS vs HPC.. 3 1 4.... 2 1 3.... 3 2 4.... 4 1 2.. 1 2 3 4 1/19/2012 27

Hadoop in HPC environments Access to local storage can be problematic Local storage may not be available at all Even if so, long-term HDFS usually not possible HPC relies on global storage (e.g. Lustre) via highspeed interconnect. What is meaning of locality in inherently non-local (but parallel) storage? 1/19/2012 28

Hadoop @ TACC On Longhorn visualization cluster Special, local, persistent /hadoop filesystem on some machines 48 nodes with 2TB HDFS storage/node 16 nodes with 1TB HDFS storage/node, extra large memory (144GB memory) Modified hadoop distribution Starts HDFS on allocated nodes Special Hadoop queue By request only Details at https://sites.google.com/site/tacchadoop 1/19/2012 29

Dark Matter Halo Detection on Longhorn 200 TB simulated astronomy data Explore algorithms for identifying halo candidates by analyzing star density Compared compute parallel vs HDFS+MapReduce Data Parallel ~57k data points/node/hr for compute parallel ~600k data points/node/hr data parallel with MapReduce + HDFS 1/19/2012 30

BLAST at NERSC Streaming API Issues with data format. FASTA: sequence represented on multiple lines >some sort of header ACTGCATCATCATCATCAT GGGCTTACATCATCATCAT Lots of effort re-implementing basic data handling components (Reader, possibly Split) Eventually re-formatted data so each sequence was on own line Overall performance: Not significantly better than existing parallel methods 1/19/2012 31

Still much to learn Most established patterns are from web and text processing (inverted indexes, ranking, clustering, etc) Scientific data and algorithms much more varied Papers describing an existing problem applied to MapReduce are common When does HDFS provide benefit over traditional global shared FS? Tends to do poorly for small tasks, can be a crossover point that needs to be found Lots of tuning parameters Data skew and heterogeneity may lead to long, inefficient jobs. 1/19/2012 32

Why Hadoop? If you find the programming model simple/easy If you have a data intensive workload If you need fault tolerance If you have dedicated nodes available If you like Java If you want to experiment. 1/19/2012 33