Data-Intensive Computing with Hadoop. Thanks to: Milind Bhandarkar <milindb@yahoo-inc.com> Yahoo! Inc.




Agenda: Hadoop Overview, HDFS, Programming Hadoop, Architecture, Examples, Hadoop Streaming, Performance Tuning, Debugging Hadoop Programs 2

Hadoop overview: Apache Software Foundation project. A framework for running applications on large clusters, modeled after Google's MapReduce / GFS framework and implemented in Java. Includes HDFS, a distributed filesystem, and Map/Reduce, an offline computing engine. Recently: libraries for machine learning and sparse matrix computation. Yahoo! is the biggest contributor. A young project, but already used by many. 3

Hadoop clusters: It's used in clusters with thousands of nodes at Internet services companies. 4

Who Uses Hadoop? Amazon/A9 Facebook Google IBM Intel Research Joost Last.fm New York Times PowerSet Veoh Yahoo! 5

Hadoop Goals: Scalable - petabytes (10^15 bytes) of data on thousands of nodes, much larger than RAM or even single-disk capacity. Economical - use commodity components when possible, and lash thousands of them into an effective compute and storage platform. Reliable - in a large enough cluster something is always broken, and engineering reliability into every app is expensive. 6

Sample Applications Data analysis is the core of Internet services. Log Processing Reporting Session Analysis Building dictionaries Click fraud detection Building Search Index Site Rank Machine Learning Automated Pattern-Detection/Filtering Mail spam filter creation Competitive Intelligence What percentage of websites use a given feature? 7

Problem: Bandwidth to Data. Need to process 100TB datasets, i.e. about 100GB per node on a 1000 node cluster. Reading from remote storage (on the LAN), scanning @ 10MB/s takes about 165 min. Reading from local storage, scanning @ 50-200MB/s takes about 8-33 min. Moving computation to the data enables I/O bandwidth scaling: the network is the bottleneck, data size is reduced by the processing, and we need visibility into data placement. 8

Problem: Scaling Reliably is Hard Need to store Petabytes of data On 1000s of nodes, MTBF < 1 day Many components disks, nodes, switches,... Something is always broken Need fault tolerant store Handle hardware faults transparently Provide reasonable availability guarantees 9

Hadoop Distributed File System: A fault tolerant, scalable, distributed storage system, designed to reliably store very large files across machines in a large cluster. Data Model: Data is organized into files and directories. Files are divided into uniform-sized blocks and distributed across cluster nodes. Blocks are replicated to handle hardware failure. Corruption detection and recovery: filesystem-level checksumming. HDFS exposes block placement so that computation can be migrated to the data. 10

HDFS Terminology Namenode Datanode DFS Client Files/Directories Replication Blocks Rack-awareness 11

HDFS Architecture Similar to other NASD-based DFSs Master-Worker architecture HDFS Master Namenode Manages the filesystem namespace Controls read/write access to files Manages block replication Reliability: Namespace checkpointing and journaling HDFS Workers Datanodes Serve read/write requests from clients Perform replication tasks upon instruction by Namenode 12

Interacting with HDFS: A user-level library linked into the application, plus a command line interface. 13
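
As a hedged illustration of the user-level library (not code from the original deck), a minimal Java sketch using the org.apache.hadoop.fs.FileSystem API to list a directory and read a file; the /user/demo paths are made up for the example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the cluster config (fs.default.name)
    FileSystem fs = FileSystem.get(conf);       // handle to the configured filesystem (HDFS)

    // List a directory (hypothetical path).
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }

    // Read a file as text.
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/demo/input.txt"))));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
  }
}

The command line interface offers the equivalent operations, e.g. bin/hadoop fs -ls and bin/hadoop fs -cat.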

Map-Reduce overview Programming abstraction and runtime support for scalable data processing Scalable associative primitive: Distributed GROUP-BY Observations: Distributed resilient apps are hard to write Common application pattern - Large unordered input collection of records - Process each record - Group intermediate results - Process groups Failure is the common case 14

Map-Reduce: The application writer specifies a pair of functions called Map and Reduce and a set of input files. Workflow: FileSplits are generated from the input files, one per Map task. The Map phase executes the user map function, transforming input records into a new set of kv-pairs. The framework shuffles & sorts tuples according to their keys. The Reduce phase combines all kv-pairs with the same key into new kv-pairs. The Output phase writes the resulting pairs to files. All phases are distributed among many tasks; the framework handles scheduling of tasks on the cluster and recovery when a node fails. 15
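
To make the "application writer specifies" part concrete, a minimal job driver in the old org.apache.hadoop.mapred API of that era; it is a generic sketch rather than the deck's own code, using the library's IdentityMapper/IdentityReducer as stand-ins for the user's Map and Reduce functions.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class DriverSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(DriverSketch.class);
    conf.setJobName("example");

    // The pair of user functions (identity classes here, for illustration).
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Types of the output key/value pairs (TextInputFormat produces <LongWritable, Text>).
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Input files and output directory; the framework derives FileSplits from the inputs.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);   // submit the job and wait for completion
  }
}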

Hadoop MR - Terminology Job Task JobTracker TaskTracker JobClient Splits InputFormat/RecordReader 16

Hadoop M-R architecture Map/Reduce Master Job Tracker Accepts Map/Reduce jobs submitted by users Assigns Map and Reduce tasks to Task Trackers Monitors task and Task Tracker status, re-executes tasks upon failure Map/Reduce Slaves Task Trackers Run Map and Reduce tasks upon instruction from the Job Tracker Manage storage and transmission of intermediate output 17

Map/Reduce Dataflow 18

M-R Example: Input: a multi-TB dataset. Record: a vector with 3 float32_t values. Goal: frequency histogram of one of the components. Min and max are unknown, and so are the bucket sizes. 19

M-R Example (cont.): The framework partitions the input into chunks of records. The Map function takes a single record, extracts the desired component v, and emits the tuple (k=v, 1). The framework groups tuples with the same k. The Reduce function receives the list of all tuples for a given k, sums the values (1) over all of them, and emits the tuple (k=v, sum). 20
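
A hedged Java sketch of this example in the old mapred API; the deck shows no code here, and the whitespace-separated record layout and the choice of the first component are assumptions for illustration.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HistogramSketch {
  // Map: extract the desired component v and emit (k=v, 1).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, FloatWritable, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(LongWritable offset, Text record,
                    OutputCollector<FloatWritable, LongWritable> out, Reporter reporter)
        throws IOException {
      String[] fields = record.toString().trim().split("\\s+");  // assumed record layout
      float v = Float.parseFloat(fields[0]);                     // desired component
      out.collect(new FloatWritable(v), ONE);
    }
  }

  // Reduce: sum the 1s for each distinct value of the component.
  public static class Reduce extends MapReduceBase
      implements Reducer<FloatWritable, LongWritable, FloatWritable, LongWritable> {
    public void reduce(FloatWritable v, Iterator<LongWritable> counts,
                       OutputCollector<FloatWritable, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(v, new LongWritable(sum));
    }
  }
}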

M-R features: There's more to it than M-R: Map-Shuffle-Reduce. Custom input parsing and aggregate functions. Input partitioning & task scheduling. System support: co-location of storage & computation, failure isolation & handling. 21

Hadoop Dataflow (Input to Output) [diagram]: InputSplits feed the m Map tasks (M_0..M_m-1); each map's output is divided by the Partition step into r partitions, which are copied, sorted, and merged into the inputs of the r Reduce tasks (R_0..R_r-1), whose outputs O_0..O_r-1 are written out.

Input => InputSplits: Input is specified as a collection of paths (on HDFS). The JobClient specifies an InputFormat, and the InputFormat provides a description of splits. Default: FileSplit. Each split is approximately one DFS block; mapred.min.split.size overrides this. Gzipped files are not split. A split does not cross file boundaries. Number of splits = number of Map tasks. 23

InputSplit => RecordReader [diagram: a split spanning byte 0 to EOF]: The RecordReader defined by the InputFormat (default: TextInputFormat) turns the split's bytes into records (Key, Value). Unless it is the first split, everything before the first separator is ignored, and the reader reads ahead into the next block to complete the last record.

Partitioner: The default partitioner evenly distributes records: hashCode(key) mod NR. The partitioner can be overridden when the Value should also be considered (a single key, but values distributed), or when a partition needs to obey other semantics, e.g. all URLs from a domain should be in the same file. Interface: Partitioner, int getPartition(K, V, nPartitions). 25
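
A hedged sketch of a custom Partitioner in the old mapred API, illustrating the second case above (all URLs from the same domain land in the same partition). The Text key/value types and the getDomain helper are assumptions for the example.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Route every URL from the same domain to the same reduce partition.
public class DomainPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) {}   // no per-job configuration needed

  public int getPartition(Text url, Text value, int numPartitions) {
    String domain = getDomain(url.toString());              // illustrative helper, below
    return (domain.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  private static String getDomain(String url) {
    // Very rough domain extraction, good enough for the sketch.
    String s = url.replaceFirst("^[a-zA-Z]+://", "");
    int slash = s.indexOf('/');
    return slash >= 0 ? s.substring(0, slash) : s;
  }
}

// In the driver: conf.setPartitionerClass(DomainPartitioner.class);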

Producing Fully Sorted Output: By default each reducer gets its input sorted on key, and typically the reducer output order is the same as the input order, so each part file is sorted. How to make sure that keys in part i are all less than keys in part i+1? That gives fully sorted output. 26

Fully sorted output (contd.): Simple solution: use a single reducer. But that is not feasible for large data. Insight: reducer input is already sorted, and the key-to-reducer mapping is determined by the partitioner, so design a partitioner that produces fully sorted reduce input. Hint: histogram equalization + sampling. 27
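
One hedged way to act on that hint (not the deck's own code): sample the input beforehand to pick NR-1 key cut points, pass them to the job, and use a range partitioner so partition i only receives keys up to cut point i. The sort.cutpoints property name is made up for the example; newer Hadoop releases also ship a library TotalOrderPartitioner for this purpose.

import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Range partitioner: keys below the first cut point go to partition 0, and so on,
// so the part files concatenated in order form one fully sorted output.
public class RangePartitioner implements Partitioner<Text, Text> {
  private String[] cutPoints;

  public void configure(JobConf job) {
    // "sort.cutpoints" is a made-up property holding NR-1 sampled keys, comma separated.
    cutPoints = job.get("sort.cutpoints", "").split(",");
    Arrays.sort(cutPoints);
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    int pos = Arrays.binarySearch(cutPoints, key.toString());
    if (pos < 0) {
      pos = -pos - 1;   // insertion point = index of the first larger cut point
    }
    return Math.min(pos, numPartitions - 1);
  }
}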

Streaming: What about non-Java programmers? You can define the Mapper and Reducer using Unix text filters, typically grep, sed, python, or perl scripts. The format for input and output is: key \t value \n. Allows for easy debugging and experimentation, but is slower than Java programs. bin/hadoop jar hadoop-streaming.jar -input in_dir -output out_dir -mapper streamingMapper.sh -reducer streamingReducer.sh. Mapper: sed -e 's/ /\n/g' | grep . Reducer: uniq -c | awk '{print $2 "\t" $1}' 28

Key-Value Separation in Map Output

Secondary Sort

Pipes (C++): A C++ API and library to link the application with. The C++ application is launched as a sub-process. Keys and values are std::string with binary data. The word count map looks like:
class WordCountMap: public HadoopPipes::Mapper {
public:
  WordCountMap(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> words =
      HadoopUtils::splitString(context.getInputValue(), " ");
    for (unsigned int i = 0; i < words.size(); ++i) {
      context.emit(words[i], "1");
    }
  }
};

Pipes (C++): The reducer looks like:
class WordCountReduce: public HadoopPipes::Reducer {
public:
  WordCountReduce(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
}; 32

Pipes (C++): And define a main function to invoke the tasks:
int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
    HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce,
                                 void, WordCountReduce>());
} 33

Deploying Auxiliary Files Command line option: -file auxfile.dat Job submitter adds file to job.jar Unjarred on the task tracker Available as $cwd/auxfile.dat Not suitable for more / larger / frequently used files 34

Using Distributed Cache: Sometimes you need to read side files, such as in.txt: read-only dictionaries (e.g., filtering patterns) or libraries dynamically linked to streaming programs. Tasks themselves can fetch files from HDFS, but not always (e.g., unresolved symbols for dynamically linked libraries), and doing so can be a performance bottleneck. 35

Caching Files Across Tasks: Specify side files via -cacheFile. If a lot of such files are needed, jar them up (.tgz support coming soon), upload them to HDFS, and specify them via -cacheArchive. The TaskTracker downloads these files once, unjars archives, and makes them accessible in the task's cwd before the task even starts. Automatic cleanup upon exit. 36
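
For Java jobs, the same mechanism is exposed programmatically through the DistributedCache class; a hedged sketch follows (the HDFS paths are made up for the example).

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupSketch {
  public static void addSideFiles(JobConf conf) throws Exception {
    // Single side file: fetched once per TaskTracker, available locally to every task.
    DistributedCache.addCacheFile(new URI("/user/demo/dictionary.txt"), conf);

    // Archive: unpacked on the TaskTracker before tasks start.
    DistributedCache.addCacheArchive(new URI("/user/demo/patterns.jar"), conf);
  }
}

// Inside a task, the local copies can be located with:
//   Path[] local = DistributedCache.getLocalCacheFiles(conf);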

How many Maps and Reduces? Maps: usually as many as the number of HDFS blocks being processed; this is the default. Otherwise the number of maps can be specified as a hint, or controlled indirectly by specifying the minimum split size. The actual size of the map inputs is computed as: max(min(block_size, data/#maps), min_split_size). Reduces: unless the amount of data being processed is small, use 0.95 * num_nodes * mapred.tasktracker.tasks.maximum. 37
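
A hedged sketch of how these knobs are set on a JobConf in the old mapred API; the numbers (1000 maps, a 128 MB minimum split, 9 nodes with 2 reduce slots each) are purely illustrative.

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void tune(JobConf conf) {
    // Hint for the number of map tasks; the framework may still adjust it.
    conf.setNumMapTasks(1000);

    // Or raise the minimum split size (in bytes) so fewer, larger maps are created.
    conf.set("mapred.min.split.size", "134217728");   // 128 MB, illustrative

    // Reduces, following the 0.95 * num_nodes * slots-per-node rule of thumb.
    int nodes = 9;                 // illustrative cluster size
    int reduceSlotsPerNode = 2;    // mapred.tasktracker.tasks.maximum on each node
    conf.setNumReduceTasks((int) (0.95 * nodes * reduceSlotsPerNode));
  }
}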

Map Output => Reduce Input: Map output is stored across the local disks of the task trackers, and so is reduce input. Each task tracker machine also runs a Datanode; in our config, the datanode uses up to 85% of the local disks. Large intermediate outputs can fill up local disks and cause failures; uneven partitions can do the same. 38

Performance Analysis of Map-Reduce MR performance requires Maximizing Map input transfer rate Pipelined writes from Reduce Small intermediate output Opportunity to Load Balance 39

Map Input Transfer Rate Input locality HDFS exposes block locations Each map operates on one block Efficient decompression More efficient in Hadoop 0.18 Minimal deserialization overhead Java deserialization is very verbose Use Writable/Text 40

Performance Example Count lines in text files totaling several hundred GB Approach: Identity Mapper (input: text, output: same text) A single Reducer counts the lines and outputs the total What is wrong? This happened, really! 41

Intermediate Output Almost always the most expensive component (M x R) transfers over the network Merging and Sorting How to improve performance: Avoid shuffling/sorting if possible Minimize redundant transfers Compress 42

Avoid shuffling/sorting: Set the number of reducers to zero. Known as map-only computations: filters, projections, transformations. Beware of the number of files generated: each map task produces a part file. To make the maps produce as many output files as there were input files, use the job variable indicating the current file being processed. 43

Minimize Redundant Transfers: Combiners. The goal is to decrease the size of the transient data. When maps produce many repeated keys, it is often useful to do a local aggregation following the map, done by specifying a Combiner. Combiners have the same interface as Reducers, and often are the same class. Combiners must not have side effects, because they run an indeterminate number of times. conf.setCombinerClass(Reduce.class); 44

Compress Output: Compressing the outputs and intermediate data will often yield huge performance gains. Can be specified via a configuration file or set programmatically. Set mapred.output.compress=true to compress job output; set mapred.compress.map.output=true to compress map output. Compression types (mapred.output.compression.type): block - groups of keys and values are compressed together; record - each value is compressed individually. Block compression is almost always best. Compression codecs (mapred.output.compression.codec): default (zlib) - slower, but more compression; LZO - faster, but less compression. 45
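
A hedged sketch of setting these properties programmatically on the JobConf; the property names are the ones listed above, and using generic conf.set calls keeps the sketch independent of any helper classes.

import org.apache.hadoop.mapred.JobConf;

public class CompressionSetupSketch {
  public static void enableCompression(JobConf conf) {
    // Compress the final job output.
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.type", "BLOCK");   // block compression is almost always best

    // Compress the intermediate map output that is shuffled to the reducers.
    conf.setBoolean("mapred.compress.map.output", true);

    // Pick a codec by class name, e.g. the default zlib-based codec.
    conf.set("mapred.output.compression.codec",
             "org.apache.hadoop.io.compress.DefaultCodec");
  }
}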

Opportunity to Load Balance Load imbalance inherent in the application Imbalance in input splits Imbalance in computations Imbalance in partition sizes Load imbalance due to heterogeneous hardware Over time performance degradation Give Hadoop an opportunity to do load-balancing How many nodes should I allocate? 46

Load Balance (contd.): M = total number of simultaneous map tasks = map task slots per tasktracker * number of nodes. Choose the number of nodes such that the total number of map tasks in the job is between 5*M and 10*M. 47

Configuring Task Slots mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum Tradeoffs: Number of cores Amount of memory Number of local disks Amount of local scratch space Number of processes Consider resources consumed by TaskTracker & Datanode processes 48

Speculative execution: The framework can run multiple instances of slow tasks; the output from the instance that finishes first is used. Controlled by the configuration variable mapred.speculative.execution=[true|false]. Can dramatically bring in long tails on jobs. 49

Performance Is your input splittable? Gzipped files are NOT splittable Are partitioners uniform? Buffering sizes (especially io.sort.mb) Do you need to Reduce? Only use singleton reduces for very small data Use Partitioners and cat to get a total order Memory usage Do not load all of your inputs into memory. 50

Debugging & Diagnosis: Run the job with the Local Runner: set mapred.job.tracker to local, which runs the application in a single process and thread. Run the job on a small data set on a 1-node cluster; this can be done on your local dev box. Set keep.failed.task.files to true: this keeps files from failed tasks that can be used for debugging. Use the IsolationRunner to run just the failed task. Java debugging hints: send a kill -QUIT to the Java process to get the call stack, locks held, and deadlocks. 51
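
A hedged sketch of those two settings in code, assuming the job is driven from a JobConf as in the earlier sketches and the test input fits on the local machine.

import org.apache.hadoop.mapred.JobConf;

public class LocalDebugSketch {
  public static void configureForLocalRun(JobConf conf) {
    // Run everything in a single local process and thread (the Local Runner).
    conf.set("mapred.job.tracker", "local");

    // Keep the intermediate files of failed tasks around for post-mortem debugging.
    conf.setBoolean("keep.failed.task.files", true);
  }
}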

Example: Computing Standard Deviation. Takeaway: changing the algorithm to suit the architecture yields the best implementation. σ = sqrt( (1/N) * Σ_{i=1..N} (x_i − x̄)² ) 52

Implementation 1 Two Map-Reduce stages First stage computes Mean Second stage computes std deviation 53

Implementation 1 (contd.): Stage 1: Compute Mean. Map Input: (x_i for i = 1..N_m). Map Output: (N_m, Mean(x_1..x_Nm)). Single Reducer. Reduce Input: Group(Map Output). Reduce Output: Mean(x_1..x_N). 54

Implementation 1 (contd.): Stage 2: Compute Standard Deviation. Map Input: (x_i for i = 1..N_m) & Mean(x_1..x_N). Map Output: Sum((x_i − Mean(x))² for i = 1..N_m). Single Reducer. Reduce Input: Group(Map Output) & N. Reduce Output: Standard Deviation. Problem: two passes over the large input data. 55

Implementation 2: The second definition is algebraically equivalent. Be careful about numerical accuracy, though. σ = sqrt( (1/N) * ( Σ_{i=1..N} x_i² − N·x̄² ) ) 56

Implementation 2 (contd.): Single Map-Reduce stage. Map Input: (x_i for i = 1..N_m). Map Output: (N_m, [Sum(x_1²..x_Nm²), Mean(x_1..x_Nm)]). Single Reducer. Reduce Input: Group(Map Output). Reduce Output: σ. Advantage: only a single pass over the large input.
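
A hedged Java sketch of this single-pass plan in the old mapred API; the deck shows no code for it, and the assumptions here are one float per input line, a single reduce task, and emitting each map's partial sums from close().

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StdDevSketch {
  private static final Text STATS = new Text("stats");   // single key => single reduce group

  // Map: accumulate N_m, Sum(x) and Sum(x^2) for this split, emit once at close().
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private long count = 0;
    private double sum = 0, sumSq = 0;
    private OutputCollector<Text, Text> out;

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      out = output;                                       // remember the collector for close()
      double x = Double.parseDouble(line.toString().trim());
      count++;
      sum += x;
      sumSq += x * x;
    }

    public void close() throws IOException {
      if (out != null) {
        out.collect(STATS, new Text(count + "," + sum + "," + sumSq));
      }
    }
  }

  // Reduce (run as a single task): combine the partial sums and compute sigma.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<Text> partials,
                       OutputCollector<Text, DoubleWritable> output, Reporter reporter)
        throws IOException {
      long n = 0;
      double sum = 0, sumSq = 0;
      while (partials.hasNext()) {
        String[] p = partials.next().toString().split(",");
        n += Long.parseLong(p[0]);
        sum += Double.parseDouble(p[1]);
        sumSq += Double.parseDouble(p[2]);
      }
      double mean = sum / n;
      double sigma = Math.sqrt(sumSq / n - mean * mean);
      output.collect(new Text("sigma"), new DoubleWritable(sigma));
    }
  }
}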

Q&A 58