Big Data Processing using Hadoop Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center
Apache Hadoop Hadoop INRIA S.IBRAHIM 2
Hadoop Hadoop is a top-level Apache project» Open-source implementation of MapReduce» Developed in Java It was advocated by industry's premier Web players Google, Yahoo!, Microsoft, and Facebook as the engine to power the cloud.
Hadoop History
Who Uses Hadoop?» Facebook» IBM: Blue Cloud» Joost» Last.fm» New York Times» Microsoft» Cloudera» Yahoo! The MapReduce programming model and implementations, H. Jin, S. Ibrahim, L. Qi, H. Cao, S. Wu, X. Shi. Cloud Computing: Principles and Paradigms. Wiley, 2011.
Hadoop: Where? Batch data processing, not real-time web search Log processing Document analysis and indexing Web graphs and crawling Highly parallel, data-intensive distributed applications
Public Hadoop Clouds Hadoop MapReduce on Amazon EC2 http://wiki.apache.org/hadoop/amazonec2 IBM Blue Cloud» Partnering with Google to offer web-scale infrastructure Global Cloud Computing Testbed» Joint effort by Yahoo!, HP and Intel http://www.opencloudconsortium.org/testbed.html
Big Data Processing in the Cloud
Source: Wikibon 2013
Batch Big Data Processing Adapted from presentations from: http://wiki.apache.org/hadoop/hadooppresentations
Components Distributed File System» Name: HDFS» Inspired by GFS Distributed Processing Framework» Modelled on the MapReduce abstraction
HDFS Distributed storage system» Files are divided into large blocks (128 MB)» Blocks are distributed across the cluster» Blocks are replicated to help against hardware failure» Data placement is exposed so that computation can be migrated to data Notable differences from mainstream DFS work» Single storage + compute cluster vs. separate clusters» Simple I/O-centric API
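The block model above can be sketched in a few lines: a file is cut into fixed 128 MB blocks, and each block is assigned several DataNodes. This is a toy model, not HDFS source code; the round-robin placement is a simplistic stand-in for HDFS's rack-aware policy, and the node names are invented.

```python
# Toy model of HDFS block splitting and replica placement.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS block size above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of `file_size` bytes occupies
    (the last block may be shorter than block_size)."""
    full_blocks, remainder = divmod(file_size, block_size)
    return full_blocks + (1 if remainder else 0)

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block `replication` distinct DataNodes, round-robin."""
    return {block: [datanodes[(block + r) % len(datanodes)]
                    for r in range(replication)]
            for block in range(num_blocks)}
```

For a 300 MB file this yields three blocks (two full, one short), each stored on three distinct DataNodes, which is what lets the scheduler later migrate computation to any replica.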
HDFS Architecture: NameNode (1) Master-Slave Architecture HDFS Master NameNode» Manages all file system metadata in memory List of files For each file name, a set of blocks For each block, a set of DataNodes File attributes (creation time, replication factor)» Controls read/write access to files» Manages block replication» Transaction log: registers file creation, deletion, etc.
HDFS Architecture: DataNodes (2) HDFS Slaves DataNodes A DataNode is a block server» Stores data in the local file system (e.g. ext3)» Stores metadata of a block (e.g. CRC)» Serves data and metadata to Clients Block Report» Periodically sends a report of all existing blocks to the NameNode Pipelining of Data» Forwards data to other specified DataNodes Performs replication tasks upon instruction by the NameNode Rack-aware
HDFS Architecture (3)
Fault Tolerance in HDFS DataNodes send heartbeats to the NameNode» Once every 3 seconds NameNode uses heartbeats to detect DataNode failures» Chooses new DataNodes for new replicas» Balances disk usage» Balances communication traffic to DataNodes
NameNode Failures A single point of failure Transaction Log stored in multiple directories» A directory on the local file system» A directory on a remote file system (NFS/CIFS) Need to develop a real HA solution
Data Pipelining Client retrieves a list of DataNodes on which to place replicas of a block» Client writes the block to the first DataNode» The first DataNode forwards the data to the next» The second DataNode forwards the data to the next DataNode in the pipeline» When all replicas are written, the client moves on to write the next block of the file
User Interface Commands for the HDFS user:» hadoop dfs -mkdir /foodir» hadoop dfs -cat /foodir/myfile.txt» hadoop dfs -rm /foodir/myfile.txt Commands for the HDFS administrator:» hadoop dfsadmin -report» hadoop dfsadmin -decommission datanodename Web Interface» http://host:port/dfshealth.jsp
Useful links HDFS Design:» http://hadoop.apache.org/core/docs/current/hdfs_design.html Hadoop API:» http://hadoop.apache.org/core/docs/current/api/
Hadoop MapReduce Master-Slave architecture MapReduce Master JobTracker» Accepts MR jobs submitted by users» Assigns Map and Reduce tasks to TaskTrackers» Monitors task and TaskTracker status, re-executes tasks upon failure
Hadoop MapReduce Master-Slave architecture MapReduce Slaves TaskTrackers» Run Map and Reduce tasks upon instruction from the JobTracker Manage storage and transmission of intermediate output» Generic reusable framework supporting pluggable user code (file system, I/O format)
Deployment: HDFS + MR
Hadoop MapReduce Client Define Mapper and Reducer classes and a launching program Language support» Java» C++» Python
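To make the Mapper/Reducer split concrete, here is a minimal WordCount in the Hadoop Streaming style, in Python (one of the supported languages above). This is a sketch of the programming model, not a runnable job: a real streaming job would wrap `mapper` and `reducer` around stdin/stdout in two scripts and submit them with a launching command.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts of each key. The pairs must be
    grouped by key, which the framework's shuffle/sort guarantees."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)
```

The mapper never sees global state and the reducer only sees one key's values at a time, which is what lets the framework parallelize both phases freely.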
Zoom on Map Phase Handling partitioning skew in MapReduce using LEEN, S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, S. Wu. Peer-to-Peer Networking and Applications, 2013.
Zoom on Reduce Phase
Data Locality Data locality is exposed in Map task scheduling Data are replicated for:» Fault tolerance» Performance: divide the work among nodes The JobTracker schedules map tasks considering:» Node-aware» Rack-aware» Non-local map tasks
Fault-tolerance TaskTrackers send heartbeats to the JobTracker» Once every 3 seconds The JobTracker uses heartbeats to detect failures» A node is labeled as failed if no heartbeat is received within a defined expiry time (default: 10 minutes) Re-executes all the ongoing and completed tasks of the failed node Need to develop a more efficient policy to prevent re-executing completed tasks (e.g., storing their output in HDFS)
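The heartbeat bookkeeping above amounts to a timestamp check, sketched below. This is a simplified model, not JobTracker code; the tracker names and timestamps are invented.

```python
HEARTBEAT_INTERVAL = 3     # TaskTrackers report every 3 seconds
EXPIRY_INTERVAL = 10 * 60  # default tracker expiry: 10 minutes

def failed_trackers(last_heartbeat, now):
    """Return the trackers whose last heartbeat (a timestamp in
    seconds) is older than the expiry interval."""
    return [tracker for tracker, ts in last_heartbeat.items()
            if now - ts > EXPIRY_INTERVAL]
```

Once a tracker appears in this list, both its running tasks and its completed map tasks must be re-executed, since the completed tasks' intermediate output lived on the failed node's local disk.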
Speculation in Hadoop Slow nodes (stragglers) → run backup tasks Adapted from a presentation by Matei Zaharia, Improving MapReduce Performance in Heterogeneous Environments, OSDI 2008, San Diego, CA, December 2008.
Life cycle of a Map/Reduce Task
Hadoop ecosystem
Hadoop ecosystem Core: Hadoop MapReduce and HDFS Data Access: HBase, Pig, Hive Algorithms: Mahout Data Import: Flume, Sqoop and Nutch
Hadoop ecosystem Database primitives HBase» Wide-column data store built on HDFS Processing Pig» A high-level data-flow language and execution framework for parallel computation
Hadoop ecosystem Algorithms Mahout» Scalable machine learning and data mining» Runs on top of Hadoop Processing Hive» Data warehouse infrastructure» Support for custom mappers and reducers for more sophisticated analysis
Hadoop ecosystem Data Import Sqoop» Structured data» Import and export between RDBMS and HDFS Data Import Flume» Imports streams» Text files and system logs
Hadoop ecosystem Coordination ZooKeeper» A high-performance coordination service for distributed applications Scheduling Oozie» A workflow scheduler system to manage Apache Hadoop jobs
Job Scheduling in Hadoop
Why Job Scheduling in Hadoop? MapReduce / Hadoop was originally designed for high-throughput batch processing Today's workload is far more diverse:» Many users want to share a cluster Engineering, marketing, business intelligence, etc.» The vast majority of jobs are short Ad-hoc queries, sampling, periodic reports» Response time is critical Interactive queries, deadline-driven reports How can we efficiently share MapReduce clusters between users?
Job Scheduling in Hadoop Assigning jobs to a Hadoop cluster Considerations» Job priority (e.g., deadline)» Capacity: cluster resources available vs. resources needed for the job
Job Scheduling in Hadoop FIFO» The first job to arrive at the JobTracker is processed first Capacity Scheduling» Organizes jobs into queues» Queue shares as percentages of the cluster Fair Scheduler» Groups jobs into pools» Divides excess capacity evenly between pools» Delay scheduling to exploit data locality
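The Fair Scheduler's "divide excess capacity evenly between pools" rule is essentially max-min fairness, which a short sketch can illustrate. The pool names, demands, and capacity below are invented, and the real scheduler also handles weights and minimum shares; this is only the core idea.

```python
def fair_shares(demands, capacity):
    """Max-min fair allocation: satisfy small demands fully and split
    the leftover capacity evenly among the still-unsatisfied pools."""
    shares = {pool: 0.0 for pool in demands}
    active = [pool for pool in demands if demands[pool] > 0]
    remaining = capacity
    while active and remaining > 1e-9:
        per_pool = remaining / len(active)
        # pools whose residual demand fits within an even split
        satisfied = [p for p in active if demands[p] - shares[p] <= per_pool]
        if not satisfied:
            for p in active:          # split evenly, everyone capped
                shares[p] += per_pool
            remaining = 0
        else:
            for p in satisfied:       # give them exactly their demand
                remaining -= demands[p] - shares[p]
                shares[p] = demands[p]
                active.remove(p)
    return shares
```

With demands of 10, 2 and 10 slots and a 12-slot cluster, the small pool gets its full 2 slots and the two big pools split the remaining 10 evenly, 5 each.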
Hadoop Optimizations
Open Issues Heavy metadata access All the metadata is handled by one single master in HDFS (the NameNode) It performs badly when» Handling many files» Under heavy concurrency
The BlobSeer Approach to Concurrency-Optimized Data Management BlobSeer: a software platform for scalable, distributed BLOB management» Decentralized data storage» Decentralized metadata management» Versioning-based concurrency control» Lock-free concurrent writes (enabled by versioning) A back-end for higher-level data management systems» Short term: highly scalable distributed file systems» Middle term: storage for cloud services» Long term: extremely large distributed databases http://blobseer.gforge.inria.fr/
Highlight: BlobSeer Does Better Than Hadoop! [charts: distributed grep and sort, original Hadoop vs. Hadoop+BlobSeer] Execution time reduced by up to 38% MapReduce: a natural application class for BlobSeer. Journal of Parallel and Distributed Computing (2011), IPDPS 2010, MAPREDUCE 2010 (HPDC)
Open Issues Data locality Data locality is crucial for Hadoop's performance How can we expose the data locality of Hadoop in the Cloud efficiently? Hadoop in the Cloud» Unaware of network topology» Only node-aware or non-local map tasks
Open Issues Data locality Data locality in the Cloud [diagram: twelve blocks replicated across six nodes, three of them empty] The simplicity of map task scheduling leads to non-local map execution (25%)
Side impacts: Increased execution time Increased number of useless speculative tasks Slot occupation» Imbalance in map execution among nodes
Maestro: Replica-Aware Scheduling in Hadoop Schedule the map tasks in two waves: First wave: fills the empty slots of each data node based on the number of hosted map tasks and on the replication scheme for their input data Second wave: runtime scheduling takes into account the probability of scheduling a map task on a given machine depending on the replicas of the task's input data Results: Maestro can achieve optimal data locality even if data are not uniformly distributed among nodes, and improves the performance of Hadoop applications
Results (Sort application, fraction of local maps) Virtual cluster: Native Hadoop 83%, Maestro 96% Grid'5000: Native Hadoop 84%, Maestro 97% S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, S. Wu, Maestro: Replica-aware map scheduling for MapReduce, in: CCGrid 2012.
Open Issues Data Skew Data Skew Hadoop's current hash partitioning works well when the keys appear with equal frequency and are uniformly stored across the data nodes In the presence of partitioning skew:» Variation in intermediate key frequencies» Variation in intermediate key distribution amongst different data nodes The native blind hash partitioning is inadequate and will lead to:» Network congestion» Unfairness in reducers' inputs → reduce computation skew» Performance degradation
Open Issues Data Skew [example: keys K1-K6 spread unevenly across three data nodes, partitioned by hash(HashCode(intermediate key) modulo ReduceID)] Total data transfer per node: 11, 15, 18 (44/54 overall) Reduce inputs: 29, 17, 8 (cv 58%)
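The default partitioning can be sketched as reducer = hash(key) mod numReducers (Hadoop's HashPartitioner does key.hashCode() % numReduceTasks). The sketch below uses invented integer key ids in place of K1-K6; the point is that with skewed key frequencies the reducers receive very different input sizes, no matter how good the hash is.

```python
from collections import Counter

def partition(key, num_reducers):
    """Pick the reducer for an intermediate key (hash partitioning)."""
    return hash(key) % num_reducers

def reducer_loads(key_frequencies, num_reducers):
    """Total records each reducer receives, given per-key frequencies.
    Skewed frequencies produce skewed loads."""
    loads = Counter()
    for key, freq in key_frequencies.items():
        loads[partition(key, num_reducers)] += freq
    return loads
```

With key totals of 16, 9, 4, 13, 8 and 4 records (keys 1-6) and three reducers, the loads come out as 29, 17 and 8: the partitioner balances the number of keys per reducer, not the number of records behind each key.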
Example: WordCount 6-node cluster, 2 GB data set, combine function disabled Transferred data is relatively large: 83% of the map output is transferred Data distribution is imbalanced: max-min ratio 20%, cv 42% [chart: map output vs. reduce input sizes (MB) for DataNode01-DataNode06, split into local data, transferred data, and data during failed reduces]
LEEN Partitioning Algorithm Extends the locality-aware concept to the reduce tasks Considers fair distribution of reducers' inputs Results:» Balanced distribution of reducers' inputs» Minimized data transfer during the shuffle phase» Improved response time Close-to-optimal tradeoff between data locality and reducers' input fairness S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, L. Qi, LEEN: Locality/fairness-aware key partitioning for MapReduce in the cloud, in: CloudCom 2010.
LEEN details (Example) Key frequencies per node: K1 K2 K3 K4 K5 K6 Node1: 3 5 4 4 1 1 (18) Node2: 9 1 0 3 2 3 (18) Node3: 4 3 0 6 5 0 (18) Total: 16 9 4 13 8 4 FLK: 4.66 2.93 1.88 2.70 2.71 1.66 Data transfer = 24/54 For K1: if assigned to Node2 → node loads 15, 25, 14 → Fairness-Score = 4.9; if assigned to Node3 → node loads 15, 9, 30 → Fairness-Score = 8.8 [diagram: resulting placement of keys K1-K6 across the three data nodes]
Results [chart: locality ranges of 1-85%, 15-35%, 1-50% and 1-30% across four configurations] S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, L. Qi, LEEN: Locality/fairness-aware key partitioning for MapReduce in the cloud, in: CloudCom 2010.
Open Issues Energy Efficiency Data-centers consume large amounts of energy» High monetary cost Power bills become a substantial part of the total cost of ownership (TCO) of a data-center (40% of the total cost)» High carbon emissions
Why energy-efficient Hadoop? Hadoop is THE engine for Big Data processing in data-centers Power bills become a substantial part of the total cost of ownership (TCO) of data-centers It is essential to explore and optimize energy efficiency when running Big Data applications in Hadoop
Energy-efficiency in Hadoop
DVFS in Hadoop? Diversity of MapReduce applications: for some, CPU load is high (98%) during almost 75% of the job's runtime; for others, CPU load is high (80%) during only 15% of the job's runtime Multiple phases of a MapReduce application: disk I/O, CPU, disk I/O, network There is significant potential for energy saving by scaling down the CPU frequency when peak CPU is not needed
DVFS governors Performance: static; highest available frequency; maximizes performance; high power consumption Powersave: static; lowest available frequency; maximizes power savings; long response time OnDemand: dynamic; adjusts to the highest available frequency when the load is high and gradually degrades the CPU frequency when the load is low; power efficiency with reasonable performance; low performance/power-saving benefits when the system often switches between idle states and heavy load Conservative: dynamic; gradually upgrades the CPU frequency when the load is high and gradually degrades it when the load is low; worse performance than OnDemand Userspace: support for user-defined frequencies
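The ondemand/conservative distinction above can be sketched as toy policies: ondemand jumps straight to the highest frequency under load, conservative steps one rung at a time. The frequency ladder and thresholds are invented; the real governors live in the Linux cpufreq subsystem and work from sampled per-CPU load.

```python
FREQS = [1.2, 1.8, 2.4, 3.0]  # available CPU frequencies in GHz (invented)

def ondemand(current, load, up_threshold=0.8):
    """Jump to the highest frequency under load, step down otherwise."""
    if load > up_threshold:
        return FREQS[-1]                 # straight to the top
    idx = FREQS.index(current)
    return FREQS[max(idx - 1, 0)]        # gradually degrade when idle

def conservative(current, load, up=0.8, down=0.2):
    """Step the frequency up or down one rung at a time."""
    idx = FREQS.index(current)
    if load > up:
        return FREQS[min(idx + 1, len(FREQS) - 1)]
    if load < down:
        return FREQS[max(idx - 1, 0)]
    return current
```

Starting from the lowest frequency under a heavy load, ondemand reaches 3.0 GHz in one step while conservative first moves to 1.8 GHz, which is exactly why conservative performs worse on bursty MapReduce phases.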
Power Consumption: 40% (Pi application) Towards Efficient Power Management in MapReduce: Investigation of CPU-Frequencies Scaling on Power Efficiency in Hadoop, S. Ibrahim et al., ARMS-CC 2014.
Performance vs Energy (Pi application) Towards Efficient Power Management in MapReduce: Investigation of CPU-Frequencies Scaling on Power Efficiency in Hadoop, S. Ibrahim et al., ARMS-CC 2014.
Open Issue: Hadoop in a Virtualized Environment The applicability of MapReduce on the virtualized Cloud [diagram: MapReduce (simplicity, fault tolerance, locality optimization, load balancing, scalability, various implementations), Cloud computing (reliability and robustness, dynamic scaling, green data centers, pay-as-you-go with no lock-in, dynamic hardware and software maintenance, security), Virtual machines (flexible migration and restart capabilities, abstraction layer for hardware/software/configuration, resource abstraction, performance isolation, checkpointing/migration, power saving, optimized resource utilization)]
Physical vs Virtual Hadoop Cluster HDFS on physical machines performs better than on VMs Separate the permanent data storage (DFS) from the virtual storage associated with a VM Evaluating MapReduce on virtual machines: The Hadoop case, S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu, X. Shi. Cloud Computing, 2009.
Multiple VM Deployment Deploying more VMs can improve performance, but VMs within the same physical node compete for the node's I/O, causing poor performance Evaluating MapReduce on virtual machines: The Hadoop case, S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu, X. Shi. Cloud Computing, 2009.
CLOUDLET CLOUDLET: towards MapReduce implementation on virtual machines, S. Ibrahim, H. Jin, B. Cheng, H. Cao, S. Wu, L. Qi. HPDC 2009.
Evaluation Three-node cluster Achieved up to 9% improvement on a very small cluster (WordCount) Achieved up to 50% improvement on a very small cluster (Sort) [chart: elapsed time (sec) of Native vs. Separate deployments for 512 MB, 1 GB and 2 GB inputs on 1 VM]
Meta Disk I/O Scheduler Different disk schedulers in Xen:» CFQ, Anticipatory, Deadline, Noop» Scheduling at two levels (VMM, VM) Selecting the appropriate pair of schedulers can significantly affect application performance MapReduce is used to run different applications A typical MapReduce application consists of different interleaving stages, each requiring different I/O workloads and patterns Is there an optimal pair of schedulers for MapReduce applications?
Motivation WordCount: (A, C) → 4.5% WordCount w/o combiner: (A, N)/(D, N) → 6% Sort: (A, A) → 9% Any given pair of disk schedulers is sub-optimal across different MapReduce applications Adaptive disk I/O scheduling for MapReduce in virtualized environment, S. Ibrahim, H. Jin, L. Lu, B. He, S. Wu. ICPP 2011.
Motivation A typical Hadoop application consists of different interleaving stages, each requiring different I/O workloads and patterns The scheduler pairs are also sub-optimal for the different sub-phases of a whole job
Solution: Meta Scheduler A novel approach for adaptively tuning the disk scheduler pairs in both the hypervisor and the virtual machines during the execution of a single MapReduce job» First, manually divide the Hadoop program into phases based on the characteristics of Hadoop applications» With several phases in an application and several possible pair schedulers, there are many unique solutions, where a solution is an assignment of a disk pair scheduler to each phase» There is a penalty for the switching overhead A novel heuristic searches the space of solutions:» It executes a solution and evaluates its performance score, including the switch cost; based on the evaluation, it selects the next solution to evaluate Adaptive disk I/O scheduling for MapReduce in virtualized environment, S. Ibrahim, H. Jin, L. Lu, B. He, S. Wu. ICPP 2011.
Evaluation Adaptive disk I/O scheduling for MapReduce in virtualized environment, S. Ibrahim, H. Jin, L. Lu, B. He, S. Wu. ICPP 2011.
Acknowledgments Dhruba Borthakur (Yahoo!) Vinod Kumar Vavilapalli (Hortonworks) Apache Hadoop Community
Thank you! Shadi Ibrahim shadi.ibrahim@inria.fr
YARN: Next Generation Hadoop Adapted from a presentation by Vinod Kumar Vavilapalli (Hortonworks), Hadoop YARN, SF Hadoop Users Meetup
Hadoop MapReduce Classic JobTracker» Manages cluster resources and job scheduling TaskTracker» Per-node agent» Manages tasks
Current Limitations Scalability» Maximum cluster size: 4,000 nodes» Maximum concurrent tasks: 40,000» Coarse synchronization in the JobTracker Single point of failure» Failure kills all queued and running jobs» Jobs need to be re-submitted by users» Restart is very tricky due to complex state
Current Limitations - 2 Hard partition of resources into map and reduce slots» Low resource utilization Lacks support for alternate paradigms» Iterative applications implemented using MapReduce are 10x slower» Hacks for the likes of MPI/graph processing Lack of wire-compatible protocols» Client and cluster must be of the same version» Applications and workflows cannot migrate to different clusters
Requirements Reliability Availability Utilization Wire Compatibility Agility & Evolution: ability for customers to control upgrades to the grid software stack Scalability: clusters of 6,000-10,000 machines» Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks» 100,000+ concurrent tasks» 10,000 concurrent jobs
Design Centre Split up the two major functions of the JobTracker» Cluster resource management» Application life-cycle management MapReduce becomes a user-land library
Architecture Application» An application is a job submitted to the framework» Example: a MapReduce job Container» Basic unit of allocation» Example: container A = 2 GB, 1 CPU» Replaces the fixed map/reduce slots
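A container request of the "2 GB, 1 CPU" sort above can be pictured with a greedy first-fit allocator. This is a deliberately naive stand-in for the Resource Manager's scheduler (which is queue-, fairness- and locality-aware); the node names and capacities are invented.

```python
def allocate(requests, nodes):
    """Greedy first-fit: place each (mem, cores) container request on
    the first node with enough free capacity. Returns {request: node};
    requests that fit nowhere are left out."""
    free = {name: dict(capacity) for name, capacity in nodes.items()}
    placement = {}
    for req_id, (mem, cores) in requests.items():
        for name, cap in free.items():
            if cap["mem"] >= mem and cap["cores"] >= cores:
                cap["mem"] -= mem      # reserve the resources
                cap["cores"] -= cores
                placement[req_id] = name
                break
    return placement
```

Because containers are sized individually instead of being fixed map or reduce slots, a node can hold any mix of small and large containers, which is the utilization win the slide describes.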
Architecture Resource Manager» Global resource scheduler» Hierarchical queues Node Manager» Per-machine agent» Manages the life-cycle of containers» Container resource monitoring Application Master» Per-application» Manages application scheduling and task execution» E.g. the MapReduce Application Master
Architecture
Improvements vs classic MapReduce Utilization» Generic resource model: data locality (node, rack, etc.), memory, CPU, disk b/w, network b/w» Removes the fixed partition of map and reduce slots Scalability» Application life-cycle management is very expensive» Partitions resource management and application life-cycle management» Application management is distributed» Hardware trends: currently run clusters of 4,000 machines; 6,000 2012-era machines > 12,000 2009-era machines (<16+ cores, 48/96G, 24TB> vs. <8 cores, 16G, 4TB>)
Improvements vs classic MapReduce Fault Tolerance and Availability Resource Manager» No single point of failure: state saved in ZooKeeper (coming soon)» Application Masters are restarted automatically on RM restart Application Master» Optional failover via application-specific checkpoint» MapReduce applications pick up where they left off via state saved in HDFS Wire Compatibility» Protocols are wire-compatible» Old clients can talk to new servers» Rolling upgrades
Improvements vs classic MapReduce Innovation and Agility» MapReduce now becomes a user-land library» Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)» Faster deployment cycles for improvements» Customers upgrade MapReduce versions on their own schedule» Users can customize MapReduce
Improvements vs classic MapReduce Support for programming paradigms other than MapReduce» MPI» Master-Worker» Machine learning» Iterative processing» Enabled by allowing the use of paradigm-specific application masters» Run all on the same Hadoop cluster
Any Performance Gains? Significant gains across the board, 2x in some cases MapReduce» Unlocks lots of improvements from the Terasort record (Owen/Arun, 2009)» Shuffle 30%+» Small jobs: Uber AM» Re-use of task slots (containers)
Thank you! Shadi Ibrahim shadi.ibrahim@inria.fr