Big Data Processing using Hadoop. Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center


2 Apache Hadoop
Hadoop INRIA S.IBRAHIM

3 Hadoop
Hadoop is a top-level Apache project
  » Open source implementation of MapReduce
  » Developed in Java
It was advocated by the industry's premier Web players (Google, Yahoo!, Microsoft, and Facebook) as the engine to power the cloud.

4 Hadoop History

5 Who Uses Hadoop?
  » Facebook
  » IBM: Blue Cloud
  » Joost
  » Last.fm
  » New York Times
  » Microsoft
  » Cloudera
  » Yahoo!
The MapReduce programming model and implementations, H Jin, S Ibrahim, L Qi, H Cao, S Wu, X Shi. Cloud Computing: Principles and Paradigms. Wiley, 2011

6 Hadoop: Where?
Batch data processing, not real-time web search
Log processing
Document analysis and indexing
Web graphs and crawling
Highly parallel, data-intensive distributed applications

7 Public Hadoop Clouds
Hadoop MapReduce on Amazon EC2: wiki.apache.org/hadoop/amazonec2
IBM Blue Cloud
  » Partnering with Google to offer web-scale infrastructure
Global Cloud Computing Testbed
  » Joint effort by Yahoo, HP and Intel

8 Big Data Processing in the Cloud

9 Source: Wikibon 2013

10 Batch Big Data Processing
Adapted from Presentations from:

11 Components
Distributed File System
  » Name: HDFS
  » Inspired by GFS
Distributed Processing Framework
  » Modelled on the MapReduce abstraction

12 HDFS
Distributed storage system
  » Files are divided into large blocks (128 MB)
  » Blocks are distributed across the cluster
  » Blocks are replicated to protect against hardware failure
  » Data placement is exposed so that computation can be migrated to data
Notable differences from mainstream DFS work
  » Single storage + compute cluster vs. separate clusters
  » Simple I/O-centric API
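
The block arithmetic on this slide can be sketched in a few lines. This is only an illustration, not HDFS code: the 128 MB block size comes from the slide, while the replication factor of 3, the 1000 MB file, and the function names are assumptions made for the example.

```python
import math

BLOCK_SIZE_MB = 128      # block size quoted on the slide
REPLICATION_FACTOR = 3   # assumed for illustration; the slide gives no value


def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Raw cluster capacity used once every block has been replicated."""
    return file_size_mb * replication


if __name__ == "__main__":
    size_mb = 1000  # hypothetical file size
    print(num_blocks(size_mb), "blocks,", raw_storage_mb(size_mb), "MB of raw storage")
```

The last block of a file is usually only partly filled, which is why the raw storage is computed from the file size rather than from the block count.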

13 HDFS Architecture: NameNode (1)
Master-Slave architecture
HDFS Master: NameNode
  » Manages all file system metadata in memory
      List of files
      For each file name, a set of blocks
      For each block, a set of DataNodes
      File attributes (creation time, replication factor)
  » Controls read/write access to files
  » Manages block replication
  » Transaction log: registers file creation, deletion, etc.

14 HDFS Architecture: DataNodes (2)
HDFS Slaves: DataNodes
A DataNode is a block server
  » Stores data in the local file system (e.g. ext3)
  » Stores metadata of a block (e.g. CRC)
  » Serves data and metadata to clients
Block report
  » Periodically sends a report of all existing blocks to the NameNode
Pipelining of data
  » Forwards data to other specified DataNodes
Performs replication tasks upon instruction by the NameNode
Rack-aware

15 HDFS Architecture (3)

16 Fault Tolerance in HDFS
DataNodes send heartbeats to the NameNode
  » Once every 3 seconds
NameNode uses heartbeats to detect DataNode failures
  » Chooses new DataNodes for new replicas
  » Balances disk usage
  » Balances communication traffic to DataNodes
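
Heartbeat-based failure detection can be sketched as a table of last-seen timestamps. This is a simplified model rather than NameNode code: the 3-second interval is from the slide, whereas the 30-second expiry, the class name, and the node names are hypothetical values chosen for the example.

```python
HEARTBEAT_INTERVAL_S = 3   # interval quoted on the slide
EXPIRY_S = 30              # hypothetical timeout, chosen only for this sketch


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went silent."""

    def __init__(self, expiry_s=EXPIRY_S):
        self.expiry_s = expiry_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record a heartbeat from `node` at time `now` (in seconds)."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Nodes whose last heartbeat is older than the expiry time."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.expiry_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    monitor.heartbeat("dn1", now=0)
    monitor.heartbeat("dn2", now=0)
    monitor.heartbeat("dn1", now=27)   # dn2 has stopped reporting
    print(monitor.failed_nodes(now=31))
```

Once a node is reported as failed, the master would pick new DataNodes for the replicas that were lost, as listed on the slide.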

17 NameNode Failures
A single point of failure
Transaction log stored in multiple directories
  » A directory on the local file system
  » A directory on a remote file system (NFS/CIFS)
Need to develop a real HA solution

18 Data Pipelining
Client retrieves a list of DataNodes on which to place replicas of a block
  » Client writes the block to the first DataNode
  » The first DataNode forwards the data to the next DataNode
  » The second DataNode forwards the data to the next DataNode in the pipeline
  » When all replicas are written, the client moves on to write the next block in the file
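
The pipelined write described above can be simulated as a chain of hops. The sketch below is a toy model with hypothetical names (`pipeline_write`, `dn1`, `blk_1`); it ignores packets, acknowledgements, and failures, all of which a real implementation has to handle.

```python
def pipeline_write(block_id, pipeline, storage):
    """Simulate writing one block through a replica pipeline.

    `pipeline` is the ordered list of DataNodes chosen for the block and
    `storage` maps each DataNode name to the list of blocks it holds.
    Returns the (sender, receiver) hops in the order they occur.
    """
    hops = []
    sender = "client"
    for datanode in pipeline:
        storage.setdefault(datanode, []).append(block_id)  # replica stored
        hops.append((sender, datanode))
        sender = datanode  # this DataNode forwards the data to the next one
    return hops


if __name__ == "__main__":
    storage = {}
    for hop in pipeline_write("blk_1", ["dn1", "dn2", "dn3"], storage):
        print(" -> ".join(hop))
```

The client only ever sends the data once; every further replica is produced by a DataNode forwarding to its successor.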

19 User Interface
Commands for HDFS users:
  » hadoop dfs -mkdir /foodir
  » hadoop dfs -cat /foodir/myfile.txt
  » hadoop dfs -rm /foodir/myfile.txt
Commands for HDFS administrators:
  » hadoop dfsadmin -report
  » hadoop dfsadmin -decommission datanodename
Web Interface

20 Useful links
HDFS Design:
  » hdfs_design.html
Hadoop API:

21 Hadoop MapReduce
Master-Slave architecture
MapReduce Master: JobTracker
  » Accepts MR jobs submitted by users
  » Assigns Map and Reduce tasks to TaskTrackers
  » Monitors task and TaskTracker status, re-executes tasks upon failure

22 Hadoop MapReduce
Master-Slave architecture
MapReduce Slaves: TaskTrackers
  » Run Map and Reduce tasks upon instruction from the JobTracker
  » Manage storage and transmission of intermediate output
Generic reusable framework supporting pluggable user code (file system, I/O format)

23 Deployment: HDFS + MR

24 Hadoop MapReduce Client
Define Mapper and Reducer classes and a launching program
Language support
  » Java
  » C++
  » Python
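
To make the mapper/reducer/launcher split concrete, here is a minimal word-count sketch. It runs entirely in one Python process and does not use the Hadoop API: `run_job` is a hypothetical stand-in for the launching program, and the grouping step imitates what the framework's shuffle does between the two phases.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the launching program: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)      # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["Hadoop runs MapReduce", "hadoop stores data"]))
```

On a real cluster the mapper and reducer would run as separate tasks on different nodes, with the framework handling the grouping and data movement.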

25 Zoom on Map Phase
Handling partitioning skew in MapReduce using LEEN, S Ibrahim, H Jin, L Lu, B He, G Antoniu, S Wu, Peer-to-Peer Networking and Applications, 2013

26 Zoom on Reduce Phase

27 Data Locality
Data locality is exposed in map task scheduling
Data are replicated:
  » Fault tolerance
  » Performance: divide the work among nodes
JobTracker schedules map tasks considering:
  » Node-aware
  » Rack-aware
  » Non-local map tasks
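
The three locality levels can be expressed as a simple preference order. The sketch below is an assumption-laden illustration, not the JobTracker's scheduler: the function names, the node-to-rack map, and the tie-break by task name are all hypothetical.

```python
NODE_LOCAL, RACK_LOCAL, NON_LOCAL = 0, 1, 2


def locality_level(replica_nodes, node, rack_of):
    """Classify how close `node` is to the replicas of a task's input."""
    if node in replica_nodes:
        return NODE_LOCAL
    if rack_of[node] in {rack_of[replica] for replica in replica_nodes}:
        return RACK_LOCAL
    return NON_LOCAL


def pick_task(pending, node, rack_of):
    """Choose the pending map task with the best locality for a free node.

    `pending` maps each task name to the nodes holding its input replicas.
    Ties are broken by task name so that the choice is deterministic.
    """
    if not pending:
        return None
    return min(
        pending,
        key=lambda task: (locality_level(pending[task], node, rack_of), task),
    )


if __name__ == "__main__":
    rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    pending = {"m1": ["n3"], "m2": ["n2"], "m3": ["n1"]}
    print(pick_task(pending, "n1", rack_of))  # node-local choice
    print(pick_task(pending, "n4", rack_of))  # rack-local choice
```

A non-local map task is only selected when no node-local or rack-local task is pending for the requesting node.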

28 Fault-tolerance
TaskTrackers send heartbeats to the JobTracker
  » Once every 3 seconds
JobTracker uses heartbeats to detect TaskTracker failures
  » A node is labeled as failed if no heartbeat is received within a defined expiry time (default: 10 minutes)
Re-execute all the ongoing and completed tasks of the failed node
Need to develop a more efficient policy to prevent re-executing completed tasks (storing this data in HDFS)

29 Speculation in Hadoop
Slow nodes (stragglers) → run backup tasks
[Figure: Node 1, Node 2]
Adapted from a presentation by Matei Zaharia: Improving MapReduce Performance in Heterogeneous Environments, OSDI 2008, San Diego, CA, December

30 Life cycle of Map/Reduce Task

31 Hadoop ecosystem

32 Hadoop ecosystem
Core: Hadoop MapReduce and HDFS
Data Access: HBase, Pig, Hive
Algorithms: Mahout
Data Import: Flume, Sqoop and Nutch

33 Hadoop ecosystem
Database primitives: HBase
  » Wide-column data structure built on HDFS
Processing: Pig
  » A high-level data-flow language and execution framework for parallel computation

34 Hadoop ecosystem
Algorithm: Mahout
  » Scalable machine learning and data mining
  » Runs on top of Hadoop
Processing: Hive
  » Data warehouse infrastructure
  » Support for custom mappers and reducers for more sophisticated analysis

35 Hadoop ecosystem
Data Import: Sqoop
  » Structured data
  » Import and export from RDBMS to HDFS
Data Import: Flume
  » Import streams
  » Text files and system logs

36 Hadoop ecosystem
Coordination: ZooKeeper
  » A high-performance coordination service for distributed applications
Scheduling: Oozie
  » A workflow scheduler system to manage Apache Hadoop jobs

37 Job Scheduling in Hadoop

38 Why Job Scheduling in Hadoop?
MapReduce / Hadoop originally designed for high-throughput batch processing
Today's workload is far more diverse:
  » Many users want to share a cluster: engineering, marketing, business intelligence, etc.
  » Vast majority of jobs are short: ad-hoc queries, sampling, periodic reports
  » Response time is critical: interactive queries, deadline-driven reports
How can we efficiently share MapReduce clusters between users?

39 Job Scheduling in Hadoop
Assigning jobs to a Hadoop cluster
Considerations
  » Job priority (e.g., deadline)
  » Capacity: cluster resources available vs. resources needed for the job

40 Job Scheduling in Hadoop
FIFO
  » The first job to arrive at a JobTracker is processed first
Capacity Scheduling
  » Organizes jobs into queues
  » Queue shares as percentages of the cluster
Fair Scheduler
  » Groups jobs into pools
  » Divides excess capacity evenly between pools
  » Delay scheduling to expose data locality
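
The idea of dividing capacity evenly between pools and handing unused capacity to pools that still have demand can be sketched as a max-min style allocation. This is an illustrative model only, not the Hadoop Fair Scheduler's implementation; the function name, the pool names, and the slot counts are hypothetical.

```python
def fair_shares(total_slots, demands):
    """Split `total_slots` among pools, never giving a pool more than it demands.

    `demands` maps pool name to the number of slots the pool could use.
    Capacity left over by small pools is redistributed evenly to the others.
    """
    shares = {pool: 0 for pool in demands}
    remaining = total_slots
    active = {pool for pool, demand in demands.items() if demand > 0}
    while remaining > 0 and active:
        per_pool = max(remaining // len(active), 1)
        for pool in sorted(active):
            if remaining == 0:
                break
            grant = min(per_pool, demands[pool] - shares[pool], remaining)
            shares[pool] += grant
            remaining -= grant
        # Pools whose demand is satisfied drop out of the next round.
        active = {pool for pool in active if shares[pool] < demands[pool]}
    return shares


if __name__ == "__main__":
    print(fair_shares(12, {"adhoc": 2, "etl": 10, "reports": 10}))
```

In the example, the small pool only needs 2 of its 4 evenly divided slots, so the 2 spare slots are split between the two larger pools.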

41 Hadoop Optimizations

42 Open Issues: Heavy access
All metadata are handled by one single master in HDFS (the NameNode)
Performs poorly when
  » Handling many files
  » Under heavy concurrency

43 The BlobSeer Approach to Concurrency-Optimized Data Management
BlobSeer: software platform for scalable, distributed BLOB management
  » Decentralized data storage
  » Decentralized metadata management
  » Versioning-based concurrency control
  » Lock-free concurrent writes (enabled by versioning)
A back-end for higher-level data management systems
  » Short term: highly scalable distributed file systems
  » Middle term: storage for cloud services
  » Long term: extremely large distributed databases

44 Highlight: BlobSeer Does Better Than Hadoop!
[Figure: Original Hadoop vs. Hadoop+BlobSeer for distributed grep and sort]
Execution time reduced by up to 38%
MapReduce: a natural application class for BlobSeer
Journal of Parallel and Distributed Computing (2011), IPDPS 2010, MAPREDUCE 2010 (HPDC)

45 Open Issues: Data locality
Data locality is crucial for Hadoop's performance
How can we expose data locality of Hadoop in the Cloud efficiently?
Hadoop in the Cloud
  » Unaware of network topology
  » Node-aware or non-local map tasks

46 Open Issues: Data locality
Data locality in the Cloud
[Figure: Node1, Node2, Node3, Node4, Node5, Node; three nodes marked as empty]
The simplicity of map task scheduling leads to non-local map execution (25%)

47 Side impacts:
Increases the execution time
Increases the number of useless speculative tasks
Slot occupying
  » Imbalance in map execution among nodes

48 Maestro: Replica-Aware Scheduling in Hadoop
Schedule the map tasks in two waves:
  » First wave: fills the empty slots of each data node based on the number of hosted map tasks and on the replication scheme for their input data
  » Second wave: runtime scheduling takes into account the probability of scheduling a map task on a given machine depending on the replicas of the task's input data
Results: Maestro can achieve optimal data locality even if data are not uniformly distributed among nodes, and improves the performance of Hadoop applications

49 Results
Sort application, local maps (Virtual Cluster, Grid5000): Native Hadoop 83%, Maestro 96%; Native Hadoop 84%, Maestro 97%
S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, S. Wu, Maestro: Replica-aware map scheduling for MapReduce, in: CCGrid 2012

50 Open Issues: Data Skew
Hadoop's current hash partitioning works well when keys appear equally often and are uniformly stored across the data nodes
In the presence of partitioning skew:
  » Variation in intermediate key frequencies
  » Variation in the distribution of intermediate keys among different data nodes
Native blind hash partitioning is inadequate and leads to:
  » Network congestion
  » Unfairness in reducers' inputs → reduce computation skew
  » Performance degradation
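
The effect of blind hash partitioning on reducer inputs can be measured with a few helper functions. This is a sketch under stated assumptions: CRC32 is used as a deterministic stand-in for the key's hash code (Hadoop's default partitioner uses the Java `hashCode` of the key modulo the number of reduce tasks), and the key frequencies in the example are hypothetical rather than taken from the slides.

```python
import statistics
import zlib


def partition(key, num_reducers):
    """Assign a key to a reducer by hashing it and taking the modulo."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(key_counts, num_reducers):
    """Total number of records each reducer receives."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[partition(key, num_reducers)] += count
    return loads


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0 means balanced)."""
    mean = statistics.mean(values)
    return statistics.pstdev(values) / mean if mean else 0.0


if __name__ == "__main__":
    # Hypothetical skewed key frequencies: one key dominates the data set.
    key_counts = {"K1": 30, "K2": 8, "K3": 6, "K4": 5, "K5": 3, "K6": 2}
    loads = reducer_loads(key_counts, num_reducers=3)
    print(loads, round(coefficient_of_variation(loads), 2))
```

Because the partitioner looks only at the key and never at its frequency, a dominant key always lands on a single reducer, whatever the number of reducers.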

51 Open Issues: Data Skew
[Figure: intermediate keys K1 to K6 spread over Data Node1, Data Node2, and Data Node3, partitioned with hash(hash code(intermediate key) modulo ReduceID). Total data transfer: 44/54; reduce input cv: 58%]

52 Example: Wordcount
6-node cluster, 2 GB data set
Combine function is disabled
  » Transferred data is relatively large
  » Data distribution is imbalanced
Data distribution: Max-Min Ratio 20%, cv 42%
[Figure: data size (MB) of Map Output and Reduce Input on DataNode01 to DataNode06, split into transferred data, local data, and data during failed reduce; annotation: 83% of the Maps output]

53 LEEN Partitioning Algorithm
Extend the locality-aware concept to the Reduce tasks
Consider fair distribution of reducers' inputs
Results:
  » Balanced distribution of reducers' input
  » Minimize the data transfer during the shuffle phase
  » Improve the response time
Close-to-optimal tradeoff between data locality and reducers' input fairness
S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, L. Qi, LEEN: Locality/fairness-aware key partitioning for MapReduce in the cloud, in: CLOUDCOM '10

54 LEEN details (Example)
[Figure: keys K1 to K6 over three data nodes. Data transfer = 24/54, cv = 14%. For K1: assigning to N2 gives Fairness-Score = 4.9; assigning to N3 gives Fairness-Score = 8.8]

55 Results
[Figure: Locality Range 1-85%, 15-35%, 1-50%, 1-30%]
S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, L. Qi, LEEN: Locality/fairness-aware key partitioning for MapReduce in the cloud, in: CLOUDCOM '10

56 Open Issues: Energy Efficiency
Data centers consume large amounts of energy
  » High monetary cost: power bills become a substantial part of the total cost of ownership (TCO) of a data center (40% of the total cost)
  » High carbon emission

57 Why Energy-Efficient Hadoop?
THE engine for Big Data processing in data centers
Power bills become a substantial part of the total cost of ownership (TCO) of data centers
It is essential to explore and optimize energy efficiency when running Big Data applications in Hadoop

58 Energy-efficiency in Hadoop

59 DVFS in Hadoop?
Diversity of MapReduce applications
  » CPU load is high (98%) during almost 75% of the job's running time
  » CPU load is high (80%) during only 15% of the job's running time
Multiple phases of a MapReduce application
  » Disk I/O, CPU, Disk I/O, Network
There is significant potential for energy saving by scaling down the CPU frequency when peak CPU is not needed

60 DVFS governors
Performance (static): maximizes performance; uses the highest available frequency; high power consumption
Powersave (static): maximizes power savings; uses the lowest available frequency; long response time
OnDemand (dynamic): power efficiency with reasonable performance; adjusts to the highest available frequency when the load is high and gradually degrades the CPU frequency when the load is low; low performance/power saving benefits when the system often switches between idle states and heavy load
Conservative (dynamic): power efficiency with reasonable performance; gradually upgrades the CPU frequency when the load is high and gradually degrades it when the load is low; worse performance than OnDemand
Userspace (static): support for user-defined frequencies

61 Power Consumption
[Figure: power consumption of the Pi application; annotation: 40%]
Towards Efficient Power Management in MapReduce: Investigation of CPU-Frequencies Scaling on Power Efficiency in Hadoop, S. Ibrahim et al., ARMS CC

62 Performance vs Energy
[Figure: Pi application]
Towards Efficient Power Management in MapReduce: Investigation of CPU-Frequencies Scaling on Power Efficiency in Hadoop, S. Ibrahim et al., ARMS CC

63 Open Issue: Hadoop in Virtualized Environment
The applicability of MapReduce on a virtualized cloud
[Figure: properties of MapReduce, Cloud Computing, and Virtual Machines. Labels: Simplicity; Fault Tolerance; Locality Optimization; Load Balancing; Scalability; Various Implementations; Reliable and Robustness; Dynamic scaling; Simplicity; Green Data Center; Flexible Migration and Restart Capabilities; Abstraction layer for the hardware, software, and configuration of systems; Pay as you go with no lock-in; Dynamic Hardware and Software Maintenance; Security; Resource Abstraction; Performance Isolation; Check Pointing / Migration; Power Saving; Optimize Resource Utilization]

64 Physical vs Virtual Hadoop Cluster
HDFS on physical machines is better than on VMs
Separate the permanent data storage (DFS) from the virtual storage associated with the VM.
Evaluating MapReduce on virtual machines: The Hadoop case, S Ibrahim, H Jin, L Lu, L Qi, S Wu, X Shi, Cloud Computing, 2009

65 Multiple VM Deployment
Deploying more VMs can improve performance
VMs within the same physical node compete for the node's I/O, causing poor performance.
Evaluating MapReduce on virtual machines: The Hadoop case, S Ibrahim, H Jin, L Lu, L Qi, S Wu, X Shi, Cloud Computing, 2009

66 CLOUDLET
CLOUDLET: towards MapReduce implementation on virtual machines, S Ibrahim, H Jin, B Cheng, H Cao, S Wu, L Qi. HPDC 2009

67 Evaluation
Three-node cluster
  » Achieved up to 9% on very small cluster (WC)
  » Achieved up to 50% on very small cluster (Sort)
[Figure: Elapsed Time (Sec) for Native, Separate, 512MB, 1GB, 2GB, 1VM]

68 Meta Disk I/O Scheduler
Different disk schedulers in Xen:
  » CFQ, Anticipatory, Deadline, Noop
  » Scheduling at two levels (VMM, VM)
Selecting the appropriate scheduler pair can significantly affect the application's performance
MapReduce is used to run different applications
A typical MapReduce application consists of different interleaving stages, each requiring different I/O workloads and patterns
Is there an optimal scheduler pair for MapReduce applications?

69 Motivation
[Figure panels: Wordcount; Wordcount w/o combiner; Sort. Annotations: (A, C) → 4.5%; (A,N) (D,N) → 6%; (A,A) → 9%]
The disk scheduler pairs are only sub-optimal across different MapReduce applications
Adaptive disk I/O scheduling for MapReduce in virtualized environment, S Ibrahim, H Jin, L Lu, B He, S Wu. ICPP 2011

70 Motivation
A typical Hadoop application consists of different interleaving stages, each requiring different I/O workloads and patterns.
The scheduler pairs are also sub-optimal for different sub-phases of the whole job.

71 Solution: Meta Scheduler
A novel approach for adaptively tuning the disk scheduler pairs in both the hypervisor and the virtual machines during the execution of a single MapReduce job.
  » First, manually divide the Hadoop program into phases based on the characteristics of Hadoop applications.
  » Given the phases of an application and the possible scheduler pairs, a solution is an assignment of one disk scheduler pair to each phase; the number of unique solutions follows from the number of phases and pairs.
  » There is a penalty for the switching overhead.
Novel heuristic that searches the space of solutions.
  » It executes a solution and evaluates the performance score including the switch cost. Based on the evaluation, it selects the next solution to evaluate.
Adaptive disk I/O scheduling for MapReduce in virtualized environment, S Ibrahim, H Jin, L Lu, B He, S Wu. ICPP 2011
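
The size of the solution space follows directly from the definition of a solution as one scheduler pair per phase. The sketch below only does that counting; it is not the heuristic from the paper. The four scheduler names come from the Xen slide, while the choice of three phases and the function names are assumptions made for the example.

```python
from itertools import product

# Disk schedulers listed on the Xen slide; each level (VMM, VM) picks one.
SCHEDULERS = ["CFQ", "Anticipatory", "Deadline", "Noop"]


def scheduler_pairs(schedulers=SCHEDULERS):
    """All (VMM scheduler, VM scheduler) combinations."""
    return list(product(schedulers, repeat=2))


def count_solutions(num_phases, num_pairs):
    """One pair is chosen independently per phase, so the count is pairs ** phases."""
    return num_pairs ** num_phases


def enumerate_solutions(num_phases, pairs):
    """Lazily yield every assignment of a scheduler pair to each phase."""
    return product(pairs, repeat=num_phases)


if __name__ == "__main__":
    pairs = scheduler_pairs()
    phases = 3  # hypothetical number of phases
    print(len(pairs), "pairs,", count_solutions(phases, len(pairs)), "solutions")
```

The exponential growth in the number of phases is why an exhaustive search quickly becomes impractical and a heuristic search is used instead.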

72 Evaluation
Adaptive disk I/O scheduling for MapReduce in virtualized environment, S Ibrahim, H Jin, L Lu, B He, S Wu. ICPP 2011

73 Acknowledgments
Dhruba Borthakur (Yahoo!)
Vinod Kumar Vavilapalli (Hortonworks)
Apache Hadoop Community

74 Thank you!
Shadi Ibrahim

75 YARN: Next Generation Hadoop
Adapted from a presentation by Vinod Kumar Vavilapalli (Hortonworks), Hadoop Yarn, SF Hadoop Users Meetup

76 Hadoop MapReduce Classic
JobTracker
  » Manages cluster resources and job scheduling
TaskTracker
  » Per-node agent
  » Manages tasks

77 Current Limitations
Scalability
  » Maximum cluster size: 4,000 nodes
  » Maximum concurrent tasks: 40,000
  » Coarse synchronization in JobTracker
Single point of failure
  » Failure kills all queued and running jobs
  » Jobs need to be re-submitted by users
Restart is very tricky due to complex state

78 Current Limitations - 2
Hard partition of resources into map and reduce slots
  » Low resource utilization
Lacks support for alternate paradigms
  » Iterative applications implemented using MapReduce are 10x slower
  » Hacks for the likes of MPI/Graph Processing
Lack of wire-compatible protocols
  » Client and cluster must be of the same version
  » Applications and workflows cannot migrate to different clusters

79 Requirements
Reliability
Availability
Utilization
Wire Compatibility
Agility & Evolution
  » Ability for customers to control upgrades to the grid software stack
Scalability
  » Clusters of 6,000-10,000 machines
  » Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
  » 100,000+ concurrent tasks
  » 10,000 concurrent jobs

80 Design Centre
Split up the two major functions of JobTracker
  » Cluster resource management
  » Application life-cycle management
MapReduce becomes a user-land library

81 Architecture
Application
  » Application is a job submitted to the framework
  » Example: MapReduce job
Container
  » Basic unit of allocation
  » Example: container A = 2GB, 1CPU
  » Replaces the fixed map/reduce slots
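
The container as a unit of allocation can be illustrated with a small resource calculation. This is a toy model, not YARN's scheduler or API: the `Resource` class and the function name are hypothetical, the container size (2 GB, 1 CPU) is the slide's example, and the node size (48 GB, 16 cores) is taken from the machine sizes on the Requirements slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A bundle of memory and CPU, used for both nodes and containers."""
    memory_gb: int
    vcores: int


def containers_that_fit(node, container):
    """How many identical containers fit on one node.

    The scarcer of the two resources limits the count, unlike fixed
    map/reduce slots, which ignore the actual size of each task.
    """
    return min(node.memory_gb // container.memory_gb,
               node.vcores // container.vcores)


if __name__ == "__main__":
    node = Resource(memory_gb=48, vcores=16)
    container_a = Resource(memory_gb=2, vcores=1)
    print(containers_that_fit(node, container_a))
```

With larger containers the limiting resource changes: a hypothetical 8 GB, 1 CPU container would be limited by memory rather than by cores on the same node.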

82 Architecture
Resource Manager
  » Global resource scheduler
  » Hierarchical queues
Node Manager
  » Per-machine agent
  » Manages the life-cycle of containers
  » Container resource monitoring
Application Master
  » Per-application
  » Manages application scheduling and task execution
  » E.g. MapReduce Application Master

83 Architecture

84 Improvements vs classic MapReduce
Utilization
  » Generic resource model: data locality (node, rack etc.), memory, CPU, disk b/w, network b/w
  » Remove fixed partition of map and reduce slots
Scalability
  » Application life-cycle management is very expensive
  » Partition resource management and application life-cycle management
  » Application management is distributed
  » Hardware trends - Currently run clusters of 4,000 machines
  » 6, machines > 12, machines
  » <16+ cores, 48/96G, 24TB> v/s <8 cores, 16G, 4TB>

85 Improvements vs classic MapReduce
Fault Tolerance and Availability
  » Resource Manager: no single point of failure, state saved in ZooKeeper (coming soon); Application Masters are restarted automatically on RM restart
  » Application Master: optional failover via application-specific checkpoint; MapReduce applications pick up where they left off via state saved in HDFS
Wire Compatibility
  » Protocols are wire-compatible
  » Old clients can talk to new servers
  » Rolling upgrades

86 Improvements vs classic MapReduce
Innovation and Agility
  » MapReduce now becomes a user-land library
  » Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)
  » Faster deployment cycles for improvements
  » Customers upgrade MapReduce versions on their schedule
  » Users can customize MapReduce

87 Improvements vs classic MapReduce
Support for programming paradigms other than MapReduce
  » MPI
  » Master-Worker
  » Machine Learning
  » Iterative processing
Enabled by allowing the use of a paradigm-specific application master
Run all on the same Hadoop cluster

88 Any Performance Gains?
Significant gains across the board, 2x in some cases.
MapReduce
  » Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
  » Shuffle 30%+
  » Small Jobs: Uber AM
  » Re-use task slots (containers)

89 Thank you!
Shadi Ibrahim


Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

Big Data Management in the Clouds and HPC Systems

Big Data Management in the Clouds and HPC Systems Big Data Management in the Clouds and HPC Systems Hemera Final Evaluation Paris 17 th December 2014 Shadi Ibrahim Shadi.ibrahim@inria.fr Era of Big Data! Source: CNRS Magazine 2013 2 Era of Big Data! Source:

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Google Bing Daytona Microsoft Research

Google Bing Daytona Microsoft Research Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large

More information

Apache Hadoop: Past, Present, and Future

Apache Hadoop: Past, Present, and Future The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Apache Hadoop FileSystem Internals

Apache Hadoop FileSystem Internals Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Jordan Prosch, Matt Kipps Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

YARN Apache Hadoop Next Generation Compute Platform

YARN Apache Hadoop Next Generation Compute Platform YARN Apache Hadoop Next Generation Compute Platform Bikas Saha @bikassaha Hortonworks Inc. 2013 Page 1 Apache Hadoop & YARN Apache Hadoop De facto Big Data open source platform Running for about 5 years

More information

Peers Techno log ies Pv t. L td. HADOOP

Peers Techno log ies Pv t. L td. HADOOP Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and

More information

HDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc.

HDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc. HDFS Under the Hood Sanjay Radia Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Big Data Analysis and Its Scheduling Policy Hadoop

Big Data Analysis and Its Scheduling Policy Hadoop IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan Feb. 2015), PP 36-40 www.iosrjournals.org Big Data Analysis and Its Scheduling Policy

More information

Improving MapReduce Performance in Heterogeneous Environments

Improving MapReduce Performance in Heterogeneous Environments UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce

More information

<Insert Picture Here> Big Data

<Insert Picture Here> Big Data Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big

More information

Virtualizing Apache Hadoop. June, 2012

Virtualizing Apache Hadoop. June, 2012 June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Hadoop Architecture and its Usage at Facebook

Hadoop Architecture and its Usage at Facebook Hadoop Architecture and its Usage at Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Microsoft Research, Seattle October 16, 2009 Outline Introduction

More information

Entering the Zettabyte Age Jeffrey Krone

Entering the Zettabyte Age Jeffrey Krone Entering the Zettabyte Age Jeffrey Krone 1 Kilobyte 1,000 bits/byte. 1 megabyte 1,000,000 1 gigabyte 1,000,000,000 1 terabyte 1,000,000,000,000 1 petabyte 1,000,000,000,000,000 1 exabyte 1,000,000,000,000,000,000

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

From Wikipedia, the free encyclopedia

From Wikipedia, the free encyclopedia Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications. [1] It enables applications to work with

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

CURSO: ADMINISTRADOR PARA APACHE HADOOP

CURSO: ADMINISTRADOR PARA APACHE HADOOP CURSO: ADMINISTRADOR PARA APACHE HADOOP TEST DE EJEMPLO DEL EXÁMEN DE CERTIFICACIÓN www.formacionhadoop.com 1 Question: 1 A developer has submitted a long running MapReduce job with wrong data sets. You

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian From Relational to Hadoop Part 1: Introduction to Hadoop Gwen Shapira, Cloudera and Danil Zburivsky, Pythian Tutorial Logistics 2 Got VM? 3 Grab a USB USB contains: Cloudera QuickStart VM Slides Exercises

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray ware 2 Agenda The Hadoop Journey Why Virtualize Hadoop? Elasticity and Scalability Performance Tests Storage Reference

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

HDFS Users Guide. Table of contents

HDFS Users Guide. Table of contents Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

A Cost-Evaluation of MapReduce Applications in the Cloud

A Cost-Evaluation of MapReduce Applications in the Cloud 1/23 A Cost-Evaluation of MapReduce Applications in the Cloud Diana Moise, Alexandra Carpen-Amarie Gabriel Antoniu, Luc Bougé KerData team 2/23 1 MapReduce applications - case study 2 3 4 5 3/23 MapReduce

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Hadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009

Hadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Hadoop and its Usage at Facebook Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook

More information

Communicating with the Elephant in the Data Center

Communicating with the Elephant in the Data Center Communicating with the Elephant in the Data Center Who am I? Instructor Consultant Opensource Advocate http://www.laubersoltions.com sml@laubersolutions.com Twitter: @laubersm Freenode: laubersm Outline

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013 Hadoop 101 Lars George NoSQL- Ma4ers, Cologne April 26, 2013 1 What s Ahead? Overview of Apache Hadoop (and related tools) What it is Why it s relevant How it works No prior experience needed Feel free

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Big Data Technology Core Hadoop: HDFS-YARN Internals

Big Data Technology Core Hadoop: HDFS-YARN Internals Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class

More information