MapReduce and Hadoop Distributed File System




MapReduce and Hadoop Distributed File System
B. Ramamurthy
Contact: Dr. Bina Ramamurthy, CSE Department, University at Buffalo (SUNY), bina@buffalo.edu, http://www.cse.buffalo.edu/faculty/bina
Partially supported by NSF DUE Grants 0737243 and 0920335

The Context: Big Data
- Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
- Google collected 270PB of data in a month (2007) and 20,000PB a day (2008)
- The 2010 census data is expected to be a huge gold mine of information
- Mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance
- We are in a knowledge economy: data is an important asset to any organization
- Discovery of knowledge; enabling discovery; annotation of data
- We are looking at newer programming models and supporting algorithms and data structures
- NSF refers to this as data-intensive computing; industry calls it big data and cloud computing

Purpose of This Talk
To provide a simple introduction to:
- Big-data computing: an important advancement with the potential to significantly impact the CS undergraduate curriculum
- A programming model called MapReduce for processing big data
- A supporting file system called the Hadoop Distributed File System (HDFS)
To encourage students to explore ways to infuse relevant concepts of this emerging area into their projects.
To explore ways of contributing to the HDFS project.

The Outline
- The concept
- Introduction to MapReduce
- From CS foundations to MapReduce
- MapReduce programming model
- Hadoop Distributed File System
- Demo
- Our experience with the framework
- Summary
- References

The Concept

Big Issues
Read client (e.g., wordcount, index, pagerank):
- Issue: efficient parallel processing of big data
- More than multithreading: parallelism at the algorithmic level
Write client (e.g., web crawler):
- Issue: big-data storage, write once / read many
- A distributed data server (not a file server or DBMS)

Big Solutions?
Read client (e.g., wordcount, index, pagerank):
- Issue: efficient parallel processing of big data, more than multithreading, at the algorithmic level
- Google's solution: MapReduce
Write client (e.g., web crawler):
- Issue: big-data storage, write once / read many, a distributed data server (not a file server or DBMS)
- Google's solution: GFS; Yahoo's answer: HDFS

MapReduce

What is MapReduce?
MapReduce is a programming model that Google has used successfully in processing its big-data sets (~20,000 petabytes per day):
- Users specify the computation in terms of a map and a reduce function
- The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines
- The underlying system also handles machine failures, efficient communication, and performance issues
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
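The division of labor on this slide can be sketched in a few lines of Python. This is a toy, sequential stand-in, not Google's or Hadoop's implementation: the user writes only map_fn and reduce_fn, and the small driver plays the role of the parallelizing runtime.

```python
from collections import defaultdict

# User-supplied map function: emit a <word, 1> pair for each word in a record.
def map_fn(record):
    for word in record.split():
        yield word, 1

# User-supplied reduce function: combine all values seen for one key.
def reduce_fn(key, values):
    return key, sum(values)

# Toy "runtime": apply map to every record, group outputs by key, then reduce.
# A real runtime does this across a cluster and handles failures for you.
def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["web weed green", "sun moon land part", "web green"])
```

The user never writes threading, distribution, or recovery code; swapping in different map_fn/reduce_fn pairs yields different applications on the same runtime.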

From CS Foundations to MapReduce
Consider a large data collection: {web, weed, green, sun, moon, land, part, web, green, ...}
Problem: count the occurrences of the different words in the collection.
Let's design a solution for this problem:
- We will start from scratch
- We will add and relax constraints
- We will do incremental design, improving the solution for performance and scalability

Word Counter and Result Table
{web, weed, green, sun, moon, land, part, web, green, ...}
A Main program drives a single WordCounter, whose parse() and count() operate on the Collection to produce a ResultTable:
web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1

Multiple Instances of Word Counter
Multiple WordCounter threads, each with parse() and count(), operate on the Collection and fill the same ResultTable. Observe: multithreading requires a lock on the shared data.

Improve Word Counter for Performance
Separate the parser from the counter: parser threads emit a WordList of <KEY, VALUE> pairs (web, weed, green, sun, moon, land, part, web, green, ...), and counter threads consume it to build the ResultTable. With separate counters, there is no need for a lock.

Peta-scale Data
The same parser -> WordList -> counter pipeline, now applied to peta-scale data.

Addressing the Scale Issue
- A single machine cannot serve all the data: you need a special distributed (file) system
- Large numbers of commodity hardware disks: say, 1000 disks of 1TB each
- Issue: with a mean time between failures (MTBF) or failure rate of 1/1000, at least 1 of those 1000 disks can be expected to be down at any given time. Thus failure is the norm, not an exception.
- The file system has to be fault-tolerant: replication, checksums
- Data-transfer bandwidth is critical (location of data)
- Critical aspects: fault tolerance + replication + load balancing + monitoring
- Exploit the parallelism afforded by splitting parsing and counting
- Provision and locate computing at data locations
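The failure arithmetic above can be checked directly. The 1/1000 rate is the slide's illustrative figure, not a measured one:

```python
# With n disks, each down with probability p at any given moment,
# the expected number of failed disks is n * p, and the probability
# that at least one disk is down is 1 - (1 - p)**n.
n, p = 1000, 1.0 / 1000
expected_failures = n * p                     # 1.0 disk down on average
prob_at_least_one_down = 1 - (1 - p) ** n     # roughly 0.63
```

Even with very reliable individual disks, at cluster scale some component is essentially always failed, which is why replication and fault tolerance must be designed in from the start.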


Peta-scale Data Is Commonly Distributed
The same pipeline, but the data is now spread across machines. Issue: managing the large-scale data.

Write Once Read Many (WORM) Data
The same pipeline, with one more observation: the big-data collection is written once and read many times.

WORM Data Is Amenable to Parallelism
1. Data with WORM characteristics lends itself to parallel processing
2. Data without dependencies lends itself to out-of-order processing

Divide and Conquer: Provision Computing at Data Location
Replicate the pipeline: each node runs its own Main, parser thread, WordList, and counter over its part of the Collection. For our example:
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks
This is a particular solution; let's generalize it:
- Our parse is a mapping operation: MAP: input -> <key, value> pairs
- Our count is a reduce operation: REDUCE: <key, value> pairs -> reduced <key, value> pairs
- Map/Reduce originated in Lisp, but the operations have a different meaning here
- The runtime adds distribution, fault tolerance, replication, monitoring, and load balancing to your base application!

Mapper and Reducer
Remember: MapReduce is simplified processing for large data sets. (See the source code of the MapReduce version of WordCount.)

Map Operation
MAP: input data -> <key, value> pairs
Split the data to supply multiple processors: each split (split 1, split 2, ..., split n) of the Collection feeds a Map task, which emits <KEY, VALUE> pairs such as web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, ...

Reduce Operation
MAP: input data -> <key, value> pairs
REDUCE: <key, value> pairs -> <result>
Split the data to supply multiple processors: each split (split 1, split 2, ..., split n) of the Collection feeds a Map task, and each Map task's output feeds a Reduce task.

Large-scale data is divided into splits; each Map task emits <key, 1> pairs and parse-hashes each pair to a reducer. The reducers (say, Count) produce partitioned outputs: P-0000 (count1), P-0001 (count2), P-0002 (count3), ...
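The parse-hash step can be sketched as follows: hashing each key picks a reducer, so every occurrence of the same key lands at the same reducer, and each reducer writes one partitioned output. The hash function, partition count, and P-0000-style names here are illustrative, not Hadoop's actual partitioner.

```python
from collections import defaultdict

NUM_REDUCERS = 3

# A simple deterministic hash (Python's built-in str hash is salted
# per process, so we avoid it for a reproducible sketch).
def partition(key, num_reducers=NUM_REDUCERS):
    return sum(ord(c) for c in key) % num_reducers

pairs = [("web", 1), ("weed", 1), ("green", 1), ("web", 1), ("green", 1)]

# Shuffle: route every <key, 1> pair to its reducer's input partition.
partitions = defaultdict(list)
for key, value in pairs:
    partitions[partition(key)].append((key, value))

# Each reducer counts only its own partition and writes one output,
# named in the P-0000, P-0001, ... style shown on the slide.
outputs = {}
for r, part_pairs in partitions.items():
    counts = defaultdict(int)
    for key, value in part_pairs:
        counts[key] += value
    outputs["P-%04d" % r] = dict(counts)
```

Because the hash is a pure function of the key, no coordination between reducers is needed: correctness follows from every "web" pair hashing to the same place.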

MapReduce Example in My Operating Systems Class
Input of terabyte size, containing words (Cat, Bat, Dog, and other words), is divided into splits; each split flows through map and combine, and reduce tasks produce the outputs part0, part1, part2, ...

MapReduce Programming Model

MapReduce programming model
- Determine whether the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? is the data set large?)
- Design and implement the solution as Mapper classes and a Reducer class
- Compile the source code against the Hadoop core
- Package the code as a jar executable
- Configure the application (job): the number of mapper and reducer tasks, and the input and output streams
- Load the data (or use previously loaded data)
- Launch the job and monitor it
- Study the results
Detailed steps.

MapReduce Characteristics
- Very large scale data: peta-, exabytes
- Write-once, read-many data: allows for parallelism without mutexes
- Map and Reduce are the main operations: simple code
- There are other supporting operations, such as combine and partition (outside the scope of this talk)
- All map tasks must complete before the reduce operation starts
- Map and reduce operations are typically performed by the same physical processor
- The numbers of map tasks and reduce tasks are configurable
- Operations are provisioned near the data
- Commodity hardware and storage
- The runtime takes care of splitting and moving the data for operations
- A special distributed file system; example: the Hadoop Distributed File System and Hadoop runtime
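The combine operation mentioned above can be sketched as a map-side "mini reduce": it pre-aggregates one map task's output locally, so fewer pairs have to cross the network to the reducers. This sketch assumes the word-count example; it is not Hadoop's Combiner API.

```python
from collections import defaultdict

# Combiner: locally aggregate a single map task's output before the
# shuffle. For word count, the combiner is the same logic as the reducer.
def combine(pairs):
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

map_output = [("web", 1), ("green", 1), ("web", 1), ("web", 1)]
combined = combine(map_output)   # 2 pairs now cross the network, not 4
```

A combiner is only safe when the reduce function is associative and commutative (as a sum is), because the runtime may apply it zero or more times per map task.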

Classes of problems that are mapreducible
- Benchmark for comparison: Jim Gray's challenge on data-intensive computing; e.g., sort
- Google uses it (we think) for wordcount, AdWords, PageRank, and indexing data
- Simple algorithms such as grep, text indexing, and reverse indexing
- Bayesian classification, in the data-mining domain
- Facebook uses it for various operations: demographics
- Financial services use it for analytics
- Astronomy: Gaussian analysis for locating extraterrestrial objects
- Expected to play a critical role in the semantic web and Web 3.0
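Grep, listed above as one of the simple mapreducible algorithms, shows how little some problems need from the model: the map emits matching lines and the reduce is a pass-through (grep is often run as a map-only job). A sketch, assuming a literal-substring pattern:

```python
# MapReduce-style grep: the map function emits every input line that
# contains the pattern; no aggregation is needed, so reduce is identity.
def grep_map(pattern, lines):
    for line in lines:
        if pattern in line:
            yield line

lines = ["hadoop stores blocks", "mapreduce runs tasks", "hdfs replicates blocks"]
matches = list(grep_map("blocks", lines))
```

Scaling this to terabytes is just a matter of running grep_map over many splits in parallel; the matches from each split are independent and can simply be concatenated.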

Scope of MapReduce
A spectrum of granularity, from small to large:
- Pipelined: instruction level
- Concurrent: thread level
- Service: object level
- Indexed: file level
- Mega: block level
- Virtual: system level

Hadoop

What is Hadoop?
At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose. GFS is not open source. Doug Cutting and Yahoo! reverse engineered GFS and called the result the Hadoop Distributed File System (HDFS). The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop. It is open source and distributed by Apache.

Basic Features: HDFS
- Highly fault-tolerant
- High throughput
- Suitable for applications with large data sets
- Streaming access to file system data
- Can be built out of commodity hardware

Hadoop Distributed File System
An HDFS client application talks to the HDFS server (the master node, running the name node). The client's local file system uses a small block size (say, 2K); HDFS uses a block size of 128M, with blocks replicated. (More details: we discuss this in great detail in my Operating Systems course.)

Hadoop Distributed File System
The same picture, with two more details: the name node on the master maintains a block map, and the data nodes report in via heartbeats. The client's local file system uses a 2K block size; HDFS uses a 128M block size, replicated. (More details: we discuss this in great detail in my Operating Systems course.)
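With the 128M block size shown on these slides, the storage footprint of a file is easy to estimate. The replication factor of 3 is HDFS's common default, and the file size below is purely illustrative:

```python
import math

BLOCK_SIZE_MB = 128      # HDFS block size from the slide
REPLICATION = 3          # common HDFS default replication factor

# A file is stored as ceil(size / block_size) blocks, and each block
# is replicated, so total block copies = blocks * replication factor.
def hdfs_footprint(file_size_mb):
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

blocks, copies = hdfs_footprint(1000)   # a 1000MB file
```

The large block size keeps the name node's block map small (far fewer entries per file than a 2K local block size would produce), which matters because the block map lives in the name node's memory.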

Relevance and Impact on Undergraduate Courses
- Data structures and algorithms: a new look at traditional algorithms such as sort. Quicksort may not be your choice: it is not easily parallelizable; merge sort is better.
- You can identify mappers and reducers among your algorithms. Mappers and reducers are simply placeholders for algorithms relevant to your applications.
- Large-scale data and analytics are indeed concepts to reckon with, similar to how we addressed programming-in-the-large with OO concepts.
- While a full course on MR/HDFS may not be warranted, the concepts can perhaps be woven into most courses in our CS curriculum.
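The merge sort point above can be illustrated concretely: the two recursive sorts are independent subproblems (map-like, and could run on different workers), while the merge combines their results (reduce-like). A sequential sketch:

```python
# Merge sort: the two recursive calls share no state, so they could be
# dispatched to separate workers; merge then combines the sorted halves.
def merge_sort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])    # independent subproblem
    right = merge_sort(xs[mid:])   # independent subproblem
    return merge(left, right)

def merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

sorted_xs = merge_sort([5, 2, 9, 1, 5, 6])
```

Quicksort's partition step, by contrast, rearranges the whole array in place before recursing, which makes the split data-dependent and harder to distribute.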

Demo
- VMware-simulated Hadoop and MapReduce demo
- Remote access to the NEXOS systems lab
- 5-node HDFS cluster running on Ubuntu 8.04: 1 name node and 4 data nodes
- Each node is an old commodity PC with 512MB RAM and 120GB-160GB of external storage
- Zeus (name node); data nodes: hermes, dionysus, aphrodite, athena

Summary
- We introduced the MapReduce programming model for processing large-scale data
- We discussed the supporting Hadoop Distributed File System
- The concepts were illustrated using a simple example
- We reviewed some important parts of the source code for the example
- Relationship to cloud computing

References
1. Apache Hadoop tutorial: http://hadoop.apache.org, http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html