MapReduce and Hadoop Distributed File System
Vijay Rao
The Context: Big Data
Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009).
Google collected 270PB of data in a month (2007) and 20,000PB a day (2008).
The 2010 census data is expected to be a huge gold mine of information.
Mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
We are in a knowledge economy: data is an important asset to any organization.
Discovery of knowledge; enabling discovery; annotation of data.
We are looking at newer programming models, and supporting algorithms and data structures.
NSF refers to it as data-intensive computing; industry calls it big data and cloud computing.
MapReduce
CCSCNE 2009, Plattsburgh, April. B. Ramamurthy & K. Madurai
What is MapReduce?
MapReduce is a programming model Google has used successfully for processing its big-data sets (~20,000 petabytes per day).
Users specify the computation in terms of a map and a reduce function.
The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
The underlying system also handles machine failures, efficient communication, and performance issues.
From CS Foundations to MapReduce
Consider a large data collection: {web, weed, green, sun, moon, land, part, web, green, ...}
Problem: count the occurrences of the different words in the collection.
Let's design a solution for this problem:
We will start from scratch.
We will add and relax constraints.
We will do incremental design, improving the solution for performance and scalability.
Word Counter and Result Table
Collection: {web, weed, green, sun, moon, land, part, web, green, ...}
A single Main program uses a WordCounter with parse() and count() to scan the Collection and produce the ResultTable:
web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
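A plain single-machine sketch of this starting point (class and method names here are illustrative, chosen to match the slide's parse() and count()):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Single-threaded word counter: parse() turns the collection into a word list,
// count() builds the ResultTable, e.g. {web=2, weed=1, green=2, ...}.
public class WordCounter {

  // parse(): split the raw collection into individual words
  static List<String> parse(String collection) {
    return Arrays.asList(collection.split("\\s+"));
  }

  // count(): tally each word into the result table
  static Map<String, Integer> count(List<String> words) {
    Map<String, Integer> resultTable = new LinkedHashMap<>();
    for (String word : words) {
      resultTable.merge(word, 1, Integer::sum);
    }
    return resultTable;
  }

  public static void main(String[] args) {
    String collection = "web weed green sun moon land part web green";
    System.out.println(count(parse(collection)));
    // {web=2, weed=1, green=2, sun=1, moon=1, land=1, part=1}
  }
}
```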
Multiple Instances of Word Counter
Several WordCounter threads run parse() and count() in parallel over the Collection, all writing into the shared ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1).
Observe: multi-threading requires a lock on the shared data.
Improve Word Counter for Performance
Split the work into a Parser thread and a Counter thread, connected by an intermediate WordList: Collection → Parser → WordList → Counter → ResultTable.
With separate counters there is no need for a lock on shared data; a sketch follows.
The WordList holds <KEY, VALUE> pairs for web, weed, green, sun, moon, land, part, web, green, ...
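A sketch of the lock-free idea, assuming (for illustration) that we split the word list and give each thread its own private counter, merging the partial tables at the end:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each thread counts its own slice of the word list into a private map
// (no shared state, hence no lock); the partial tables are merged afterwards.
public class ParallelWordCounter {

  static Map<String, Integer> countSlice(List<String> words) {
    Map<String, Integer> partial = new HashMap<>();
    for (String w : words) {
      partial.merge(w, 1, Integer::sum);
    }
    return partial;
  }

  public static void main(String[] args) throws Exception {
    List<String> wordList =
        List.of("web", "weed", "green", "sun", "moon", "land", "part", "web", "green");

    int nThreads = 2;
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    int chunk = (wordList.size() + nThreads - 1) / nThreads;

    // Submit one counting task per slice of the word list.
    List<Future<Map<String, Integer>>> futures = new ArrayList<>();
    for (int i = 0; i < wordList.size(); i += chunk) {
      List<String> slice = wordList.subList(i, Math.min(i + chunk, wordList.size()));
      Callable<Map<String, Integer>> task = () -> countSlice(slice);
      futures.add(pool.submit(task));
    }

    // Merge the per-thread partial tables into the final ResultTable.
    Map<String, Integer> resultTable = new HashMap<>();
    for (Future<Map<String, Integer>> f : futures) {
      f.get().forEach((word, n) -> resultTable.merge(word, n, Integer::sum));
    }
    pool.shutdown();

    System.out.println(resultTable);
  }
}
```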
Peta-scale
The same pipeline, Collection → Parser → WordList → Counter → ResultTable, now has to handle peta-scale data. The WordList still holds <KEY, VALUE> pairs (web, weed, green, sun, moon, land, part, web, green, ...) and the ResultTable still records web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.
Addressing the Scale Issue
A single machine cannot serve all the data: you need a special distributed (file) system.
Use a large number of commodity hardware disks: say, 1000 disks of 1TB each.
Issue: with a mean time between failures (MTBF) or failure rate of 1/1000, at least one of those 1000 disks is expected to be down at any given time. Thus failure is the norm, not the exception.
The file system has to be fault-tolerant: replication, checksums.
Transfer bandwidth is critical (location of data).
Critical aspects: fault tolerance + replication + load balancing + monitoring.
Exploit the parallelism afforded by splitting parsing and counting.
Provision and locate computing at data locations.
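To see why failure becomes the norm, a back-of-the-envelope calculation (assuming, purely for illustration, that each disk is independently down with probability 1/1000 at any given moment):

\[
\mathbb{E}[\text{disks down}] = n\,p = 1000 \times \tfrac{1}{1000} = 1,
\qquad
P(\text{at least one down}) = 1 - \left(1 - \tfrac{1}{1000}\right)^{1000} \approx 1 - e^{-1} \approx 0.63 .
\]

So with a thousand commodity disks, having at least one disk down is the typical state of the system, which is why replication and checksums must be built into the file system rather than bolted on.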
Peta Scale is Commonly Distributed
At peta scale the data is commonly distributed across many machines, while the pipeline (Collection → Parser → WordList → Counter → ResultTable) stays the same. Issue: managing the large-scale data.
Write Once Read Many (WORM) Data
The data in this pipeline is typically write-once, read-many: the Collection is written once and then read repeatedly by the Parser and Counter.
WORM is Amenable to Parallelism
1. Data with WORM characteristics yields to parallel processing.
2. Data without dependencies yields to out-of-order processing.
Divide and Conquer: Provision Computing at Data Location
Each node runs its own Main thread with a Parser → WordList → Counter → ResultTable pipeline over its share of the Collection.
For our example: #1 schedule parallel parse tasks; #2 schedule parallel count tasks.
This is a particular solution; let's generalize it:
Our parse is a mapping operation. MAP: input → <key, value> pairs.
Our count is a reduce operation. REDUCE: <key, value> pairs → reduced result.
Map and Reduce originated from Lisp, but they have a different meaning here.
The runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!
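In the standard MapReduce formulation (following the original Google paper), the two user-supplied functions have these shapes: map takes an input pair (k1, v1) and produces a list of intermediate pairs list(k2, v2); reduce takes an intermediate key k2 together with the list of all its values list(v2) and produces a (typically much smaller) list of values. For word count, map emits <word, 1> pairs and reduce sums the 1s for each word.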
Mapper and Reducer
A MapReduceTask consists of a Mapper and a Reducer: YourMapper plays the role of the Parser, and YourReducer plays the role of the Counter.
Remember: MapReduce is simplified processing for large data sets. The MapReduce version of WordCount is sketched below.
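A minimal sketch of the WordCount mapper and reducer using the Hadoop MapReduce API (it follows the standard Hadoop WordCount example; the enclosing class name is illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: the "parse" step. Emits <word, 1> for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit <word, 1>
      }
    }
  }

  // Reducer: the "count" step. Sums the 1s emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // emit <word, total count>
    }
  }
}
```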
Map Operation
MAP: input data → <key, value> pairs.
The collection is split (split 1, split 2, ..., split n) to supply multiple processors, and each split is fed to its own Map task.
Each Map task emits one <KEY, VALUE> pair per word, e.g. <web, 1>, <weed, 1>, <green, 1>, <sun, 1>, <moon, 1>, <land, 1>, <part, 1>, <web, 1>, <green, 1>, ...
Reduce Operation
MAP: input data → <key, value> pairs. REDUCE: <key, value> pairs → <result>.
The collection is split (split 1, split 2, ..., split n) to supply multiple processors; each split flows through a Map task and then into a Reduce task.
Large-scale Data Splits
Large-scale data is split and fed to Map tasks, which emit <key, 1> pairs. A parse-hash step routes each key to a reducer, and the Reducers (say, Count) produce the output partitions P-0000 (count1), P-0001 (count2), P-0002 (count3). A partitioner sketch follows.
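The "parse-hash" routing of keys to reducer partitions is what Hadoop's Partitioner does. A minimal sketch of a hash partitioner (Hadoop's default HashPartitioner behaves essentially like this; the class name is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate <key, value> pair to one of the reduce partitions
// (P-0000, P-0001, ...) by hashing the key, so all pairs for the same word
// end up at the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```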
MapReduce Example
The input words (Cat, Bat, Dog, Other Words; size: TBytes) are divided into splits. Each split goes through a map and a combine step, and the reduce tasks produce the output partitions part0, part1, and part2.
Overall MapReduce Example
MapReduce Programming Model
MapReduce programming model
Determine if the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? Is it a large data set?).
Design and implement the solution as Mapper classes and a Reducer class.
Compile the source code against Hadoop core.
Package the code as an executable jar.
Configure the application (job): the number of mapper and reducer tasks, the input and output streams.
Load the data (or use previously loaded data).
Launch the job and monitor it.
Study the result.
A driver sketch for the configure-and-launch steps follows.
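A minimal sketch of the job-configuration and launch step, assuming the WordCount mapper and reducer shown earlier (the job name, reduce-task count, and input/output paths are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job (classes, number of reduce tasks, input/output
// paths) and submits it to the Hadoop runtime, which handles splitting,
// scheduling, and fault tolerance.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setNumReduceTasks(2);                            // number of reduce tasks is configurable

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```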
MapReduce Characteristics
Very large scale data: peta-, exa-bytes.
Write-once, read-many data: allows for parallelism without mutexes.
Map and Reduce are the main operations: simple code.
There are other supporting operations such as combine and partition (out of the scope of this talk).
All the map tasks should complete before the reduce operation starts.
Map and reduce operations are typically performed by the same physical processor.
The number of map tasks and reduce tasks is configurable.
Operations are provisioned near the data.
Commodity hardware and storage.
The runtime takes care of splitting and moving the data for operations.
A special distributed file system, for example the Hadoop Distributed File System, with the Hadoop runtime.
Classes of problems that are mapreducable
Benchmark for comparing: Jim Gray's challenge on data-intensive computing. Ex: Sort.
Google uses it (we think) for wordcount, AdWords, PageRank, indexing data.
Simple algorithms such as grep, text indexing, reverse indexing.
Bayesian classification: data mining domain.
Facebook uses it for various operations: demographics.
Financial services use it for analytics.
Astronomy: Gaussian analysis for locating extra-terrestrial objects.
Expected to play a critical role in the semantic web and Web 3.0.
Scope of MapReduce
A spectrum of scale, from small to large:
Pipelined, instruction level
Concurrent, thread level
Service, object level
Indexed, file level
Mega, block level
Virtual, system level
Hadoop
What is Hadoop?
At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose. GFS is not open source.
Doug Cutting and Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop. It is open source and distributed by Apache.
Basic Features: HDFS
Highly fault-tolerant.
High throughput.
Suitable for applications with large data sets.
Streaming access to file system data.
Can be built out of commodity hardware.
Hadoop Distributed File System
The Application uses an HDFS Client, which talks to the HDFS Server on the master node (the Name Node) rather than only to the local file system.
Local file system block size: about 2K. HDFS block size: 128M, with blocks replicated.
Hadoop Distributed File System (continued)
In addition, the master's Name Node maintains a blockmap recording where each block is stored, and periodic heartbeat messages report liveness back to the HDFS Server.
Local file system block size: about 2K; HDFS block size: 128M, replicated. A client-side sketch follows.
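A small sketch of how an application touches HDFS through the FileSystem API (the Name Node address, file path, and printed values are illustrative; the block size and replication simply come back from the Name Node's metadata):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client: writes a file, then reads back its block size and
// replication factor from the file's metadata.
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative Name Node address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/words.txt");   // illustrative path

    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("web weed green sun moon land part web green");
    }

    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size:  " + status.getBlockSize());    // e.g. 128M
    System.out.println("replication: " + status.getReplication());  // e.g. 3
  }
}
```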