Introduction to Hadoop




Introduction to Hadoop

Miles Osborne
School of Informatics, University of Edinburgh
miles@inf.ed.ac.uk

October 28, 2010

Outline
- Background
- Hadoop
- Programming Model
- Examples
- Efficiency Examples
- Critique

Background
MapReduce (MR) is a parallel programming model and associated infrastructure introduced by Google in 2004:
- Assumes large numbers of cheap, commodity machines
- Failure is a part of life
- Tailored for dealing with Big Data
- Simple
- Scales well

Background
Early Google server (source: nialkennedy, flickr)

Background
Who uses it?
- Google (more than 1 million cores, rumour has it)
- Yahoo! (more than 100K cores)
- Facebook (8.8K cores, 12 PB storage)
- Twitter, IBM, Amazon Web Services
- Edinburgh (!)
- Many, many small start-ups
http://wiki.apache.org/hadoop/poweredby

Background
Hadoop @ Edinburgh:
- Original small cluster (6 nodes), running 24/7 for two years
- Research cluster (120 nodes)
- Teaching cluster (54 nodes)

Source: Zhao et al., SIGMETRICS '09


Components
Major components:
1. MR task scheduling and environment: running jobs, moving data, coordination, handling failures, etc.
2. Distributed File System (DFS): storing data in a robust manner across a network; moving data to nodes
3. Distributed Hash Table (BigTable): random access to data that is shared across the network
Hadoop is an open-source version of 1 and 2; HBase (etc.) is similar to 3.

MR
Tasks are run in parallel across the cluster(s):
- Computation moves to the data
- Multiple instances of a task may be run at once
- Speculative execution guards against task failure
- Tasks can be run rack-aware: a task accesses data that is within the rack it is running on

DFS
Data is stored across one or more clusters:
- Files are stored in blocks
- Block size is optimised for disk-cache size (often 64 MB)
- Blocks are replicated across the network
- Replication adds fault tolerance
- Replication increases the chance that the data is on the same machine as the task needing it
- Blocks are read sequentially and written sequentially
- Blocks are also spread evenly across the cluster

MR and DFS
Tasks and data are centrally managed:
- Dashboard to monitor and manage progress
- Under Hadoop, this is a single point of failure
- Possibility of moving jobs across data centres: take advantage of cheap electricity; deal with load balancing, disasters, etc.

BigTable
BigTable is a form of database:
- Based on a shared-nothing architecture
- Petabyte scaling, across thousands of machines
- Has a simple data model
- Designed for managing structured data: storing web pages, URLs, etc., as key-value pairs
- BigTable provides random access to data
- Can be used as a source and sink for MR jobs


Hadoop: History
- Nutch: started in 2002 by Doug Cutting and Mike Cafarella
- An early open-source web-search engine, written in Java
- Realisation that it would not scale for the Web
- 2004: Google MR and GFS papers appeared
- 2005: working MR implementation for Nutch
- 2006: Hadoop became a standalone project
- 2008: Hadoop broke the world record for sorting 1 TB of data

Overview
A set of components (Java), implementing most of the MR-related ecosystem:
- MapReduce
- HDFS (Hadoop Distributed File System)
- Services on top: HBase (BigTable), Pig (SQL-like job control), job control

Overview
Hadoop supports a variety of ways to implement MR jobs:
- Natively, in Java
- Using the streaming interface: mappers and reducers can be written in any language, at the cost of a performance penalty and restricted functionality
- C++ hooks, etc.

Running a job
To run a streaming job:

    hadoop jar \
        contrib/streaming/hadoop-*-streaming.jar \
        -mapper map -reducer red -input in \
        -output out -file map -file red

map and red are the actual MR program files; in is the input DFS directory and out is the output DFS directory.

Job Control
There is a graphical UI:

    http://namenode.inf.ed.ac.uk:50030/jobtracker.jsp

And also a command-line interface:

    hadoop job -list
    hadoop job -status jobid
    hadoop job -kill jobid


Programming Model
MR offers one restricted version of parallel programming:
- Coarse-grained
- No inter-process communication
- Communication is (generally) through files

Programming Model
(Diagram: five input shards feed Mappers M1-M5; the intermediate output is partitioned across Reducers R1-R3, each of which writes one reducer output.)

Programming Model
Mapping:
- The input data is divided into shards
- The Map operation works over each shard and emits key-value pairs
- Each mapper works in parallel
- Keys and values can be anything that can be represented as a (serialised) string

Programming Model
Reducing:
- After mapping, each key-value pair is hashed on the key
- Hashing sends that key-value pair to a given reducer
- All keys that hash to the same value are sent to the same reducer
- The input to a reducer is sorted on the key
- Sorted input means that related key-value pairs are locally grouped together
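The hashing step described above can be sketched in a few lines of Python; the byte-sum hash, the example pairs and the two-reducer setup are all invented for illustration (Hadoop's real partitioner hashes the serialised key):

```python
def partition(key, num_reducers):
    """Pick the reducer for a key; a stable stand-in for Hadoop's partitioner."""
    return sum(key.encode("utf-8")) % num_reducers

# Invented mapper output: (word, 1) pairs from a word-count job.
pairs = [("the", 1), ("cat", 1), ("the", 1), ("dog", 1)]

# Route every pair to its reducer; equal keys always land together.
reducer_inputs = {r: [] for r in range(2)}
for key, value in pairs:
    reducer_inputs[partition(key, 2)].append((key, value))

# Each reducer then sorts its input on the key before reducing.
for r in reducer_inputs:
    reducer_inputs[r].sort()
```

Because the hash depends only on the key, every copy of "the" is guaranteed to arrive at the same reducer, which is what makes per-key aggregation possible.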

Programming Model
Note:
- Each Mapper and Reducer runs in parallel
- There is no state sharing between tasks
- Task communication is achieved using either external resources or at start-time
- There need not be the same number of Mappers as Reducers
- It is possible to have no Reducers

Programming Model
Note:
- Tasks read their input sequentially: sequential disk reading is far more efficient than random access
- Reducing starts once Mapping ends: sorting, merging, etc. can be interleaved

Example: System Maintenance
- Check the health of a set of machines; report on any that are bad
- The input data will be a list of machine names: borg1.com, borg2.com, ... borg12123.com
- The output will be reports
- There is no need to run a reducing stage

Example: System Maintenance
Mapper:
- Read each line, giving a machine name
- Log into that machine and collect statistics
- Emit the statistics
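A sketch of this map-only job in Python (for the streaming interface); collect_stats is a hypothetical stand-in for the real "log in and gather statistics" step, which would use something like ssh in practice:

```python
import sys

def collect_stats(machine):
    # Hypothetical health probe: a real mapper would log into the
    # machine and gather load, disk and memory figures. Here we only
    # fabricate a placeholder report so the sketch is self-contained.
    return machine + "\tstatus=unknown"

def run_mapper(lines, out):
    # One machine name per input line; emit one report line per machine.
    # With no reducer configured, these lines are the job's final output.
    for line in lines:
        machine = line.strip()
        if machine:
            out.append(collect_stats(machine))
    return out

if __name__ == "__main__":
    for report in run_mapper(sys.stdin, []):
        print(report)
```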

Example: Word Counting
- Count the number of words in a collection of documents
- Our Mapper counts the words in each shard
- The Reducer gathers together the partial counts for a given word and sums them

Example: Word Counting
Mapper:
- For each sentence, emit a (word, 1) pair
- The key is the word; the value is the number 1

Example: Word Counting
Mapper:

    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

Example: Word Counting
Reducer:
- Each Reducer will see all instances of a given word
- Sequential reads of the reducer input give partial counts for a word
- The partial counts can be summed to give the total count

Example: Word Counting
Reducer:

    import sys

    prevword, total = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == prevword or prevword is None:
            total += int(value)
            prevword = word
        else:
            print(prevword + "\t" + str(total))
            prevword, total = word, int(value)
    if prevword is not None:
        print(prevword + "\t" + str(total))

Example: Word Counting
Input sentences: "the cat", "the dog"

Mapper output:

    Key   Value
    the   1
    cat   1
    the   1
    dog   1

Example: Word Counting

    Reducer 1 input:   the, 1   the, 1
    Reducer 2 input:   cat, 1   dog, 1

    Reducer 1 output:  the, 2
    Reducer 2 output:  cat, 1   dog, 1
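The trace above can be reproduced end-to-end with a small in-memory simulation of the map, shuffle and reduce phases. This is a sketch of the model only (the byte-sum hash and the two-reducer setup are invented), not of how Hadoop itself moves data between machines:

```python
from collections import defaultdict

def map_phase(shards):
    # Emit a (word, 1) pair for every word in every shard.
    return [(word, 1) for shard in shards for word in shard.split()]

def shuffle(pairs, num_reducers):
    # Hash each key to a reducer, then sort each reducer's input on the key.
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[sum(key.encode("utf-8")) % num_reducers].append((key, value))
    return {r: sorted(kvs) for r, kvs in buckets.items()}

def reduce_phase(kvs):
    # Sum the sorted partial counts for each word.
    totals = defaultdict(int)
    for key, value in kvs:
        totals[key] += value
    return dict(totals)

shards = ["the cat", "the dog"]
outputs = {r: reduce_phase(kvs)
           for r, kvs in shuffle(map_phase(shards), 2).items()}
```

Merging the per-reducer outputs gives the overall word counts: the appears twice, cat and dog once each.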


Map Reduce Efficiency
MR algorithms involve a lot of disk and network traffic:
- We typically start with Big Data
- Mappers can produce intermediate results that are bigger than the input data
- Task input may not be on the same machine as that task; this implies network traffic
- Per-reducer input needs to be sorted

Map Reduce Efficiency
Sharding might not produce a balanced set of inputs for each Reducer:
- Often, the data is heavily skewed
- E.g. all function words might go to one Reducer
- An imbalanced set of inputs turns a parallel algorithm into a sequential one

Map Reduce Efficiency
Selecting the right number of Mappers and Reducers can improve speed:
- More tasks mean each task might fit in memory / require less network access
- More tasks mean that failures are quicker to recover from
- Fewer tasks have less of an overhead
- This is a matter of guesswork

Map Reduce Efficiency
Algorithmically, we can:
- Emit fewer key-value pairs: each task can locally aggregate results and periodically emit them (this is called combining)
- Change the key: the choice of key implies how the output is partitioned, and some other choice might partition it more evenly
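The "change the key" idea can be illustrated with key salting, a standard trick for skew (not named on the slide): append a small salt to a heavy key so its pairs spread over several reducers, then merge the per-salt totals in a follow-up step. The salt count of 3 and the skewed pairs are arbitrary choices for illustration:

```python
def salted_key(key, index, salts=3):
    # Spread copies of a hot key over `salts` distinct reducer keys.
    return "%s#%d" % (key, index % salts)

# Invented skewed mapper output: "the" dominates.
pairs = [("the", 1)] * 6 + [("cat", 1)]
salted = [(salted_key(key, i), value) for i, (key, value) in enumerate(pairs)]

# Per-salted-key totals, as the reducers would compute them.
totals = {}
for key, value in salted:
    totals[key] = totals.get(key, 0) + value

def unsalt_totals(totals):
    # Follow-up pass: strip the salt and merge the partial totals.
    merged = {}
    for key, count in totals.items():
        base = key.rsplit("#", 1)[0]
        merged[base] = merged.get(base, 0) + count
    return merged
```

The cost of the extra merge pass buys a more even partition: the six "the" pairs now land on three reducer keys instead of one.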

Example: Word Counting Redux
Improve the performance of Word Counting:
- Use HDFS
- Specify the MR program
- Run the job
Note: all commands are for Hadoop 0.19

Example: Word Counting Redux
First we need to upload the data to HDFS. Hadoop has a set of filesystem-like commands to do this:

    hadoop dfs -mkdir data
    hadoop dfs -put file.txt data/

This creates a new directory and uploads the file file.txt to HDFS. We can verify that it is there:

    hadoop dfs -ls data/

Example: Word Counting Redux
Mapper:
- Using Streaming, a Mapper reads from STDIN and writes to STDOUT
- Keys and values are delimited (by default) using tabs
- Records are split using newlines

Example: Word Counting Redux
Mapper:

    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

Example: Word Counting Redux
Reducer:

    import sys

    prevword, total = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == prevword or prevword is None:
            total += int(value)
            prevword = word
        else:
            print(prevword + "\t" + str(total))
            prevword, total = word, int(value)
    if prevword is not None:
        print(prevword + "\t" + str(total))

Example: Word Counting Redux
Improving the Mapper:

    import sys

    wordcounts = {}
    for line in sys.stdin:
        for word in line.split():
            wordcounts[word] = wordcounts.get(word, 0) + 1
    for word, count in wordcounts.items():
        print(word + "\t" + str(count))

Example: Word Counting Redux
Our improved Mapper:
- Only emits one word-count pair per word, per shard
- This uses a Combiner technique
- Uses an unbounded amount of memory

Word counting 2 million tokens (Unix MR simulation):

    Mapper   Naive           Combiner
    Time     1 minute 5 sec  10 sec
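One way to address the unbounded memory use noted above (a sketch, not from the slides) is to flush the in-mapper dictionary whenever it grows past a threshold: this emits a few more pairs than the pure combiner but never holds more than max_words entries. The tiny threshold here is only to make the flushing visible; real jobs would use a far larger value:

```python
import sys

def combining_mapper(lines, out, max_words=2):
    # In-mapper combining with bounded memory: aggregate counts locally,
    # but flush the dictionary whenever it reaches max_words entries.
    wordcounts = {}
    for line in lines:
        for word in line.split():
            wordcounts[word] = wordcounts.get(word, 0) + 1
            if len(wordcounts) >= max_words:
                for w, c in wordcounts.items():
                    out.append(w + "\t" + str(c))
                wordcounts = {}
    for w, c in wordcounts.items():  # final flush of any remaining counts
        out.append(w + "\t" + str(c))
    return out

if __name__ == "__main__":
    for record in combining_mapper(sys.stdin, []):
        print(record)
```

The reducer is unchanged: it still sums whatever partial counts reach it, so correctness does not depend on when the mapper flushes.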


Critique
MR has generated a lot of interest:
- "It solves all scaling problems!"
- "Google use it, so it must be great"
- Start-ups etc. love it, and they generate a lot of chatter in the tech press
- Big companies use DBs, and they don't talk about it
- "Who needs complicated, expensive DBs anyway?"

Background
Google Trends: blue (DBMS mentions), red (Hadoop mentions)

Critique
Stonebraker et al. considered whether MR can replace parallel databases:
- P-DBs have been in development for 20+ years
- Robust, fast, scalable
- Based upon declarative data models

Critique
For which application classes might MR be a better choice than a P-DB?
- Extract-transform-load problems: read data from multiple sources; parse and clean it; transform it; store some of the data
- Complex analytics: multiple passes over the data; computing complex reports, etc.
- Semi-structured data: no single schema for the data (e.g. logs from multiple sources)
- Quick-and-dirty analyses: asking questions over the data, with minimum fuss and effort

Critique
Results indicated:
- For a range of core tasks, a P-DB was faster than Hadoop
- P-DBs are flexible enough to deal with, e.g., semi-structured data (unclear whether this is implementation-specific)
- Hadoop was criticised as being too low-level; higher-level abstractions such as Pig might help
- Hadoop was easier for quick-and-dirty tasks: writing MR jobs can be easier than writing complex SQL queries, and non-specialists can quickly write MR jobs
- Hadoop is a lot cheaper

Critique
MR is not really suited for low-latency problems:
- Its batch nature and lack of real-time guarantees mean you shouldn't use it for front-end tasks
MR is not a good fit for problems which need global state information:
- Many machine learning algorithms require maintenance of centralised information, and this implies a single task
Use the right tool for the job.

Comments
Hadoop is sufficiently mature that it can be used for production:
- Good for batch-based processing
- Scales well on commodity boxes
- Cheap and easy to install
- Large open-source community
- Cloudera offer commercial support