Processing of massive data: MapReduce. 2. Hadoop. New Trends in Distributed Systems, MSc Software and Systems




Processing of massive data: MapReduce 2. Hadoop 1

MapReduce Implementations Google was the first to apply MapReduce to big data analysis. The idea was introduced in their seminal paper MapReduce: Simplified Data Processing on Large Clusters. They were also, of course, the first to develop a MapReduce framework. Google has published plenty of information about how their MapReduce implementation and related technologies work, but has not released the software itself. Building on Google's work, others have implemented their own MapReduce frameworks 2

MapReduce Implementations (2) Disco, by Nokia (http://discoproject.org/, based on Python) Skynet, by Geni (http://skynet.rubyforge.org/, based on Ruby) Some companies offer data analysis services based on their own MapReduce platforms (Aster Data, Greenplum) Hadoop (http://hadoop.apache.org/) is the most popular open-source MapReduce framework Amazon's Elastic MapReduce offers a ready-to-use Hadoop-based MapReduce service on top of their EC2 cloud. It is not a new MapReduce implementation, and it does not provide extra analytical services either 3

About Hadoop Hadoop is a project of the Apache Software Foundation (ASF), which develops software for the distributed processing of large data sets Several software projects are part of Hadoop, or are related to it: Core: Hadoop Common, Hadoop Distributed File System, Hadoop MapReduce Related: HBase, Pig, Cassandra, ZooKeeper... 4

Hadoop Core Projects Hadoop Common: common functionality used by the rest of the Hadoop projects (serialization, RPC...) Hadoop Distributed File System (HDFS), based on the Google File System (GFS), provides a distributed and fault-tolerant storage solution Hadoop MapReduce, the implementation of MapReduce (the three projects are distributed together) 5

Hadoop Software Ecosystem [Diagram: the Hadoop software ecosystem, grouping Hive, Mahout, Pig, HBase, ZooKeeper (*) and Cassandra (*) into data analysis solutions, NoSQL databases, related projects and other Hadoop sub-projects, all built around the Hadoop Core (Hadoop MapReduce, Hadoop Distributed File System).] (*) ZooKeeper and Cassandra are related to Hadoop because they have functionality similar to, or are used by, Hadoop sub-projects, but they do not depend on Hadoop's software 6

Hadoop Distributed File System Before focusing on Hadoop MapReduce we must understand how HDFS works HDFS is based on the Google File System (GFS), which was created to meet Google's need for a distributed file system (DFS) It pursued the typical goals of a DFS: performance, availability, reliability, scalability But it was also designed taking into account the characteristics of Google's environment and the needs of its applications HDFS was built following the same premises and architecture 7

HDFS Design Features HDFS design assumes that: Failures are the norm (not the exception) Constant monitoring, error detection, fault tolerance and automatic recovery are required Files will be big (TBs), and contain billions of app objects Small files are possible, but not a goal for optimization Once written and closed, files are only read and often only sequentially Batch-like analysis applications are the target Only one writer at any time High data access bandwidth is preferred to low latency for individual operations 8

HDFS Architecture HDFS files are split into blocks Two types of nodes: NameNode, unique, manages the namespace: it maps file paths and names to blocks and their locations, and decides where each data block replica is stored DataNodes keep the data blocks and serve read/write requests from clients [Diagram: the client sends namespace operations (create, delete, open, close) to the NameNode, whose table maps the file path /path_to_file to its blocks b1 and b2 and their replicas (b1 on DataNodes a, b, d; b2 on DataNodes b, c, d); the client then reads/writes the blocks directly on DataNodes a-d.] 9

HDFS Replicas Placement Where to store each block replica is decided by the NameNode Bandwidth among nodes in the same rack is greater than between racks Default: 3 copies per block, one on the local rack and the other two on a remote rack [Diagram, shown in three steps: 1) the client asks the NameNode where to store data block b1; 2) the NameNode answers with DataNodes a, c and d; 3) the client writes b1 to DataNodes a, c and d, the DataNodes being spread over two racks connected by a switch.] 10

HDFS Replicas Placement (2) This placement policy balances: Availability: three nodes, or two racks, must fail to make the block unavailable Lower write times: data moves only once through the switch High read throughput: readers access the closest replica; with this policy it is likely that a reader is on the same node (or at least the same rack) as the block 14
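For reference (not part of the original slides), the replication factor of a file can also be adjusted programmatically through the Hadoop Java API. A minimal sketch, assuming a file that already exists in HDFS (the path is hypothetical):

// Sketch: adjusting the replication factor of an HDFS file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // handle to the configured (HDFS) file system
    Path file = new Path("/path_to_file");      // hypothetical file used in the slides' example
    // The cluster-wide default comes from the dfs.replication property (3 by default);
    // here we ask the NameNode to keep 3 replicas of each block of this particular file.
    fs.setReplication(file, (short) 3);
    fs.close();
  }
}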

HDFS Consistency Considerations HDFS keeps several replicas of each data block Given that: Once written and closed, files are only read There will be only one writer at any time Then consistency among replicas is greatly simplified GFS also makes some assumptions that simplify consistency management However it is not as restrictive as HDFS: Concurrent writers are allowed Potential inconsistencies must be addressed at app. level 15

HDFS NameNode Keeps the state of the HDFS, using An in-memory structure that stores the filesystem metadata A local file with a copy of that metadata at boot time (FsImage) A local file that logs all HDFS file operations (EditLog) At boot time the NameNode: Reads FsImage to get the HDFS state Applies all changes recorded in EditLog Saves the new state to the FsImage file (checkpoint) Truncates the EditLog file At run time the NameNode: Stores all file operations on the EditLog 16

HDFS Secondary NameNode Optionally, a secondary NameNode can be used It merges the FsImage and EditLog of the NameNode to create a new FsImage This prevents the EditLog from growing without bound Merging is a costly process that takes time and would force the NameNode to go offline Once the new FsImage is built, it is sent back to the NameNode The Secondary NameNode does not replace the primary NameNode Its copy of the FsImage could be used in case of failure, but it would lack the latest changes recorded in the EditLog 17

HDFS Data Location Awareness Clients can query the NameNode about where data replicas are located (DataNode addresses) This information can be used to move code instead of data Data can be in the order of TBs Moving data to the node where the processing software runs would be inefficient Moving code to the host(s) where the data is stored is faster and demands fewer resources 18
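As an illustration (not from the slides), the block locations of a file can be queried through the FileSystem Java API; a minimal sketch, with a hypothetical file path:

// Sketch: asking the NameNode where the blocks of a file are stored.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/path_to_file");                     // hypothetical file
    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block, covering the whole file length
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Hosts holding a replica of this block; a scheduler can run code close to them
      for (String host : block.getHosts()) {
        System.out.println(block.getOffset() + " -> " + host);
      }
    }
    fs.close();
  }
}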

Data Correctness HDFS checks the correctness of data using checksums The last node in the write pipeline verifies that the checksum of the received block is correct Also, each DataNode periodically checks the checksums of all its blocks Corrupted blocks are reported to the NameNode, which will replace them with a copy from another replica 19
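Checksum verification is transparent to clients, but the FileSystem API exposes a couple of related calls; a minimal sketch, added here as an illustration (hypothetical path):

// Sketch: checksum-related calls in the FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/path_to_file");            // hypothetical file
    fs.setVerifyChecksum(true);                       // verify checksums on read (the default)
    FileChecksum checksum = fs.getFileChecksum(file); // whole-file checksum computed by HDFS
    System.out.println(checksum);                     // may be null on file systems without checksums
    fs.close();
  }
}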

Other HDFS Features/APIs APIs for compression (gzip, bzip2...) Easy conversion to binary formats when reading/writing SequenceFile, used to store sequential data as binary key-value records MapFile, a SequenceFile that allows random access by key (i.e. it is a persistent map) 20
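As an illustration (not from the slides), writing key-value records to a SequenceFile in HDFS could look like the following minimal sketch (classic Hadoop 1.x API; the output path and records are hypothetical):

// Sketch: writing binary key-value records with SequenceFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/pages.seq");            // hypothetical output file
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, file, Text.class, IntWritable.class);
    try {
      writer.append(new Text("http://host1/path1"), new IntWritable(2)); // one key-value record
      writer.append(new Text("http://host2/path2"), new IntWritable(0));
    } finally {
      writer.close();
    }
  }
}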

Interacting with HDFS HDFS offers a Java API based on the FileSystem class Java apps can access HDFS as a typical file system There is a C API (libhdfs) with a similar interface HDFS can also be used from the shell through the hdfs command HDFS can be accessed remotely through an RPC interface (based on the Apache Thrift middleware), as well as through WebDAV, FTP and HTTP 21
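As an illustration of the Java API (not from the slides), a minimal sketch that creates a file in HDFS and reads it back, using hypothetical paths:

// Sketch: basic use of the FileSystem API to write and read an HDFS file.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());  // configured default FS (HDFS)
    Path file = new Path("/tmp/hello.txt");               // hypothetical path

    FSDataOutputStream out = fs.create(file, true);       // create/overwrite the file
    out.writeBytes("hello hdfs\n");
    out.close();                                          // the file becomes visible to readers

    FSDataInputStream in = fs.open(file);                 // open for sequential reading
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    System.out.println(reader.readLine());
    reader.close();
    fs.close();
  }
}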

Hadoop MapReduce Once we have seen how HDFS works, we can focus on Hadoop MapReduce It is a distributed implementation of the MapReduce abstraction It can work on top of the local file system But it is intended to work on top of HDFS 22

Hadoop MapReduce Architecture A MapReduce computation is called a Job; each Job is split into Tasks, where each task executes map or reduce over part of the data Two types of nodes: JobTracker, unique, manages the Jobs and monitors the TaskTrackers TaskTracker, executes Map and Reduce tasks as commanded by the JobTracker [Diagram: the client submits ("run job") a job to the JobTracker, which tells TaskTracker A to run a map task and TaskTracker B to run a reduce task.] 23

Hadoop MapReduce Deployment Usually HDFS and MapReduce will run in the same datacenter The job will be programmed so input data and results are stored in HDFS Each physical host can run one (or more) DataNodes and one (or more) TaskTrackers This way, data locality can be exploited [Diagram: the client talks to the JobTracker and the NameNode; each worker host runs a TaskTracker next to a DataNode (TaskTracker A with DataNode a, TaskTracker B with DataNode b).] 24

Hadoop Code Example Example of MapReduce programming Problem The input is a file that contains links between web pages as follows: http://host1/path1 > http://host2/path2 http://host1/path3 > http://host2/path4... The output is a file that for each web page gives the number of outgoing and incoming connections: http://host1/path1 2 4 http://host2/path2 0 1... 25

Hadoop Code Example, main

public class TokenizerMapper
    extends Mapper<Object, Text, Text, IntArrayWritable> { }

public class IntSumReducer
    extends Reducer<Text, IntArrayWritable, Text, IntArrayWritable> { }

public static void main(String[] args) throws Exception {
  Job job = new Job(new Configuration(), "webgraph");
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntArrayWritable.class);
  // input/output path configuration and job submission are not shown here
}
26

Hadoop Code Example, Mapper

public class TokenizerMapper
    extends Mapper<Object, Text, Text, IntArrayWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final static IntWritable cero = new IntWritable(0);
  private final static IntArrayWritable origin =
      new IntArrayWritable(new IntWritable[]{one, cero});
  private final static IntArrayWritable destination =
      new IntArrayWritable(new IntWritable[]{cero, one});
  private Text word = new Text();
}
30
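The example relies on an IntArrayWritable type, which is not part of the standard Hadoop API and is not shown in the slides; a minimal sketch of such a class, following the usual ArrayWritable subclassing pattern (an assumption on our part):

// Sketch of the IntArrayWritable helper type assumed by the example:
// an ArrayWritable restricted to IntWritable elements.
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;

public class IntArrayWritable extends ArrayWritable {
  public IntArrayWritable() {
    super(IntWritable.class);          // needed by Hadoop to deserialize instances
  }
  public IntArrayWritable(IntWritable[] values) {
    super(IntWritable.class, values);  // convenience constructor used by the Mapper
  }
}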

Hadoop Code Example, Mapper (2)

public class TokenizerMapper
    extends Mapper<Object, Text, Text, IntArrayWritable> {

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    BufferedReader reader = new BufferedReader(new StringReader(value.toString()));
    while (true) {
      String line = reader.readLine();
      if (line == null) {
        reader.close();
        break;
      }
      line = line.trim();
      if (line.isEmpty()) continue;
      String[] urls = line.split(WebGraph.URLS_SEPARATOR);
      if (urls.length != 2) {
        context.setStatus("malformed link found: " + value.toString());
        return;
      }
      String urlOrigin = urls[0];
      String urlDest = urls[1];
      word.set(urlOrigin);
      context.write(word, origin);       // one outgoing link for the origin page
      word.set(urlDest);
      context.write(word, destination);  // one incoming link for the destination page
    }
  }
}
31

Hadoop Code Example, Reducer

public class IntSumReducer
    extends Reducer<Text, IntArrayWritable, Text, IntArrayWritable> {

  private IntArrayWritable result = new IntArrayWritable();

  public void reduce(Text key, Iterable<IntArrayWritable> values, Context context)
      throws IOException, InterruptedException {
    int sumLinksOrig = 0;
    int sumLinksDest = 0;
    for (IntArrayWritable val : values) {
      Writable[] intArray = val.get();
      sumLinksOrig += ((IntWritable) intArray[0]).get();  // outgoing links counted by the mapper
      sumLinksDest += ((IntWritable) intArray[1]).get();  // incoming links counted by the mapper
    }
    result.set(new IntWritable[]{new IntWritable(sumLinksOrig), new IntWritable(sumLinksDest)});
    context.write(key, result);
  }
}
32

Bibliography (Paper) The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. SOSP 2003 (Book) Hadoop: The Definitive Guide (2nd edition). Tom White. Yahoo Press, 2010 33

(Example with Hadoop) 34