A very short Intro to Hadoop


4 Overview
A very short Intro to Hadoop
(photo by: exfordy, flickr)

5 How to Crunch a Petabyte?
- Lots of disks, spinning all the time
- Redundancy, since disks die
- Lots of CPU cores, working all the time
- Retry, since network errors happen

6 Hadoop to the Rescue
- Scalable - many servers with lots of cores and spindles
- Reliable - detects failures, provides redundant storage
- Fault-tolerant - auto-retry, self-healing
- Simple - use many servers as one really big computer

7 Logical Architecture
[diagram: a cluster providing an Execution layer and a Storage layer]
Logically, Hadoop is simply a computing cluster that provides a Storage layer and an Execution layer.

8 Storage Layer
- Hadoop Distributed File System (aka HDFS, or Hadoop DFS)
- Runs on top of the regular OS file system, typically Linux ext3
- Fixed-size blocks (64MB by default) that are replicated
- Write once, read many; optimized for streaming in and out
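To make the block arithmetic concrete, here is a minimal sketch (plain Python, not a Hadoop API) of how a file maps onto replicated 64MB blocks; the block size and replication factor are the defaults named in this deck:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size in this deck: 64MB
REPLICATION = 3                  # default replication factor

def hdfs_footprint(file_bytes):
    """Number of blocks a file occupies, and its total replicated storage."""
    blocks = max(1, math.ceil(file_bytes / BLOCK_SIZE))
    return blocks, file_bytes * REPLICATION

# A 1 GB file becomes 16 blocks of 64MB, stored three times across the cluster.
blocks, raw_bytes = hdfs_footprint(1024 * 1024 * 1024)
```

So large files spread across many disks, while the cluster pays a 3x storage cost for redundancy.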

9 Execution Layer
- Hadoop Map-Reduce
- Responsible for running a job in parallel on many servers
- Handles re-trying a task that fails, validating complete results
- Jobs consist of special map and reduce operations

10 Scalable
[diagram: racks of nodes, each contributing cpu to the Execution layer and disk to the Storage layer]
- Virtual execution and storage layers span many nodes (servers)
- Scales linearly (sort of) with cores and disks

11 Reliable
- Each block is replicated, typically three times
- Each task must succeed, or the job fails

12 Fault-tolerant
- Failed tasks are automatically retried
- Failed data transfers are automatically retried
- Servers can join and leave the cluster at any time

13 Simple
[diagram: many nodes presented as one Execution layer and one Storage layer]
- Reduces complexity
- A conceptual operating system that spans many CPUs & disks

14 Typical Hadoop Cluster
- Has one master server - a high-quality, beefy box
  - NameNode process - manages the file system
  - JobTracker process - manages tasks
- Has multiple slave servers - commodity hardware
  - DataNode process - manages file system blocks on local drives
  - TaskTracker process - runs tasks on the server
- Uses a high-speed network between all servers

15 Architectural Components
- Solid boxes are unique applications
- Dashed boxes are child JVM instances (on the same node as the parent)
- Dotted boxes are blocks of managed files (on the same node as the parent)
[diagram: on the master, a Client submits jobs to the JobTracker and the NameNode tracks data blocks; on the slaves, TaskTrackers receive tasks and spawn mapper and reducer child JVMs, while DataNodes hold the data blocks]

16 Distributed File System
(photo by: Graham Racher, flickr)

17 Virtual File System
- Treats many disks on many servers as one huge, logical volume
- Data is stored in 1...n blocks
- The DataNode process manages blocks of data on a slave
- The NameNode process keeps track of file metadata on the master
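A toy model of that metadata split may help: the NameNode only knows which blocks make up a file, while the DataNodes hold the actual bytes. This is plain Python for illustration (the paths, block ids, and node names are invented), not the HDFS API:

```python
# The NameNode maps each file path to an ordered list of block ids...
namenode = {
    "/logs/2010-01-01.log": ["blk_1", "blk_2"],   # a two-block file
}

# ...and each DataNode stores the bytes of some subset of blocks.
datanodes = {
    "datanode-a": {"blk_1": b"first 64MB...", "blk_2": b"second chunk..."},
    "datanode-b": {"blk_1": b"first 64MB..."},    # replica of blk_1
}

def read_file(path):
    """A client read: ask the NameNode for block ids, then fetch each
    block's bytes from any DataNode that holds a replica."""
    data = b""
    for blk in namenode[path]:
        holder = next(dn for dn, blocks in datanodes.items() if blk in blocks)
        data += datanodes[holder][blk]
    return data
```

Note that the file's bytes never pass through the NameNode dict; the client goes straight to the DataNodes, which is exactly the point made on the DataNodes slide below.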

18 Replication
- Each block is stored on several different disks (default is 3)
- Hadoop tries to copy blocks to different servers and racks
- Protects data against disk, server, and rack failures
- Reduces the need to move data to code

19 Error Recovery
- Slaves constantly check in with the master
- Data is automatically re-replicated if a disk or server goes away

20 Limitations
- Optimized for streaming data in/out, so no random access to data in a file
- Data rates are 30% - 50% of the max raw disk rate
- No append support currently - write once, read many

21 NameNode
- Runs on the master node
- Is a single point of failure; there is no built-in software hot-failover mechanism
- Maintains the filesystem metadata - the namespace of files and hierarchical directories

22 DataNodes
- Files stored on HDFS are chunked and stored as blocks on the DataNodes
- Each DataNode manages the storage attached to the node it runs on
- Data never flows through the NameNode, only the DataNodes

23 Map-Reduce Paradigm
(photo by: Graham Racher, flickr)

24 Definitions
- Key-Value Pair -> two units of data, exchanged between Map & Reduce
- Map -> the user-defined map function in the MapReduce algorithm; converts each input Key-Value Pair to 0...n output Key-Value Pairs
- Reduce -> the user-defined reduce function in the MapReduce algorithm; converts each input Key + all its Values to 0...n output Key-Value Pairs
- Group -> a built-in operation that happens between Map and Reduce; ensures each Key passed to Reduce arrives with all of its Values

25 All Together
[diagram: many Mappers feed a Shuffle stage, which feeds the Reducers]

26 How MapReduce Works
- Map translates input keys and values to new keys and values: [K1,V1] -> Map -> [K2,V2]
- The system groups each unique key with all its values: [K2,V2] -> Group -> [K2,{V2,V2,...}]
- Reduce translates the values of each unique key to new keys and values: [K2,{V2,V2,...}] -> Reduce -> [K3,V3]
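The three stages can be sketched as a single-process model in plain Python - an illustration of the Map -> Group -> Reduce data flow, not the Hadoop API:

```python
from itertools import groupby

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal single-process model of the Map -> Group -> Reduce flow."""
    # Map: each [K1,V1] record yields zero or more [K2,V2] pairs.
    mapped = [kv for record in records for kv in map_fn(record)]
    # Group: sort by key and collect all values for each unique key,
    # as the framework's built-in Group/Shuffle stage does.
    mapped.sort(key=lambda kv: kv[0])
    grouped = [(k, [v for _, v in kvs])
               for k, kvs in groupby(mapped, key=lambda kv: kv[0])]
    # Reduce: each [K2,{V2,V2,...}] yields zero or more [K3,V3] pairs.
    return [kv for k, vs in grouped for kv in reduce_fn(k, vs)]

# Example: sum values per key.
result = run_mapreduce(
    [("a", 2), ("b", 1), ("a", 3)],
    map_fn=lambda kv: [kv],                      # identity Map
    reduce_fn=lambda k, vs: [(k, sum(vs))],      # summing Reduce
)
# result == [("a", 5), ("b", 1)]
```

Only the map_fn and reduce_fn are user-defined; the grouping in the middle is what the framework supplies.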

27 Canonical Example - Word Count
- Read a document, parse out the words, count the frequency of each word
- Specifically, in MapReduce, with a document consisting of lines of text:
  - Translate each line of text into key = word and value = 1, e.g. <the,1> <quick,1> <brown,1> <fox,1> <jumped,1>
  - Sum the values (ones) for each unique word

28
[K1,V1] - lines keyed by line number:
[1, When in the Course of human events, it becomes]
[2, necessary for one people to dissolve the political bands]
[3, which have connected them with another, and to assume]
[n, ...]
-> Map (User Defined) -> [K2,V2]:
[When,1] [in,1] [the,1] [Course,1] ... [k,1]
-> Group -> [K2,{V2,V2,...}]:
[When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] ... [k,{v,v,...}]
-> Reduce (User Defined) -> [K3,V3]:
[When,4] [people,6] [dissolve,3] [connected,5] ... [k,sum(v,v,v,...)]
Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
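The word-count flow above can be reproduced as a small self-contained sketch. Plain Python stands in for the user-defined Map and Reduce functions and a dict stands in for the framework's Group stage; this is not Hadoop code, and the sample lines are just illustrative:

```python
from collections import defaultdict

def map_fn(line_no, line):
    # Map: emit <word, 1> for every word on the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, ones):
    # Reduce: sum the ones for each unique word.
    yield word, sum(ones)

lines = ["the quick brown fox", "the lazy dog and the fox"]

# Group: between Map and Reduce, the framework collects all values per key.
grouped = defaultdict(list)
for n, line in enumerate(lines, 1):
    for word, one in map_fn(n, line):
        grouped[word].append(one)

counts = dict(kv for word, ones in grouped.items()
              for kv in reduce_fn(word, ones))
# counts["the"] == 3, counts["fox"] == 2, counts["quick"] == 1, ...
```

The only job-specific logic is the two small functions; everything else is the framework's fixed machinery.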

29 Divide & Conquer (splitting data)
- Because:
  - the Map function only cares about the current key and value, and
  - the Reduce function only cares about the current key and its values
- Then:
  - a Mapper can invoke Map on any number of input keys and values - just some fraction of the input data set, and
  - a Reducer can invoke Reduce on any subset of the unique keys, as long as it sees all the values for each key

30
[diagram: an input file is divided into splits; each split goes to a Mapper ([K1,V1] -> [K2,V2]), the Shuffle routes grouped keys ([K2,{V2,V2,...}]) to Reducers, and each Reducer ([K3,V3]) writes one part-0000n file into the output directory]
- Mappers must complete before Reducers can begin

31 Divide & Conquer (parallelizable)
- Because:
  - each Mapper is independent and processes part of the whole, and
  - each Reducer is independent and processes part of the whole
- Then:
  - any number of Mappers can run on each node,
  - any number of Reducers can run on each node, and
  - the cluster can contain any number of nodes

32 JobTracker
- Is a single point of failure
- Determines the # of Mapper Tasks from the file splits, via InputFormat
- Uses a predefined value for the # of Reducer Tasks
- Client applications use JobClient to submit jobs and query status
- Command line: use hadoop job <commands>
- Web status console: use http://jobtracker-server:50030/

33 TaskTracker
- Spawns each Task as a new child JVM
- The max # of mapper and reducer tasks are set independently
- Can pass child JVM opts via mapred.child.java.opts
- Can re-use JVMs to avoid the overhead of task initialization
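These knobs live in the job configuration. As a hedged sketch, entries of this shape appeared in mapred-site.xml in the Hadoop 0.x/1.x era this deck describes (property names from that era; the values here are purely illustrative):

```xml
<configuration>
  <!-- Options passed to each spawned child JVM (heap size is illustrative) -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- Max mapper and reducer tasks per TaskTracker, set independently -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- Re-use a JVM for this many tasks of a job; -1 means unlimited re-use -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>
```

Later Hadoop releases renamed these properties, so treat the names above as tied to the version of Hadoop this deck covers.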

34 Hadoop (sort of) Deep Thoughts
(photo by: Graham Racher, flickr)

35 Avoiding Hadoop
- Hadoop is a big hammer - but not every problem is a nail
- Small data
- Real-time data
- Beware the Hadoopaphile

36 Avoiding Map-Reduce
- Writing Hadoop code is painful and error-prone
- Hive & Pig are good solutions for People Who Like SQL
- Cascading is a good solution for complex, stable workflows

37 Leveraging the Eco-System
- Many open source projects are built on top of Hadoop:
  - HBase - scalable NoSQL data store
  - Sqoop - getting SQL data in/out of Hadoop
- Other projects work well with Hadoop:
  - Kafka/Scribe - getting log data in/out of Hadoop
  - Avro - data serialization/storage format

38 Get Involved
- Join the mailing list - http://hadoop.apache.org/mailing_lists.html
- Go to user group meetings - e.g. http://hadoop.meetup.com/

39 Learn More
- Buy the book - Hadoop: The Definitive Guide, 2nd edition
- Try the tutorials - http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
- Get training (danger, personal plug):
  - http://www.scaleunlimited.com/training
  - http://www.cloudera.com/hadoop-training

40 Resources
- Scale Unlimited Alumni list - scale-unlimited-alumni@googlegroups.com
- Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html
- Users groups - e.g. http://hadoop.meetup.com/
- Hadoop API - http://hadoop.apache.org/common/docs/current/api/
- Hadoop: The Definitive Guide, 2nd edition, by Tom White
- Cascading - http://www.cascading.org
- Datameer - http://www.datameer.com