Understanding Hadoop Clusters and the Network




Understanding Hadoop Clusters and the Network. Part 1: Introduction and Overview. Brad Hedlund http://bradhedlund.com http://www.linkedin.com/in/bradhedlund @bradhedlund

Hadoop Server Roles. Clients. Distributed Data Analytics: Map Reduce. Distributed Data Storage: HDFS. Masters: Job Tracker (Map Reduce), Name Node and Secondary Name Node (HDFS). Slaves: a Data Node & Task Tracker pair running on each slave machine.

Hadoop Cluster. The outside world connects through the Client; the Name Node, Job Tracker and Secondary NN are the master machines; the slave machines sit in Rack 1, Rack 2, Rack 3, Rack 4, ... Rack N.

Typical Workflow. Load data into the cluster (HDFS writes). Analyze the data (Map Reduce). Store results in the cluster (HDFS writes). Read the results from the cluster (HDFS reads). Sample Scenario: How many times did our customers type the word "Fraud" into emails sent to customer service? The input is a huge file containing all emails sent to customer service.

Writing files to HDFS. Client: "I want to write Blocks A, B, and C." Name Node: "OK. Write to Data Nodes 1, 5, 6." The Client consults the Name Node, then writes each block directly to one Data Node; the Data Nodes replicate the block, and the cycle repeats for the next block.
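This write path is what the standard Hadoop client library does under the hood. As a minimal sketch (assuming a reachable cluster and a hypothetical destination path), the Java code only opens an output stream; the Name Node consultation and Data Node pipelining happen inside the FileSystem implementation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                    // HDFS client; talks to the Name Node
        Path dest = new Path("/user/demo/emails.txt");           // hypothetical destination path
        try (FSDataOutputStream out = fs.create(dest)) {         // Name Node assigns target Data Nodes
            out.write("example email text\n".getBytes("UTF-8")); // bytes stream block-by-block to Data Nodes
        }
        fs.close();
    }
}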

Hadoop Rack Awareness. Why? The Name Node is rack aware (Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Nodes 5, 6, 7; Rack 9: Data Nodes 9, 10, 11, 12) and keeps metadata such as Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9. Never lose all data if an entire rack fails. Keep bulky flows in-rack when possible. Assumption that in-rack is higher bandwidth, lower latency.
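Rack awareness is not automatic: the Name Node learns the topology from an admin-supplied mapping script. A hedged sketch of the relevant setting, expressed through the Configuration API rather than the usual core-site.xml; the script path is a hypothetical example, and the property name shown is the Hadoop 2.x one (older releases used topology.script.file.name):

import org.apache.hadoop.conf.Configuration;

public class RackAwarenessSetting {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Point Hadoop at a script that maps a host/IP to a rack id such as /rack1.
        // The Name Node uses that mapping when placing and re-replicating blocks.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
        System.out.println("Topology script: " + conf.get("net.topology.script.file.name"));
    }
}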

Preparing HDFS writes. Client: "I want to write Block A." Name Node: "OK. Write to Data Nodes 1, 5, 6." The Client asks Data Node 1 "Ready Data Nodes 5, 6?"; the readiness check is passed along ("Ready?") and acknowledged back ("Ready!") across Rack 1 (DN1) and Rack 5 (DN5, DN6). The Name Node picks two nodes in the same rack, one node in a different rack: data protection, plus locality for M/R.
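The replica count behind this placement policy is a per-file setting, as is the block size. A small sketch, assuming a hypothetical path, of creating a file with an explicit replication factor and block size through the FileSystem API; the Name Node still decides where the replicas land:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/demo/file.txt");        // hypothetical path
        short replication = 3;                           // three replicas per block
        long blockSize = 128L * 1024 * 1024;             // 128 MB blocks (assumed example)
        try (FSDataOutputStream out =
                 fs.create(p, true, 4096, replication, blockSize)) {
            out.writeBytes("payload\n");                 // rack-aware placement is the Name Node's job
        }
        fs.close();
    }
}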

Pipelined Write. Writing Block A to Data Nodes 1, 5 and 6: each Data Node passes the data along to the next as it is received, over TCP port 50010. Rack aware: Rack 1: DN1; Rack 5: DN5, DN6.

Pipelined Write (continued). As each Data Node finishes, it reports "Block received" to the Name Node, and "Success" flows back up the pipeline to the Client. The Name Node records Blk A: DN1, DN2, DN3 in its metadata (Rack 1: DN1; Rack 5: DN2, DN3).

Multi-block Replication Pipeline. Blocks A, B and C are each written through their own replication pipeline, spanning Rack 1, Rack 4 and Rack 5. A 1TB file therefore becomes 3TB of storage and 3TB of network traffic.

Client writes span the HDFS Cluster. A Client's writes end up spread across Rack 1, Rack 2, Rack 3, Rack 4, ... Rack N. Factors: block size and file size. More blocks = wider spread.
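The spread and the cost in the two slides above follow from simple arithmetic: the number of blocks is the file size divided by the block size (rounded up), and each block costs one unit of storage and one network transfer per replica. A small worked sketch, assuming a 1 TB file, 128 MB blocks and 3x replication:

public class HdfsMath {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024 * 1024;  // 1 TB file (assumed example)
        long blockSize = 128L * 1024 * 1024;          // 128 MB block size (assumed)
        int replication = 3;                          // default replication factor

        long blocks  = (fileSize + blockSize - 1) / blockSize;  // ceil(fileSize / blockSize)
        long storage = fileSize * replication;                  // bytes stored across the cluster
        long traffic = fileSize * replication;                  // bytes carried by the write pipelines

        System.out.println("Blocks:  " + blocks);                        // 8192 blocks to spread
        System.out.println("Storage: " + storage / (1L << 40) + " TB");  // 3 TB on disk
        System.out.println("Traffic: " + traffic / (1L << 40) + " TB");  // 3 TB over the network
    }
}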

Data Node writes span itself, and other racks. When a Data Node writes a file such as Results.txt (Blk A, Blk B, Blk C), one replica of each block lands on that node and the other replicas land in other racks (Rack 1, Rack 2, Rack 3, Rack 4, ... Rack N).

Name Node. Each Data Node tells the Name Node "I have blocks A, C. I'm alive!" and the Name Node answers "Awesome! Thanks." Its file system metadata records which blocks each Data Node holds (DN1: A, C; DN2: A, C; DN3: A, C). Data Nodes send Heartbeats over TCP every 3 seconds; every 10th heartbeat is a Block report, and the Name Node builds its metadata from Block reports. If the Name Node is down, HDFS is down.
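The three-second heartbeat is a configurable interval on the Data Node side. A hedged sketch of reading it through the Configuration API; dfs.heartbeat.interval is the stock property name, while the block report interval has its own property whose default varies by release:

import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Seconds between Data Node heartbeats to the Name Node; 3 by default.
        long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3);
        System.out.println("Heartbeat interval: " + heartbeatSeconds + "s");
    }
}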

Re-replicating missing replicas. "Uh Oh! Missing replicas." Name Node metadata: DN1: A, C; DN2: A, C; DN3: A, C. Rack Awareness: Rack 1: DN1, DN2; Rack 5: DN3; Rack 9: DN8. "Copy blocks A, C to Node 8." Missing Heartbeats signify lost Nodes. The Name Node consults its metadata and finds the affected data, consults the Rack Awareness script, then tells a Data Node to re-replicate.

Secondary Name Node. "It's been an hour, give me your metadata." The Secondary Name Node is not a hot standby for the Name Node. It connects to the Name Node every hour* for housekeeping and backup of the Name Node's file system metadata. The saved metadata can rebuild a failed Name Node.
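The "every hour" interval is itself just a configured checkpoint period. A hedged sketch of reading it; dfs.namenode.checkpoint.period is the Hadoop 2.x property name (Hadoop 1.x used fs.checkpoint.period), with a default of 3600 seconds:

import org.apache.hadoop.conf.Configuration;

public class CheckpointPeriod {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Seconds between Secondary Name Node checkpoints of the Name Node metadata.
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        System.out.println("Checkpoint period: " + periodSeconds + "s");
    }
}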

Client reading files from HDFS. Client: "Tell me the block locations of Results.txt." Name Node: "Blk A = 1, 5, 6; Blk B = 8, 1, 2; Blk C = 5, 8, 9." Name Node metadata: Results.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9 (Racks 1, 5, 9). The Client receives a Data Node list for each block, picks the first Data Node for each block, and reads the blocks sequentially.
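The read path can be exercised directly from Java as well. A minimal sketch, assuming a hypothetical path for the Results.txt file from the slide: the client asks for the block locations (the same answer the Name Node gives above), then streams the blocks sequentially:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/demo/Results.txt");      // hypothetical path, assumed to exist

        // Ask the Name Node which Data Nodes hold each block of the file.
        FileStatus status = fs.getFileStatus(p);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
        }

        // Read the blocks sequentially; the client favors the closest Data Node per block.
        try (FSDataInputStream in = fs.open(p)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        }
        fs.close();
    }
}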

Data Node reading files from HDFS. Data Node: "Tell me the locations of Block A." Name Node: "Block A = 1, 5, 6." Rack aware: Rack 1: DN1, 2, 3; Rack 5: DN5. Name Node metadata: Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9 (Racks 1, 5, 9). The Name Node provides rack-local Nodes first, to leverage in-rack bandwidth and a single hop.

Data Processing: Map. "How many times does 'Fraud' appear in the file?" The Client submits the job to the Job Tracker, which asks each Task Tracker to count 'Fraud' in its local block. Map Tasks on Data Nodes 1, 5 and 9 return Fraud = 3, Fraud = 0 and Fraud = 11. Map: "Run this computation on your local data." The Job Tracker delivers Java code to the Nodes that have the data locally.
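The Map step corresponds to a Mapper that scans its local block and emits a count. A minimal sketch using the Hadoop MapReduce API; the class name is hypothetical, and the word being counted is the "Fraud" example from the scenario:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits ("Fraud", n), where n is the number of occurrences in each input line
// of the block this Map Task was given (ideally a block stored on the same node).
public class FraudMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text KEY = new Text("Fraud");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (String token : line.toString().split("\\s+")) {
            if (token.equalsIgnoreCase("fraud")) {
                count++;
            }
        }
        if (count > 0) {
            context.write(KEY, new IntWritable(count));
        }
    }
}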

What if data isn't local? "How many times does 'Fraud' appear in the file?" The Job Tracker wants to run "Count 'Fraud' in Block A," but the node holding that block has no Map task slots left, so another Data Node has to fetch the block ("I need Block A") before it can run the Map Task. The Map Tasks on Data Nodes 5 and 9 still run locally (Fraud = 0, Fraud = 11) across Racks 1, 5 and 9. The Job Tracker tries to select a Node in the same rack as the data, using Name Node rack awareness.

Data Processing: Reduce. The Job Tracker starts a Reduce Task (here on Data Node 3) that collects the intermediate results (X, Y, Z) from the Map Tasks on Data Nodes 1, 5 and 9, sums them ("Sum 'Fraud'"), and writes Results.txt with Fraud = 14 to HDFS. Reduce: "Run this computation across the Map results." Map Tasks deliver their output data over the network; the Reduce Task's output is written to and read from HDFS.
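The Reduce step then sums the per-block counts into the single Fraud = 14 figure and writes it back to HDFS. A sketch of a matching Reducer and a minimal job driver; the job name and the input/output paths are hypothetical, and the Mapper is the one sketched above:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FraudSum {

    // Sums the counts produced by the Map Tasks into one total per key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(key, new IntWritable(total));   // e.g. Fraud 14
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fraud-count");   // hypothetical job name
        job.setJarByClass(FraudSum.class);
        job.setMapperClass(FraudMapper.class);                           // Mapper sketched above
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/emails"));     // hypothetical input
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/results"));  // hypothetical output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}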

Unbalanced Cluster. When new racks of servers (NEW) are added next to Rack 1 and Rack 2, the existing nodes hold all the data. *"I'm bored!" **"I was assigned a Map Task but don't have the block. Guess I need to get it." Hadoop prefers local processing if possible, so the new servers sit underutilized for Map Reduce and HDFS*, and you might see more network bandwidth and slower job times**.

Cluster Balancing. brad@cloudera-1:~$ hadoop balancer. The Balancer utility (if used) runs in the background and spreads blocks onto the new racks. It does not interfere with Map Reduce or HDFS, and its default speed limit is 1 MB/s.
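The 1 MB/s speed limit is the balancer's per-Data-Node bandwidth cap, itself a configuration property. A hedged sketch of reading it; dfs.datanode.balance.bandwidthPerSec is the stock property name, with a default of 1048576 bytes per second:

import org.apache.hadoop.conf.Configuration;

public class BalancerBandwidth {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Bytes per second each Data Node may spend moving blocks for the balancer.
        long bytesPerSec = conf.getLong("dfs.datanode.balance.bandwidthPerSec", 1024 * 1024);
        System.out.println("Balancer bandwidth cap: " + bytesPerSec + " B/s");
    }
}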

Thanks! Narrated at: http://bradhedlund.com/?p=3108