Hadoop Architecture. Part 1




Node, Rack and Cluster: A node is simply a computer, typically non-enterprise commodity hardware, used to store data. Consider we have Node 1; we can then add more nodes, such as Node 2, Node 3, and so on. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in the same rack is greater than the bandwidth between two nodes on different racks. A Hadoop cluster (or just "cluster" from now on) is a collection of racks.

Hadoop has two major components: the Hadoop Distributed File System (HDFS) and the MapReduce engine.

HDFS (Hadoop Distributed File System): HDFS is a distributed file system designed to run on commodity hardware. It runs on top of the existing file system on each node in a Hadoop cluster. It has many similarities with existing distributed file systems, but the differences are significant: HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, which makes it suitable for applications with large data sets.

Seek Operation: The larger the file, the less time Hadoop spends seeking for the next data location on disk and the more time it runs at the bandwidth limit of your disks. Seeks are generally expensive operations, useful only when you need to analyse a small subset of your dataset. Since Hadoop is designed to run over your entire dataset, it is best to minimize seeks by using large files.

Sequential Access: Hadoop is designed for streaming or sequential data access rather than random access. Sequential access means fewer seeks, since Hadoop only seeks to the beginning of each block and then reads sequentially from there. Hadoop uses blocks to store a file or parts of a file.

Portability across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform to another, which facilitates its widespread adoption as a platform of choice for a large set of applications.

HDFS Blocks: Blocks are large in HDFS. They default to 64 MB each, and most systems run with block sizes of 128 MB or larger; compare this to the typical 4 KB block of a UNIX file system. A Hadoop block is a file on the underlying file system, and since the underlying file system stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system. Blocks have several advantages: they are fixed in size, which makes it easy to calculate how many fit on a disk; because a file is made up of blocks that can be spread over multiple nodes, a file can be larger than any single disk in the cluster; and blocks do not waste space, because if a file (or the last chunk of a file) is smaller than the block size, only the needed space is used.
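The block arithmetic above can be sketched in a few lines. This is an illustrative model, not Hadoop code; the 128 MB default is the recommended block size mentioned above, and the 300 MB file size is made up for the example:

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Model how HDFS divides a file into fixed-size blocks.

    Every block but possibly the last is exactly block_size_mb;
    the last block only occupies the space the remaining data needs.
    """
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # partial last block wastes no space
    return blocks

# A 300 MB file: two full 128 MB blocks plus a 44 MB block.
print(split_into_blocks(300))  # [128, 128, 44]
```

Because the block sizes are fixed, it is trivial to compute how many blocks fit on a disk, which is one of the advantages listed above.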

Example: A 420 MB file with 128 MB blocks is split into three full 128 MB blocks plus a fourth block, but the fourth block does not consume a full 128 MB; it holds only the remaining data. Blocks fit well with replication, which allows HDFS to be fault-tolerant and available on commodity hardware. Each block is replicated to multiple nodes. For example, block 1 is stored on node 1 and node 2. This allows for node failure without data loss: if node 1 crashes, node 2 still runs and has block 1's data. In this example we replicate data across only two nodes, but you can set replication across many more nodes by changing Hadoop's configuration, or even set the replication factor for each individual file.

MapReduce Engine: Hadoop's MapReduce is inspired by a paper Google published on the MapReduce technology. A MapReduce program consists of two types of transformations that can be applied to data any number of times: a map function and a reduce function (these functions are also called transformations). A MapReduce job is an executing MapReduce program; it is divided into map tasks that run in parallel with each other, and reduce tasks that run in parallel with each other.

Types of Nodes:
HDFS nodes: NameNode, DataNode
MapReduce nodes: JobTracker, TaskTracker
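To make the two transformations concrete, here is a minimal word-count sketch in plain Python. It is an illustrative model of the map and reduce functions, not Hadoop's actual Java API; the sample lines are made up:

```python
from collections import defaultdict

def map_fn(line):
    """Map transformation: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce transformation: sum all the counts emitted for one word."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]

# Map phase: each line could be handled by a separate map task in parallel.
intermediate = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        intermediate[word].append(count)

# Reduce phase: each word could be handled by a separate reduce task.
result = dict(reduce_fn(w, c) for w, c in intermediate.items())
print(result)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real Hadoop job the intermediate grouping step (the "shuffle") is performed by the framework, and the map and reduce tasks run on different nodes of the cluster.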

Nodes are classified as HDFS or MapReduce nodes. For HDFS nodes we have the NameNode and the DataNodes; for MapReduce nodes we have the JobTracker and the TaskTracker nodes. There are other HDFS nodes, such as the Secondary NameNode, the Checkpoint node, and the Backup node; we will discuss these later. There are several communication paths between the different types of nodes on the system: a client communicates with the JobTracker, and it can also communicate with the NameNode and with any DataNode.

NameNode: There is only one NameNode in the cluster. While the data that makes up a file is stored in blocks at the DataNodes, the metadata for a file is stored at the NameNode. The NameNode is also responsible for the file system namespace. Because there is only one NameNode, you should configure it to write a copy of its state information to multiple locations, such as a local disk and an NFS mount. If there is one node in the cluster on which to spend money for the best enterprise hardware and maximum reliability, it is the NameNode. The NameNode should also have as much RAM as possible, because it keeps the entire file system metadata in memory. Note: don't use inexpensive commodity hardware for the NameNode, and provision it with a large amount of memory.
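The division of labour described above (metadata at the NameNode, data blocks at the DataNodes) can be sketched as a toy in-memory model. The file path, block IDs, and node names below are made up for the example:

```python
# Toy NameNode state: file system metadata only, kept entirely in memory.
namenode_metadata = {
    "/logs/access.log": ["blk_1", "blk_2"],  # file -> the IDs of its blocks
}
block_locations = {
    "blk_1": ["datanode1", "datanode2"],     # block -> DataNodes that hold it
    "blk_2": ["datanode2", "datanode3"],
}

def locate_blocks(path):
    """What a client asks the NameNode: which DataNodes hold my file's blocks?"""
    return [(blk, block_locations[blk]) for blk in namenode_metadata[path]]

# The client then reads each block directly from one of the listed DataNodes;
# the file data itself never passes through the NameNode.
for blk, nodes in locate_blocks("/logs/access.log"):
    print(blk, "->", nodes)
```

This also illustrates why the NameNode needs plenty of RAM: every file and block mapping in the cluster lives in structures like these, held in memory.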

DataNodes: A typical HDFS cluster has many DataNodes, and they store the blocks of data. When a client requests a file, it finds out from the NameNode which DataNodes store the blocks that make up that file, and the client then reads the blocks directly from the individual DataNodes. Each DataNode also reports periodically to the NameNode with the list of blocks it stores. DataNodes do not require expensive enterprise hardware or replication at the hardware layer; they are designed to run on commodity hardware, and replication is provided at the software layer.

JobTracker Node: A JobTracker node manages MapReduce jobs, and there is only one per cluster. It receives jobs submitted by clients, schedules the map tasks and reduce tasks on the appropriate TaskTrackers in a rack-aware manner, and monitors for any failing tasks that need to be rescheduled on a different TaskTracker. To achieve parallelism for your map and reduce tasks, there are many TaskTrackers in a Hadoop cluster. Each TaskTracker spawns Java Virtual Machines to run your map or reduce tasks.
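Because replication is provided at the software layer, the failure of one DataNode is handled by reading a surviving replica of each affected block. A minimal sketch of that logic, with hypothetical block and node names:

```python
# Toy replica map with a replication factor of 2, as in the earlier example.
replicas = {
    "blk_1": {"datanode1", "datanode2"},
}
live_nodes = {"datanode1", "datanode2", "datanode3"}

def readable_from(block):
    """Return the live DataNodes a client could still read this block from."""
    return replicas[block] & live_nodes

live_nodes.discard("datanode1")  # simulate a DataNode crash
print(readable_from("blk_1"))    # {'datanode2'}: the surviving replica
```

A real cluster goes one step further: when the NameNode stops receiving a DataNode's periodic block reports, it schedules re-replication of that node's blocks so the configured replication factor is restored.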