Hadoop IST 734 SS CHUNG




Introduction
What is Big Data?
Bulk amounts of unstructured data
Lots of applications need to handle huge amounts of data (on the order of 500+ TB per day)
If a regular machine transmits 1 TB of data through 4 channels (at roughly 100 MB/s each, about 400 MB/s total): around 43 minutes
What if it were 500 TB?

What is Hadoop?
A framework for large-scale data processing
Inspired by Google's architecture: GFS (Google File System) and MapReduce
Open-source Apache project
Written in Java and shell scripts

Where did Hadoop come from?
Underlying technology invented by Google: Google File System and MapReduce
Grew out of the Nutch search engine project
Entered the Apache Incubator

Hadoop Distributed File System (HDFS)
The storage unit of Hadoop
Relies on the principles of a distributed file system
HDFS has a master-slave architecture
Main components: Name Node (master) and Data Nodes (slaves)
3 replicas of each block (by default)
Default block size: 64 MB

Hadoop
Hadoop Distributed File System (HDFS):
The file system is dynamically distributed across multiple computers
Allows nodes to be added or removed easily
Highly scalable in a horizontal fashion
Hadoop Development Platform:
Uses a MapReduce model for working with data
Users can program in Java, C++, and other languages

Hadoop
Some of the key characteristics of Hadoop:
On-demand services and rapid elasticity
Scalable: can add or remove nodes with little effort or reconfiguration; need more capacity, just assign some more nodes
Resistant to failure: individual node failure does not disrupt the system
Uses off-the-shelf hardware

Hadoop
How does Hadoop work?
Runs on top of multiple commodity systems
A Hadoop cluster is composed of nodes: one master node and many slave nodes
Multiple nodes are used for storing and processing data
The system abstracts the underlying hardware from users and software

Hadoop: HDFS
HDFS is a multi-node system
Name Node (master): single point of failure
Data Node (slave): failure tolerant through data replication
Files are divided into data blocks
Default block size is 64 MB
Default replication of blocks is 3
Blocks are spread out over the Data Nodes

Hadoop Architecture Overview
(diagram: Client, Job Tracker, Task Trackers, Name Node, Data Nodes)

Hadoop Components: Job Tracker
Only one Job Tracker per cluster
Receives job requests submitted by the client
Schedules and monitors jobs on the Task Trackers

Hadoop Components: Name Node
One active Name Node per cluster
Manages the file system namespace and metadata
Single point of failure: a good place to spend money on hardware

Hadoop Components: Task Tracker
There are typically a lot of Task Trackers
Responsible for executing operations
Reads blocks of data from Data Nodes

Hadoop Components: Data Node
There are typically a lot of Data Nodes
Data Nodes manage data blocks and serve them to clients
Data is replicated, so failure is not a problem

Why should I use Hadoop?
Fault-tolerant hardware is expensive
Hadoop is designed to run on commodity hardware
It automatically handles data replication and deals with node failure
It does all the hard work so you can focus on processing data

HDFS: Key Features
Highly fault tolerant (automatic failure recovery)
High throughput
Designed for very large files (sizes in the TB range) that are few in number
Provides streaming access to file system data
Especially good for write-once, read-many files (for example, log files)
Can be built out of commodity hardware; HDFS doesn't need highly expensive storage devices

Who uses Hadoop?

What features does Hadoop offer?
API and implementation for working with MapReduce
Infrastructure: job configuration and efficient scheduling
Web-based monitoring of cluster stats
Handles failures in computation and data nodes
A distributed file system optimized for huge amounts of data

When should you choose Hadoop?
You need to process a lot of unstructured data
Processing needs are easily run in parallel
Batch jobs are acceptable
You have access to lots of cheap commodity machines

When should you avoid Hadoop?
Intense calculations with little or no data
Processing cannot easily run in parallel
Data is not self-contained
You need interactive results

Hadoop Examples
Hadoop would be a good choice for:
Indexing log files
Sorting vast amounts of data
Image analysis
Search engine optimization
Analytics
Hadoop would be a poor choice for:
Calculating Pi to 1,000,000 digits
Calculating Fibonacci sequences
A general RDBMS replacement

Hadoop Distributed File System
HDFS is the Hadoop Distributed File System
Runs entirely in userspace
Inspired by the Google File System
High aggregate throughput for streaming large files
Supports replication and locality features

How HDFS works: Split Data
Data copied into HDFS is split into blocks
Typical HDFS block size is 128 MB (vs. 4 KB on typical UNIX file systems)
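As a rough illustration of the split (a hypothetical helper, not part of the Hadoop API), the number of blocks a file occupies is just a ceiling division of the file size by the block size:

```java
public class BlockMath {
    // Number of HDFS blocks needed for a file: ceiling of fileSize / blockSize.
    public static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with a 128 MB block size splits into 3 blocks:
        // two full 128 MB blocks plus one final 44 MB block.
        System.out.println(numBlocks(300 * mb, 128 * mb));
    }
}
```

Note that the last block is usually smaller than the configured block size; HDFS does not pad it.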

How HDFS works: Replication
Each block is replicated to multiple machines
This allows for node failure without data loss
(diagram: Blocks #1-#3 spread redundantly across Data Nodes 1-3)

HDFS Architecture
(architecture diagram)

Hadoop Modes of Operation
Hadoop supports three modes of operation:
Standalone
Pseudo-distributed
Fully-distributed
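As a sketch, pseudo-distributed mode is selected through the standard Hadoop configuration files (property names shown here follow the Hadoop 1.x conventions used elsewhere in these slides; treat the values as illustrative):

```xml
<!-- conf/core-site.xml: point the file system at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single node can only hold one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Standalone mode uses no HDFS at all (everything runs in one JVM against the local file system), while fully-distributed mode points these settings at a real cluster.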

Name Node
The master of HDFS
Maintains and manages the data on the Data Nodes
A high-reliability machine (can even use RAID); expensive hardware
Stores NO data; just holds the metadata!
Secondary Name Node: periodically reads the file system state from the Name Node's RAM and stores it to hard disk
Active & passive Name Nodes from Gen2 Hadoop

Data Nodes
The slaves in HDFS
Provide data storage
Deployed on independent machines
Responsible for serving read/write requests from clients
The data processing is done on the Data Nodes

HDFS Operation

HDFS Operation (Write)
The client makes a write request to the Name Node
The Name Node responds with information about the available Data Nodes and where the data is to be written
The client writes the data to the addressed Data Node
Replicas of all blocks are created automatically by the data pipeline
If a write fails, the Data Node notifies the client, which obtains a new location to write
If the write completes successfully, an acknowledgement is given to the client
Hadoop uses non-posted writes (the client waits for that acknowledgement)

HDFS: File Write
(diagram)

HDFS: File Read
(diagram)

Hadoop: Hadoop Stack
Hadoop Development Platform
User-written code runs on the system
The system appears to the user as a single entity
The user does not need to worry about the distributed system
Many systems can run on top of Hadoop
Allows further abstraction from the system

Hadoop: Hive & HBase
Hive and HBase are layers on top of Hadoop
HBase & Hive are applications
They provide an interface to the data on HDFS
Other programs or applications may use Hive or HBase as an intermediate layer
(stack diagram: HBase coordinates through ZooKeeper)

Hadoop: Hive
Hive: a data warehousing application
SQL-like commands (HiveQL)
Not a traditional relational database
Scales horizontally with ease
Supports massive amounts of data*
* Facebook has more than 15 PB of information stored in it and imports 60 TB each day (as of 2010)
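To give a feel for HiveQL, here is a short sketch (the table name, columns, and HDFS path are made up for illustration); Hive compiles queries like this into MapReduce jobs over files in HDFS:

```sql
-- Define a table over delimited files already sitting in HDFS
CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, view_time BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Familiar SQL syntax, executed as a batch MapReduce job
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

Because each query runs as a batch job, results arrive in minutes rather than milliseconds, which is why Hive is not a traditional relational database.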

Hadoop: HBase
HBase: no SQL-like language; uses a custom Java API for working with data
Modeled after Google's BigTable
Random read/write operations allowed
Multiple concurrent read/write operations allowed

Hadoop MapReduce
Hadoop has its own implementation of MapReduce
Hadoop 1.0.4 API: http://hadoop.apache.org/docs/r1.0.4/api/
Tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Custom serialization data types (Writable/Comparable):
Text vs String
LongWritable vs long
IntWritable vs int
DoubleWritable vs double
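To make the programming model concrete, here is a minimal plain-Java sketch of the map-shuffle-reduce flow for word counting. It deliberately avoids the Hadoop API (no Writable types, no Job setup, no cluster); it just simulates the three phases with standard collections:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in an input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line)); // in real Hadoop, mappers run on many nodes
        }
        return reduce(pairs);        // and the framework shuffles pairs to reducers
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick brown fox", "the lazy dog")));
    }
}
```

In real Hadoop, `String` becomes `Text` and `int` becomes `IntWritable` so that keys and values can be serialized efficiently between the distributed map and reduce phases; the logic above stays essentially the same.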