Introduction to Hadoop. New York Oracle User Group Vikas Sawhney




GENERAL AGENDA
Driving Factors behind BIG-DATA
NoSQL Database
2014 Database Landscape
Hadoop Architecture
Map/Reduce
Hadoop Eco-system
Hadoop Releases
Hadoop 2.0

NO-SQL BIG Data Revolution
Ease of elastic (horizontal) scalability, versus buying a bigger hardware box at higher cost
Costs associated with an RDBMS when scaling horizontally (adding hardware) and vertically (larger boxes)
Schema-less design for dynamic streams of data that do not fit into a standard data model
Ability to manage large volume, velocity, and variety of data
RDBMS databases are feature rich but lose some scalability due to constraints such as referential integrity and data consistency

NO-SQL DATABASE DEFINED A NoSQL or Not Only SQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. The data structure (e.g., key-value, graph, or document) differs from an RDBMS, and therefore some operations are faster in NoSQL and some in an RDBMS. There are differences, though, and the particular suitability of a given NoSQL DB depends on the problem to be solved (e.g., does the solution use graph algorithms?). NoSQL databases are finding significant and growing industry use in big data and real-time web applications. NoSQL systems are also referred to as "Not only SQL" to emphasize that they may in fact allow SQL-like query languages to be used. Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and the huge investments already made in SQL by enterprises. (Ref: Wikipedia)

CAP Theorem

NO-SQL Database using the Hadoop framework Hadoop refers to an ecosystem of software packages, including MapReduce, HDFS, and a whole host of other software packages that support the import and export of data into and out of HDFS (the Hadoop Distributed File System). When someone says, "I have a Hadoop cluster," they generally mean a cluster of machines all running in this general ecosystem, with a large distributed filesystem to support large-scale computation. NoSQL refers to non-relational, or at least non-SQL, database solutions such as HBase (also part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, and CouchDB.

The NYC Subway MAP of Data Platforms

HADOOP An open-source software platform providing a scalable, distributed computing framework for the analysis of data. Hadoop provides fast and reliable analysis of structured, unstructured, and polymorphic data. Apache Hadoop is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.

HADOOP ARCHITECTURE

HADOOP ARCHITECTURE

Hadoop cluster requirements Prerequisites:
System: Mac OS / Linux
Java 6 installed
Dedicated user for hadoop
SSH configured

HDFS Daemons
Name Node: keeps the metadata of all files/blocks in the file system, and tracks where across the cluster the file data is kept
Data Node: actually stores the data in the Hadoop filesystem
Secondary Name Node: performs periodic checkpoints of the Name Node metadata

Name Node

DATA Replication

HDFS ARCHITECTURE

HDFS A distributed file system that provides high-throughput access to application data. HDFS uses a master/slave architecture in which one device (master), termed the NameNode, controls one or more other devices (slaves), termed DataNodes. It breaks data/files into blocks (128 MB per block by default), stores them on DataNodes, and replicates each block on other nodes to achieve fault tolerance. The NameNode keeps track of the blocks written to the DataNodes.
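To make the client, NameNode, and DataNode interaction concrete, here is a minimal sketch using the HDFS Java API. The NameNode address and file path are assumptions for illustration; a real deployment would pick up fs.defaultFS from core-site.xml.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hello.txt");  // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the block data to those DataNodes directly.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, the bytes come from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```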

HDFS DAEMONS NAME NODE DATA NODE SECONDARY NAME NODE

NAME NODE Keeps the metadata of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. The NameNode stores its metadata in two files: fsimage and the edit log.

NAME NODE (Contd.) fsimage - the snapshot of the filesystem when the NameNode started. Edit logs - the sequence of changes made to the filesystem after the NameNode started.

DATA NODE A DataNode stores data in the Hadoop filesystem. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
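As a small illustration of that split between metadata and data, the following sketch (file path assumed) asks the NameNode for the block locations of a file; the bytes themselves would then be read from the listed DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/hello.txt"));  // hypothetical file

        // The NameNode answers this metadata query; no file data is transferred.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```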

SECONDARY NAME NODE The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event can it replace the primary name-node in case of its failure. The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads the current name-node image and edit log files, merges them into a new image, and uploads the new image back to the (primary and only) name-node.

SECONDARY NAME NODE (Contd.)

NAME NODE INTERFACE

APACHE MAP/REDUCE A software framework for distributed processing of large data sets The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. It splits the input data set into independent chunks that are processed in a completely parallel manner. MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
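The canonical WordCount job illustrates the model: the map function emits (word, 1) pairs, the framework sorts and groups them by word, and the reduce function sums the counts. A minimal sketch:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework has already sorted and grouped by word; sum the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```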

MAP REDUCE DATAFLOW
An input reader
A Map function
A partition function
A compare function
A Reduce function
An output writer
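A driver wires these pieces together. The sketch below reuses the WordCount classes above; the input reader, partition function, compare function, and output writer all fall back to Hadoop's defaults (TextInputFormat, HashPartitioner, the key's natural ordering, TextOutputFormat) unless custom classes are set on the Job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map and reduce functions from the sketch above.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input reader and output writer: text formats by default.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Partition and compare steps use HashPartitioner and the key's
        // natural ordering unless overridden on the job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```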

MAP/REDUCE ARCHITECTURE

MAPREDUCE DAEMONS JOB TRACKER TASK TRACKER

JOBTRACKER (Hadoop v1) The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. The JobTracker performs the following actions:
It accepts MapReduce jobs from client applications
Talks to the NameNode to determine the data location
Locates an available TaskTracker node
Submits the work to the chosen TaskTracker node

TASKTRACKER (Hadoop v1) A TaskTracker node accepts map, reduce, or shuffle operations from a JobTracker. It is configured with a set of slots, which indicate the number of tasks it can accept. The JobTracker looks for a free slot when assigning a job. The TaskTracker notifies the JobTracker of the job's success status. The TaskTracker also sends heartbeat signals to the JobTracker to confirm its availability, and reports the number of free slots it has available.

HADOOP WEB INTERFACES
MapReduce Job Tracker Web Interface: http://localhost:50030/
Task Tracker Web Interface: http://localhost:50060/
HDFS Name Node Web Interface: http://localhost:50070/

HADOOP STACK

Hadoop Ecosystems Project

HADOOP ECOSYSTEM PROJECTS The Hadoop ecosystem includes:
Hadoop Common: the common utilities
Avro: a data serialization system with scripting-language integration
Chukwa: data collection for managing large distributed systems
HBase: a scalable, distributed database for large tables
HDFS: a distributed file system
Hive: data summarization and ad hoc querying
MapReduce: distributed processing on compute clusters
Pig: a high-level data-flow language for parallel computation
ZooKeeper: a coordination service for distributed applications

Hadoop releases Available versions:
1.2.X - current stable version, 1.2 release
2.2.X - current stable version, 2.x release
2.4.X - current beta release
0.23.X - similar to 2.X.X but missing NN HA

Release Lifecycle

Notable editions Hadoop 2.0
MapReduce processing restructured with YARN
NameNode HA: automated failover with a hot standby and resiliency for the NameNode master service
Snapshots: point-in-time recovery for backup, disaster recovery, and protection against user errors
Federation: allows multiple NameNodes to manage namespaces for a Hadoop cluster. With HDFS federation, multiple NameNode servers manage namespaces, which allows for horizontal scaling, performance improvements, and multiple namespaces
Speed improvements in Hive's SQL-like query language
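Snapshots are exposed through the FileSystem API as well as the hdfs shell. A minimal sketch, with the directory name assumed for illustration; an administrator must first allow snapshots on that directory.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/data/important");  // hypothetical directory

        // Prerequisite (run once by an administrator):
        //   hdfs dfsadmin -allowSnapshot /data/important
        Path snapshot = fs.createSnapshot(dir, "before-cleanup");
        System.out.println("Created snapshot at " + snapshot);
    }
}
```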

NextGen MapReduce (YARN) The fundamental idea of YARN is to split the two major responsibilities of the JobTracker and TaskTracker into separate entities. In Hadoop 2.0, the JobTracker and TaskTracker no longer exist and have been replaced by three components:
ResourceManager: a scheduler that allocates available resources in the cluster amongst the competing applications.
NodeManager: runs on each node in the cluster and takes direction from the ResourceManager. It is responsible for managing resources available on a single node.
ApplicationMaster: an instance of a framework-specific library, an ApplicationMaster runs a specific YARN job and is responsible for negotiating resources from the ResourceManager and working with the NodeManager to execute and monitor Containers.
The actual data processing occurs within the Containers executed by the ApplicationMaster. A Container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host. Ref: Strata (Rich Raposa)
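A small sketch of talking to the ResourceManager through the YarnClient API to list the NodeManagers and the resources they offer; configuration is taken from yarn-site.xml, and the printed fields are only examples.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Each report describes one NodeManager and the resources it offers.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + " memory=" + node.getCapability().getMemory() + "MB"
                    + " vcores=" + node.getCapability().getVirtualCores());
        }
        yarn.stop();
    }
}
```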

YARN functions

MapReduce Master/Slave Architecture (Hadoop 1)

YARN Master/Slave Architecture (Hadoop 2)

References
http://hadoop.apache.org/
http://wiki.apache.org/hadoop/
http://hadoop.apache.org/core/docs/current/hdfs_design.html
http://wiki.apache.org/hadoop/faq