Big Data With Hadoop




Saurabh Singh (singh.903@osu.edu), The Ohio State University. February 11, 2016

Overview:
1. Requirements
2. Ecosystem
3. Resilient Distributed Datasets (RDDs)
4. Example Code
5. Spark vs MapReduce

Source: [Tutorials Point]

Continuous scaling of traditional database servers fails to meet these requirements! Source: [Tutorials Point]

Types of Big Data:
- Structured data: relational data
- Semi-structured data: XML data
- Unstructured data: Word, PDF, text, media logs
Challenges with Big Data:
- Capturing data
- Storage
- Searching
- Sharing
- Transfer
- Analysis
- Presentation

What is Hadoop? Hadoop is an open-source library for data-intensive distributed applications based on the MapReduce framework. It allows distributed processing of large datasets across clusters of computers using simple programming models, and is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Started in 2005. Currently a registered trademark of the Apache Software Foundation.

HDFS Overview: HDFS is Hadoop's distributed file system; the name stands for Hadoop Distributed Filesystem. It is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Complex issues:
- Chances of data loss due to machine failures.
- Complication of network programming, because it is a network-based file system.

HDFS Master/Slave architecture. Source: [Hadoop Guru]

NameNode Block Placement:
- One replica on the local node, the second replica on a remote rack, the third replica on the same remote rack; additional replicas (if replication factor > 3) are placed randomly.
- Clients read from the nearest replica.
NameNode Failure:
- The NameNode is a single point of failure.
- Transaction logs (EditLogs) are stored in multiple directories: a directory on the local file system, and a directory on a remote file system (NFS/CIFS).
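Block placement is visible from the client side. The following minimal sketch (not from the slides) uses Hadoop's Java FileSystem API to request a replication factor for a file and print which DataNodes hold each of its blocks; the NameNode URI and file path are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacementProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt"); // hypothetical file
        fs.setReplication(file, (short) 3);       // ask for 3 replicas per block

        // List the DataNodes holding each block of the file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```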

NameNode EditLogs:
- The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata.
- The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage.
- FsImage and EditLog are stored as files in the NameNode's local file system.
Secondary NameNode:
- Copies or merges the FsImage and transaction log from the NameNode to a temporary directory.
- Uploads the new FsImage to the NameNode, after which the transaction log on the NameNode is purged.

Heartbeats:
- DataNodes send heartbeats to the NameNode every 3 seconds.
- The NameNode uses heartbeats to detect DataNode failure.
Replication Engine:
- The NameNode detects DataNode failures.
- The NameNode chooses new DataNodes for new replicas, balances disk usage, and balances communication traffic to DataNodes.

Storing & querying data in HDFS. Source: [Hadoop Guru]

Good use cases for HDFS:
- Storing large datasets, which may be TBs or PBs or even more.
- Storing different varieties of data: structured, unstructured, or semi-structured.
- Storing data on commodity hardware (economical).
Bad use cases for HDFS:
- Low-latency data access.
- Huge numbers of small files.
- Random file access.
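As a concrete illustration of the access pattern HDFS favors (large files, written once and read sequentially), here is a hedged sketch of writing a file and streaming it back through the Java FileSystem API; the cluster URI and path are assumptions for the example, not values from the slides.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");

        // Write: HDFS files are write-once; create() streams bytes out to DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: sequential, streaming access is the pattern HDFS is built for.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```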

What is MapReduce? MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware, in a reliable manner. Requirements:
- Provide a simple and powerful programming model: the map and reduce paradigm.
- Use large clusters of commodity machines: scale horizontally instead of vertically.
- Isolate the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance: through redundancy and re-execution.

Programming model: The computation takes a set of key/value pairs as input and produces a set of key/value pairs as output. The user of the framework expresses the computation using two functions: Map and Reduce. The Map function takes an input pair and produces a set of intermediate key/value pairs. The framework groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function receives an intermediate key I with its set of values and merges them together.
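Since the slides describe Map and Reduce only in prose here, a minimal word-count sketch in Hadoop's Java API may help; this is an illustration of the two functions, not the deck's own example code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: one input line in, one intermediate (word, 1) pair out per token.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: all counts for one intermediate key arrive together; merge them into a sum.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```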

Execution details Source: [Jeffrey Dean and Sanjay Ghemawat]

1. Input data is split into M pieces and many instances of the program are started.
2. One of the instances is the master copy, while the rest are workers. In particular, there are M map tasks and R reduce tasks to assign.
3. A worker who is assigned a map task processes the contents of the corresponding input split, generates key/value pairs from the input data, and passes each pair to the user-defined Map function.
4. Periodically, the buffered pairs are written to local disk and partitioned into R regions by the partitioning function.
5. A reduce worker reads the buffered data from the local disks of the map workers; the data is then sorted by the intermediate keys so that all occurrences of the same key are grouped together.

6. The reduce worker passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master returns control to the user program.
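A hedged sketch of the driver that kicks off this flow in Hadoop's Java API, wiring the word-count Mapper and Reducer from the earlier sketch to input/output paths supplied on the command line; class names and the reduce-task count are illustrative, not from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class); // from the earlier sketch
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setNumReduceTasks(2);                  // R = 2 reduce partitions

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));  // splits -> M map tasks
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and block until the master reports all tasks complete.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```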

Hadoop Cluster. Source: [Edureka!]

JobTracker:
- Responsible for taking in requests from a client and assigning TaskTrackers tasks to be performed.
- Tries to assign tasks to the TaskTracker on the DataNode where the data is locally present.
- If that node fails, assigns the task to another TaskTracker where a replica of the data exists.
TaskTracker:
- Accepts tasks (Map, Reduce, and Shuffle) from the JobTracker.
- Sends a heartbeat message periodically to the JobTracker to notify that it is alive; also reports the free slots available within it to process tasks.
- Starts and monitors the tasks, and sends progress/status information back to the JobTracker.

Example Code Source: [Pietro Michiardi]

Example Execution Source: [Xiaochong Zhang]

Key Features of Hadoop:
- Low-cost, unreliable commodity hardware
- Extremely scalable
- Fault tolerant
- Easy to administer
- Highly parallel

JobTracker-related issues in Hadoop 1.0:
- Limits scalability: the JobTracker is responsible for all of resource management, job and task scheduling, and monitoring.
- Availability: if the JobTracker fails, all jobs must restart.
- Problem with resource utilization: DataNodes may be reserved for Reduce slots even when there is immediate need for those resources to be used as Mapper slots.
- Limitation in running non-MapReduce applications: problems performing real-time analysis, running a message-passing approach, and running ad-hoc queries.

Source: [Saphana Tutorial]

Improvements with YARN:
- YARN makes efficient utilization of resources; there are no more fixed map-reduce slots.
- Can now run multiple applications in Hadoop, all sharing a common resource manager.
- Can even run applications that do not follow the MapReduce model.
- Backward compatibility.
- No more JobTracker and TaskTracker needed in Hadoop 2.0. Instead, there are two daemons: the ResourceManager and the node-specific NodeManager.

What is HBase? HBase is a distributed, column-oriented database built on top of HDFS.
- Horizontally scalable.
- Its data model is similar to Google's BigTable.
- Designed to provide quick random access to huge amounts of structured data.
- Leverages the fault tolerance provided by HDFS.
- Provides random, real-time read/write access to data in Hadoop.

Reading/Writing Data. Source: [Tutorials Point]. One can store data in HDFS either directly or through HBase. A data consumer reads/accesses the data in HDFS randomly using HBase; HBase sits on top of HDFS and provides read and write access.
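As a hedged sketch of that read/write path, the snippet below uses the HBase Java client to put and get a single cell. The table and column-family names are invented for the example, and the table is assumed to already exist on a cluster described by hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random read of a single row: the lookup HDFS alone cannot do quickly.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```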

HBase vs HDFS:
HDFS:
- A distributed file system suitable for storing large files.
- Does not support fast individual record lookups.
- Provides high-latency batch processing; no concept of batch processing.
- Provides only sequential access of data.
HBase:
- A database built on top of HDFS.
- Provides fast lookups for larger tables.
- Provides low-latency access to single rows from billions of records (random access).
- Internally uses hash tables and provides random access, and it stores the data in indexed files for faster lookups.

What is Hive? Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy. Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.

Features of Hive:
- It stores schema in a database and processed data into HDFS.
- It provides an SQL-type language for querying, called HiveQL or HQL.
- It is familiar, fast, scalable, and extensible.
- It is not a relational database, and not a language for real-time queries and row-level updates.
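Since Hive exposes HiveQL through a standard JDBC driver, a hedged Java sketch of issuing a query might look like the following; the HiveServer2 address, user, database, and table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Hive's JDBC driver for HiveServer2; host/port/user are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "demo", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; Hive compiles it into jobs over data in HDFS.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```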

Source: [Tutorials Point]

What is Pig? Pig is used to analyze large sets of data, representing them as data flows. All the data manipulation operations in Hadoop can be performed using Pig. It provides a high-level language known as Pig Latin, and various operators with which programmers can develop their own functions for reading, writing, and processing data. It was originally created at Yahoo.

Features of Pig:
- Pig Latin is a procedural language.
- The data model in Apache Pig is nested relational.
- Allows splits in the pipeline.
- Allows developers to store data anywhere in the pipeline.
- Provides operators to perform ETL (Extract, Transform, and Load) functions.
- Provides limited opportunity for query optimization.
- Schema is optional: we can store data without designing a schema.
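To show what a Pig Latin data flow looks like when driven from Java, here is a hedged sketch using Pig's embedded PigServer API in local mode; the input file and field layout are invented for the example.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlow {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; MAPREDUCE mode would run on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A small Pig Latin pipeline: load, filter, group, aggregate.
        pig.registerQuery("logs = LOAD 'access.log' AS (user:chararray, bytes:long);");
        pig.registerQuery("big  = FILTER logs BY bytes > 1024;");
        pig.registerQuery("grp  = GROUP big BY user;");
        pig.registerQuery("tot  = FOREACH grp GENERATE group, SUM(big.bytes);");

        // Materialize the final relation; each step above was only declared.
        pig.store("tot", "bytes_by_user");
        pig.shutdown();
    }
}
```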

Source: [Tutorials Point]

Pig vs Hive:
Pig:
- Apache Pig uses a language called Pig Latin.
- Pig Latin is a data-flow language.
- Pig Latin is a procedural language, and it fits the pipeline paradigm.
- Apache Pig can handle structured, unstructured, and semi-structured data.
Hive:
- Hive uses a language called HiveQL.
- HiveQL is a query-processing language.
- HiveQL is a declarative language.
- Hive is mostly for structured data.

Spark Requirements:
- Support applications that need to reuse a working set of data across multiple parallel operations, while retaining the scalability and fault tolerance of MapReduce: iterative machine learning algorithms, interactive data analytics.
- Be compatible with Hadoop, HDFS, and any storage system.
Source: [Lisa Hua]

Spark Ecosystem. Source: [Lisa Hua]

Resilient Distributed Datasets (RDDs). What are RDDs?
- Read-only collections of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
- Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.
- RDDs achieve fault tolerance through a notion of lineage: each RDD object contains a pointer to its parent and information about how the parent was transformed. Hence, if a partition of an RDD is lost, the RDD has sufficient information about how it was derived from other RDDs to rebuild just that partition.

Construction of RDDs:
- From a file in a shared file system.
- By parallelizing a collection (e.g., an array).
- By transforming an existing RDD.
- By changing the persistence (by cache or save) of an existing RDD.
Parallel operations on RDDs:
- The reduce operation combines dataset elements using an associative function to produce a result.
- The collect operation sends all elements of the dataset to the driver program.
- The foreach operation passes each element through a user-provided function.
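A hedged sketch of these pieces with Spark's Java API: build an RDD by parallelizing a collection, cache it, and reuse it across two parallel operations. The app name, master setting, and numbers are illustrative, not from the slides.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {

            // Construct an RDD by parallelizing a collection, then cache it
            // so both actions below reuse the same in-memory working set.
            JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).cache();

            // reduce: combine elements with an associative function.
            int sum = nums.reduce(Integer::sum);

            // collect: ship all elements back to the driver program.
            System.out.println("sum = " + sum + ", elems = " + nums.collect());
        }
    }
}
```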

Spark Example Code. Source: [Lisa Hua]

Spark vs MapReduce. Source: [Tudor Lapusan]

Hadoop providers:
- Amazon Web Services: https://aws.amazon.com/elasticmapreduce/
- Cloudera: https://cloudera.com/products/apache-hadoop.html
- Hortonworks: http://hortonworks.com/hdp/
- IBM: http://www.ibm.com/analytics/us/en/technology/hadoop/
- MapR: https://www.mapr.com/products/apache-hadoop

References:
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. http://static.googleusercontent.com/media/research.google.com/zh-cn/us/archive/mapreduce-osdi04.pdf
- Lisa Hua, Spark Overview. http://web.cse.ohio-state.edu/~panda/5194/papers/4o_spark_overview.pdf
- Saphana Tutorial, How YARN Overcomes MapReduce Limitations in Hadoop 2.0. http://saphanatutorial.com/how-yarn-overcomes-mapreduce-limitations-in-hadoop-2-0/
- Tutorials Point, HBase Overview. http://www.tutorialspoint.com/hbase/hbase_overview.htm

- Hadoop Guru, Hadoop Distributed File System (HDFS). http://hadoopguru.blogspot.com/2013/02/hadoop-distributed-file-system-hdfs.html
- Core Servlets, Hadoop Tutorial. http://www.coreservlets.com/hadoop-tutorial/
- Tutorials Point, Apache Pig Overview. http://www.tutorialspoint.com/apache_pig/apache_pig_overview.htm
- Tutorials Point, Hive Introduction. http://www.tutorialspoint.com/hive/hive_introduction.htm

- Tutorials Point, Hadoop Big Data Overview. http://www.tutorialspoint.com/hadoop/hadoop_big_data_overview.htm
- Xiaochong Zhang, MapReduce Work Structure. http://xiaochongzhang.me/blog/wp-content/uploads/2013/05/_Work_Structure.png
- Tudor Lapusan, MapReduce vs Spark. http://www.slideshare.net/tudorlapusan/map-reduce-vs-spark
- Pietro Michiardi, Scalable Algorithm Design with MapReduce. http://www.slideshare.net/michiard/scalable-algorithm-design-with-mapreduce

- Edureka!, Hadoop Cluster. http://www.slideshare.net/edurekain/hadoop-20-architecture-hdfs-federation-namenode-high-availability

The End